HTML DOM

tabassman · June 23, 2009, 5:55pm

Hi,

I’m trying to build an HTML page indexer in Ruby and I’d like to be able
to use DOM and/or XPath on a document. The application is currently
using REXML, but that seems to be a bit too strict and any deviation
from XML causes the engine to throw an error and quit.

Is there a way to make REXML more permissive or is there another library
that does HTML DOM and XPath?
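
For anyone who wants to reproduce the strictness described above, here is a minimal sketch. `rexml_parses?` is a hypothetical helper written for this illustration, not part of REXML; it only assumes that REXML raises `REXML::ParseException` on input that is not well-formed XML, such as an unclosed `<br>`.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): returns true if REXML
# can parse the markup, false if it raises a parse error.
def rexml_parses?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

well_formed = '<p>Hello<br/></p>'  # <br/> is self-closed, valid XML
tag_soup    = '<p>Hello<br></p>'   # <br> is never closed, common in HTML

puts rexml_parses?(well_formed)
puts rexml_parses?(tag_soup)
```

Rescuing the exception lets an indexer skip or log a bad page instead of quitting, but it does not make REXML accept the page.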

tabassman · June 23, 2009, 7:16pm

On 23.06.2009 17:55, Victor T. wrote:

I’m trying to build an HTML page indexer in Ruby and I’d like to be able
to use DOM and/or XPath on a document. The application is currently
using REXML, but that seems to be a bit too strict and any deviation
from XML causes the engine to throw an error and quit.

Is there a way to make REXML more permissive

No.

or is there another library
that does HTML DOM and XPath?

Nokogiri and Hpricot seem to be the most popular.

Cheers

robert

tabassman · June 23, 2009, 7:26pm

On Jun 23, 8:55 am, Victor T. wrote:

Hi,

I’m trying to build an HTML page indexer in Ruby and I’d like to be able
to use DOM and/or XPath on a document. The application is currently
using REXML

Yes, REXML can be awkward if you’re used to using the DOM. IMHO.

Is there a way to make REXML more permissive or is there another library

There are libxml bindings for Ruby, but I recall that library missing
getElementsByTagName and getElementById. Though it does have a method
to query the DOM via XPath.

Have you tried using REXML’s SAX2 parser? I think it would be better
suited for your problem.
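
For reference, a minimal sketch of what event-based parsing with REXML’s SAX2 parser might look like. `collect_links` is a hypothetical helper, not part of REXML, and the sample input is assumed to be well-formed, since the SAX2 parser reads the same XML syntax as the rest of REXML.

```ruby
require 'rexml/parsers/sax2parser'

# Hypothetical helper (not part of REXML): collect the href attribute
# of every <a> element by listening to start-element events.
def collect_links(xhtml)
  links = []
  parser = REXML::Parsers::SAX2Parser.new(xhtml)
  parser.listen(:start_element) do |_uri, _localname, qname, attributes|
    next unless qname == 'a'
    href = attributes['href']
    links << href if href
  end
  parser.parse
  links
end

p collect_links('<p><a href="/a">A</a> and <a href="/b">B</a></p>')
```

No tree is built here; the block is simply called once per opening tag as the parser streams through the input.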

-Skye

tabassman · June 23, 2009, 10:51pm

On 23.06.2009 19:24, Skye Shaw!@#$ wrote:

On Jun 23, 8:55 am, Victor T. wrote:

Hi,

I’m trying to build an HTML page indexer in Ruby and I’d like to be able
to use DOM and/or XPath on a document. The application is currently
using REXML

Yes, REXML can be awkward if you’re used to using the DOM. IMHO.

Why do you say that? REXML provides an XML DOM in similar ways as other
XML libs. You can even use XPath queries.

Is there a way to make REXML more permissive or is there another library

There are libxml bindings for Ruby, but I recall that library missing
getElementsByTagName and getElementById. Though it does have a method
to query the DOM via XPath.

libxml won’t help as Victor is not processing XML.

Have you tried using REXML’s SAX2 parser? I think it would be better
suited for your problem.

No, his problem is that he used an XML tool to process HTML. While many
web pages are valid XML, not all are, due to the history of browser
development. Thus it’s better to use a tool suited to the job, i.e. one
capable of parsing HTML which is not valid XML.

Kind regards

robert

tabassman · June 23, 2009, 11:48pm

On Wed, Jun 24, 2009 at 05:50:47AM +0900, Robert K. wrote:

Have you tried using REXML’s SAX2 parser? I think it would be better
suited for your problem.

No, his problem is that he used an XML tool to process HTML. While many
web pages are valid XML, not all are, due to the history of browser
development. Thus it’s better to use a tool suited to the job, i.e. one
capable of parsing HTML which is not valid XML.

The libxml2 C library has contained a correcting HTML processor since
its first release in April 2000. libxml2 is quite capable of processing
broken HTML.

libxml-ruby and Nokogiri both provide a Ruby API for libxml2.

tabassman · June 23, 2009, 7:59pm

Thanks for the help. I’ve decided to go for Hpricot and it works rather
well now. Don’t know why, but for some reason I was reluctant to go for
that. Anyway it’s great… I love it. It feels like jQuery.

tabassman · June 24, 2009, 12:33am

On Jun 23, 4:48 pm, Robert K. wrote:

libxml won’t help as Victor is not processing XML.

Whoa… and right after you recommended the libxml-based Nokogiri.

I have been using libxml2 (in various forms) for years to parse HTML.
I find it to be the best HTML parser out there. It’s also completely
XPath 1.0 compliant–my XPaths tend to break in Hpricot.

Both libxml-ruby and Nokogiri have similar functionality. I like the
Nokogiri API a little better.

– Mark.

tabassman · June 24, 2009, 7:45am

On 24.06.2009 00:25, Mark T. wrote:

On Jun 23, 4:48 pm, Robert K. wrote:

libxml won’t help as Victor is not processing XML.

Whoa… and right after you recommended the libxml-based Nokogiri.

:-} Sorry, I did not know that Nokogiri was based on libxml. Thanks to
you and Aaron for the update! Skye seemed to suggest XML tools only,
which are clearly not suited for the job. I’ll shut up now.

Kind regards

robert

tabassman · June 25, 2009, 8:50am

On Jun 23, 1:48 pm, Robert K. wrote:

Yes, REXML can be awkward if you’re used to using the DOM. IMHO.

Why do you say that? REXML provides an XML DOM in similar ways as other
XML libs. You can even use XPath queries.

Not sure what you mean by similar. Similar in that there is a tree of
elements that can be manipulated, but not similar to anything called
DOM.

In REXML, an Element is an REXML::Element; which is a REXML::Parent
which is a REXML::Child (huh?) which includes REXML::Node.
There is no NodeList, createTextNode(), getElementById(), etc…

To get an element by its ID, I’d have to say something like:

my_document.root.elements.each("//*[@id='crap']") do |element|
  # do something with crap
end

I would have liked to be able to use the DOM when using REXML;
unfortunately REXML doesn’t really support it.
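
A rough sketch of how one might paper over the missing call with a small wrapper around REXML’s XPath support. `get_element_by_id` is a hypothetical helper named after the DOM method, not something REXML provides, and it assumes the id contains no single quote because the value is interpolated into the XPath string.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): emulate DOM getElementById
# with an XPath lookup. Returns the first matching element or nil.
def get_element_by_id(doc, id)
  # The id is interpolated into a quoted XPath literal, so reject quotes.
  raise ArgumentError, 'id must not contain a single quote' if id.include?("'")
  REXML::XPath.first(doc, "//*[@id='#{id}']")
end

doc = REXML::Document.new(
  '<html><body><div id="crap">hello</div><div id="other"/></body></html>'
)
element = get_element_by_id(doc, 'crap')
puts element.text if element
```

This only hides the XPath behind a familiar name; it still needs a document that REXML was able to parse in the first place.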

Is there a way to make REXML more permissive or is there another library

There are libxml bindings for Ruby, but I recall that library missing
getElementsByTagName and getElementById. Though it does have a method
to query the DOM via XPath.

libxml won’t help as Victor is not processing XML.

That should be fine.

Have you tried using REXML’s SAX2 parser? I think it would be better
suited for your problem.

No, his problem is that he used an XML tool to process HTML.

You’re right. He should never have been using REXML.

-Skye