Hpricot problem

nock12 · December 17, 2006, 10:54am

Not sure where to send this, sorry if it’s not the right place…

The HTML in the attached file renders ‘correctly’ in the three browsers I
have tried, but it trips up Hpricot because of the second, malformed
comment. By ‘correctly’ I mean that I get to see ‘Some text’. I
guess it could be argued that this is incorrect. For my application
it would be nice if Hpricot behaved like a browser.

Henry

nock12 · December 17, 2006, 5:18pm

On 12/17/06, Henry M. wrote:

I have found that saved pages from Firefox are different than the HTML
Hpricot uses.

You can also check out the tickets for hpricot at
http://code.whytheluckystiff.net/hpricot/report/1

Stephen B. IV

nock12 · December 17, 2006, 5:56pm

I have found that saved pages from Firefox are different than the HTML
Hpricot uses.

Of course they are. If you save a page from Firefox (or IE, or any other
browser), the page gets saved as is (i.e. as the author put it on the
web server, with all the errors and the non-conforming, unclosed or
otherwise malformed tags).

Now, what happens when a browser renders the page? It builds a Document
Object Model (DOM) out of the HTML and renders the DOM. This DOM
follows strict rules (i.e. no wild-wild-west HTML crapfest). If you
dumped it to a file as XML, you would get a well-formed XHTML page
that resembles the original HTML as closely as the browser’s
DOM-building rules allow (generally very close to the standards in
the case of Mozilla and Opera, crappy in the case of IE, at least before
6; I am not sure about 7).

What Hpricot does is very similar: it builds a DOM of the HTML. (I am
not sure whether _why calls this a DOM, but it is an internal
representation of the underlying HTML.) Of course Mozilla DOM != Hpricot
DOM (!= IE DOM != Opera DOM, et cetera), so you can’t make
assumptions about what Hpricot does based on what Mozilla does.
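
To make that concrete, here is a minimal Ruby sketch of two parsers disagreeing about the same malformed comment. It uses neither Hpricot nor any browser's real algorithm. The sample markup is invented (the original attachment is not available), and both helper names are hypothetical. The "lenient" rule, ending an unterminated comment at the next ">", is only an assumption about one way a forgiving parser might recover.

```ruby
# Strict rule: a comment runs until "-->". If that never appears,
# the comment swallows the rest of the document.
def strip_comments_strict(html)
  html.gsub(/<!--.*?(?:-->|\z)/m, '')
end

# Lenient rule (an assumption, not a documented browser algorithm):
# a comment with no "-->" anywhere after it ends at the next ">".
def strip_comments_lenient(html)
  out = String.new
  pos = 0
  while (start = html.index('<!--', pos))
    out << html[pos...start]
    close = html.index('-->', start + 4)
    if close
      pos = close + 3                       # properly closed comment
    else
      gt = html.index('>', start + 4)       # recover at the next ">"
      pos = gt ? gt + 1 : html.length
    end
  end
  out << html[pos..-1]
end

# Invented sample: the second comment is never closed with "-->".
html = '<p>A</p><!-- ok --><p>B</p><!-- broken ><p>Some text</p>'
puts strip_comments_strict(html)   # "Some text" is lost
puts strip_comments_lenient(html)  # "Some text" survives
```

Same input, two internal representations: which one is "right" depends on the recovery rule the parser picked.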

If you want Mozilla to parse your page and return the DOM (or serialize
it to XML so you can feed it to an XML/XSLT/XPath engine), I can show
you how, but only in Java; unfortunately Ruby’s tools are not there
yet.

Or, you can use Hpricot and forget about how it works everywhere
else…

Cheers
Peter

__
http://www.rubyrailways.com

nock12 · December 18, 2006, 7:14am

On Mon, Dec 18, 2006 at 01:55:26AM +0900, Peter S. wrote:

I have found that saved pages from Firefox are different than the HTML
Hpricot uses.

What Hpricot does is very similar: it builds a DOM of the HTML. (I am
not sure whether _why calls this a DOM, but it is an internal
representation of the underlying HTML.) Of course Mozilla DOM != Hpricot
DOM (!= IE DOM != Opera DOM, et cetera), so you can’t make
assumptions about what Hpricot does based on what Mozilla does.

Well, but, I’d actually like to get Hpricot’s parser to be close to
Firefox’s.
So, what I’m saying is: if Hpricot appears to read HTML differently from
Firefox, I’d say that’s a bug. Yep, for sure it is.

_why

nock12 · December 18, 2006, 10:22am

Well, but, I’d actually like to get Hpricot’s parser to be close to Firefox’s.

Wow. Wow. Wow.
I thought you couldn’t possibly tell me anything new about Hpricot
that would make me a significantly bigger fanboy, but once again I was
proven wrong.

Keep up the great work.

Cheers,
Peter

btw. What about the XPath indices? Have you decided on indexing from 0?
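
For context on why the question matters: in XPath 1.0, positions start at 1 (so `//li[1]` is the first `li`), while Ruby arrays start at 0. The sketch below only illustrates the off-by-one between the two conventions. The helper name is hypothetical and says nothing about which convention Hpricot ends up using.

```ruby
# Convert a 1-based XPath position into a 0-based Ruby array index.
def xpath_position_to_index(position)
  raise ArgumentError, 'XPath positions start at 1' if position < 1
  position - 1
end

items = %w[first second third]
puts items[xpath_position_to_index(1)]  # the element XPath calls [1]
```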
__
http://www.rubyrailways.com