Hpricot/Rubyful Soup comparison

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

Thanks,
Wes

Wes G. wrote:

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and
I did not do any benchmarks, but I am scraping a lot of relatively big
pages on a daily basis and I can tell you that RubyfulSoup is orders of
magnitude slower than Hpricot. I am absolutely sure about this. I am
doing things with Hpricot which should be extremely slow (e.g.
traversing the whole tree and doing expensive operations on all
Hpricot::Elements), yet Hpricot is surprisingly fast. Rubyful Soup is
nowhere near.

b) preserves the original HTML better.
Hmm, this I don’t know, but I guess the term ‘preserves HTML better’
should be defined first with some metric (deviation from the HTML
standard?). There are a lot of HTML pages so badly formed that even a
human would come up with multiple ways to correct them.

I think the only real-life quality measure is to process your pages with
both of them and see which one yields better results. I have not played
much with RubyfulSoup, but I am writing a fairly serious screen scraping
framework based on Hpricot, and so far I have had no real problems, and
I am doing all kinds of weird things.

Cheers,
Peter

__
http://www.rubyrailways.com

On 11/22/06, Peter S. [email protected] wrote:

I did not do any benchmarks, but I am scraping a lot of relatively big
pages on a daily basis and I can tell you, RubyfulSoup is magnitudes
slower than HPricot.

HPricot is partially written in C, so it should be faster than a
pure-Ruby lib like RubyfulSoup.

Also, RubyfulSoup aims to be very resilient to malformed markup, so it
must resort to heuristics that have a performance cost. I don’t know
how Hpricot handles HTML or XML with really serious flaws like tags
that open but never close and so on, but in my experience RubyfulSoup
has managed to deal amazingly well with such problems. If you need to
parse low quality markup, the performance penalty of RubyfulSoup may
be well worth the price.

Cheers,

Luciano

Luciano R. wrote:

On 11/22/06, Peter S. [email protected] wrote:

I did not do any benchmarks, but I am scraping a lot of relatively big
pages on a daily basis and I can tell you, RubyfulSoup is magnitudes
slower than HPricot.

HPricot is partially written in C, so it should be faster than a
pure-Ruby lib like RubyfulSoup.
true

Also, RubyfulSoup aims to be very resilient to malformed markup,
So does Hpricot. Hpricot is not just an HTML parser which can parse
(relatively) valid HTML - it can parse any HTML ‘somehow’. We can argue
whether Hpricot’s ‘somehow’ is better or worse than RubyfulSoup’s, but
it is a fact that Hpricot handles malformed pages very well.

so it

must resort to heuristics that have a performance cost. I don’t know
how Hpricot handles HTML or XML with really serious flaws like tags
that open but never close and so on,
That particular case is handled fine. Maybe we would need a list of
serious problems to see how Hpricot and RubyfulSoup handle each of them.
From what I have seen, Hpricot has not had problems with any page…

has managed to deal amazingly well with such problems. If you need to
parse low quality markup, the performance penalty of RubyfulSoup may
be well worth the price.
I am still not sure what the added benefits of RubyfulSoup parsing over
Hpricot are (although I am not claiming that there are none) - I would
like to see a real, serious comparison to decide this…

Peter

__
http://www.rubyrailways.com

On 21-Nov-06, at 5:27 PM, Wes G. wrote:

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

I switched from Rubyful Soup to Hpricot a while ago. The reason was
performance on 1000-2000 character html chunks – I didn’t do a
benchmark because there just was no need to… Hpricot is a lot
faster.

I have no idea which preserves HTML better; I’m only using them to
find specific bits of the HTML (e.g. links, images, a few other
things). I do not use either to transform the input HTML; I always
keep the input as it was. In all cases I have HTML in a string that I
give to the parser. I do know that with Rubyful Soup it was
absolutely necessary to dup the string first, or you were liable to
have changes made to the input string.
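
The defensive copy described above can be sketched in plain Ruby. The
destructive_parse method here is a hypothetical stand-in for any parser
that edits its argument in place; it is not Rubyful Soup's API and only
mimics the side effect.

```ruby
# Hypothetical stand-in for a parser that modifies its input in place.
# It is NOT Rubyful Soup's API; it only mimics the side effect.
def destructive_parse(html)
  html.gsub!(/<br>/i, '<br />') # in-place edit of the string it was given
  html.scan(/<a\s+href="([^"]*)"/i).flatten
end

# Hand the parser a copy so the caller's string is never touched.
def parse_without_side_effects(html)
  destructive_parse(html.dup)
end

original = 'line one<br><a href="http://example.com/">link</a>'
links = parse_without_side_effects(original)
```

After this runs, links holds the extracted href values and original
still contains the bare <br> it started with.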

Cheers,
Bob


Bob H. – blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. – http://www.recursive.ca/
Raconteur – http://www.raconteur.info/
xampl for Ruby – http://rubyforge.org/projects/xampl/

On 11/22/06, Peter S. [email protected] wrote:

Also, RubyfulSoup aims to be very resilient to malformed markup,
So does Hpricot. Hpricot is not just an HTML parser which can parse
(relatively) valid HTML - it can parse any HTML ‘somehow’. We can argue
whether Hpricot’s ‘somehow’ is better or worse than RubyfulSoup’s, but
it is a fact that Hpricot handles malformed pages very well.

Thanks for the input, Peter. From your opinion and others’, it seems
Hpricot is the best option. Coming from Python, I was used to
BeautifulSoup, from which RubyfulSoup is derived, and I was very happy
with it. But if we can have the same benefits with better performance,
then it’s a no-brainer!

Cheers,

Luciano

Thanks for all of the comments.

I was pretty sure that Hpricot was faster since it is partially written
in C, but it’s nice to hear a resounding “YES” on that topic.

My concern about “preserving original markup” has to do with an
application I’m writing, which grabs a page and then tries to display
it. When RubyfulSoup encountered bad HTML, it could parse it OK, but it
always attempted to fix it when I went to write out the parse tree,
which can cause problems when you try to redisplay the HTML.

Some malformed HTML is handled fine by browsers, so I’d like to preserve
the original HTML regardless of its quality. If Hpricot will not only
parse my HTML quickly, but also not fix the HTML on the way out (dumping
the parse tree), that would be ideal.
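
One rough way to pin down "preserves the original HTML" is a round-trip
check: parse, serialize, and compare with the input. The sketch below
assumes nothing about any particular library; the two lambdas are
hypothetical stand-ins for a parser's parse-then-dump step.

```ruby
# Returns true when the block's output reproduces the input byte for byte.
# The block is any callable that parses the HTML and serializes it again.
def round_trip_preserved?(html)
  yield(html.dup) == html
end

# Two hypothetical serializers standing in for real parsers:
verbatim = ->(html) { html }                         # leaves markup untouched
fixer    = ->(html) { html.gsub(/<br>/i, '<br />') } # "repairs" bare <br> tags

page = '<p>one<br>two'
preserved_verbatim = round_trip_preserved?(page, &verbatim)
preserved_fixer    = round_trip_preserved?(page, &fixer)
```

With a real parser, the block would wrap that library's own parse and
output calls, and the check could be run over a sample of saved pages.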

Again, thanks for all of the discussion - it’s quite helpful.

Wes

On Thu, Nov 23, 2006 at 12:28:30AM +0900, Wes G. wrote:

My concern about “preserving original markup” has to do with this
application I’m writing, which grabs a page and then tries to display
it. When RubyfulSoup would encounter bad HTML, it could parse it ok,
but it always attempts to fix it when I went to write the parse tree.
Which can cause problems when you try to redisplay the HTML.

I totally agree with you regarding preserving the original markup. In
fact, the latest Hpricot code (in subversion) has two methods for output:

  • to_html, which outputs fully closed tags and strips out bogus end
    tags.
  • to_original_html, which outputs the original document (as close as
    it can) with your modifications made.

So, for example, I use the to_original_html method in MouseHole, which
is a scriptable personal HTTP proxy (sort of like Greasemonkey). Some
pages (like Boing Boing, for instance) completely break if you try to
fix up the HTML. But this new method can successfully remove stuff and
alter stuff without turning the whole page upside-down.

_why

On Wed, Nov 22, 2006 at 09:40:33PM +0900, Gregory S. wrote:

I do find it a little annoying that HPricot will always produce an
open/close pair even if the input was self-closing (e.g. ) unless
the tag is known to be an empty tag by HPricot (see
Hpricot::ElementContent).

Mmm. Okay, good point. So if a tag comes in as self-closing, keep it
that way? I think that’s reasonable.

_why

On Wed, Nov 22, 2006 at 08:03:54PM +0900, Peter S. wrote:
} Luciano R. wrote:
} > On 11/22/06, Peter S. [email protected] wrote:
[…]
} > Also, RubyfulSoup aims to be very resilient to malformed markup,
} So does Hpricot. Hpricot is not just an HTML parser which can parse
} (relatively) valid HTML - it can parse any HTML ‘somehow’. We can argue
} whether Hpricot’s ‘somehow’ is better or worse than RubyfulSoup’s, but
} it is a fact that Hpricot handles malformed pages very well.
}
} > so it must resort to heuristics that have a performance cost. I don’t
} > know how Hpricot handles HTML or XML with really serious flaws like
} > tags that open but never close and so on,
} That particular case is handled fine. Maybe we would need a list of
} serious problems to see how Hpricot and RubyfulSoup handle each of them.
} From what I have seen, Hpricot has not had problems with any page…

Hpricot even keeps track of when tags are (incorrectly) closed by a
different close tag. This can allow you to track down issues in broken
HTML if that’s your intent, but since I am mostly using Hpricot for
sanitization I just set the close tags to nil so the output closes with
the correct tag. I do find it a little annoying that Hpricot will always
produce an open/close pair even if the input was self-closing (e.g. )
unless the tag is known to be an empty tag by Hpricot (see
Hpricot::ElementContent).

} > has managed to deal amazingly well with such problems. If you need to
} > parse low quality markup, the performance penalty of RubyfulSoup may
} > be well worth the price.
} I am still not sure what the added benefits of RubyfulSoup parsing over
} Hpricot are (although I am not claiming that there are none) - I would
} like to see a real, serious comparison to decide this…

I haven’t tried RubyfulSoup, but Hpricot suits my needs nicely. I am
delighted by its reliance on a bare minimum of Hpricot-specific objects.
It doesn’t try to behave like a real DOM, which means that it can use
arrays for child lists and ordinary references for parent nodes and
hashes for attributes, all read/write. It is possible to perform
significant transformations with minimal difficulty.
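
To illustrate the design described above (arrays for children, a plain
reference for the parent, a hash for attributes), here is a minimal
sketch in plain Ruby. The Node struct and the element helper are made up
for this example; they are not Hpricot's classes or API.

```ruby
# Illustration only: a tiny tree built from ordinary Ruby objects.
# Node and element are hypothetical names, NOT Hpricot's classes.
Node = Struct.new(:name, :attributes, :children, :parent) do
  # Append a child and point its parent reference back at this node.
  def add(child)
    child.parent = self
    children << child
    child
  end
end

def element(name, attributes = {})
  Node.new(name, attributes, [], nil)
end

body = element('body')
para = body.add(element('p', { 'class' => 'old' }))
link = para.add(element('a', { 'href' => 'http://example.com/' }))

# Ordinary Hash and Array operations are enough to transform the tree.
para.attributes['class'] = 'new'
body.children.reject! { |child| child.name == 'script' }
hrefs = para.children.map { |child| child.attributes['href'] }.compact
```

Because everything is a plain Array, Hash, or object reference, no
special DOM methods are needed to rewrite attributes or prune children.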

} Peter
–Greg

_why wrote:

I totally agree with you regarding preserving the original markup. In
fact, the latest Hpricot code (in subversion) has two methods for output:

  • to_html, which outputs fully closed tags and strips out bogus end
    tags.
  • to_original_html, which outputs the original document (as close as
    it can) with your modifications made.

sweet.

Wes G. wrote:

_why wrote:

I totally agree with you regarding preserving the original markup. In
fact, the latest Hpricot code (in subversion) has two methods for output:

  • to_html, which outputs fully closed tags and strips out bogus end
    tags.
  • to_original_html, which outputs the original document (as close as
    it can) with your modifications made.

sweet.

Actually, I’m kind of hoping that I can make mods. to the parse tree,
but that no “unnecessary fixing” of bad HTML occurs.

So I’m wondering does modifying the parse tree at all and then
outputting it imply that all of the malformed HTML will be
fixed/modified in some way or not?

Thanks,
Wes

On Thu, Nov 23, 2006 at 03:57:51AM +0900, Wes G. wrote:

Actually, I’m kind of hoping that I can make mods. to the parse tree,
but that no “unnecessary fixing” of bad HTML occurs.

So I’m wondering does modifying the parse tree at all and then
outputting it imply that all of the malformed HTML will be
fixed/modified in some way or not?

With to_original_html, no malformed HTML is fixed.

require 'hpricot'
doc = Hpricot("

Paragraph one

Paragraph two with
some
tags in it <b etc.=>

")

(doc/:p).set('class', 'new')
puts doc.to_original_html

Paragraph one

Paragraph two with some tags in it

With to_html, Hpricot will line up all the tags.

_why

I recently wrote a scraper in Rubyful Soup and then rewrote it in
Hpricot. The Hpricot version was MUCH faster, had less code and was
easier to understand. I was a bit dubious of Hpricot initially
because of the ‘strange syntax’ but I am definitely sold now.

As for correctness, I can’t comment.

I’ve used both Hpricot and Rubyful Soup to parse the Google News page
and found Hpricot to be much faster.

Luis

I have, in late August, and at that time, we found that Rubyful Soup
was ten times slower than Hpricot and Mechanize.

On Tue, 21 Nov 2006 22:27:15 -0000, Wes G. [email protected] wrote:

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

I recently did a small head-to-head with RubyfulSoup, Hpricot, and the
up-and-coming (now in CVS, release in a few weeks) libxml-ruby binding
to the libxml2 HTML parser. Running against the RubyfulSoup homepage
(perhaps ironically, it’s pretty badly formed) over 100 iterations, the
attached benchmark gave the following results. Each benchmark parses
the original HTML and then gets back a specific node set (Hpricot and
libxml2 using XPath, RubyfulSoup using its own query API):

                               user     system      total        real
rubyful soup - simple      25.900000   0.710000  26.610000 ( 26.669350)
rubyful soup - trickier    26.220000   0.010000  26.230000 ( 26.252975)
hpricot - simple xpath      7.930000   0.000000   7.930000 (  7.950092)
hpricot - trickier xpath    8.200000   0.010000   8.210000 (  8.212230)
libxml2 - simple xpath      0.900000   0.000000   0.900000 (  0.899329)
libxml2 - trickier xpath    0.940000   0.000000   0.940000 (  1.217441)
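
The attached benchmark script is not reproduced in this thread. As a
rough sketch of how such a comparison can be set up with Ruby's standard
Benchmark module, the code below times two stand-in "parsers" over 100
iterations. The PARSERS table, the lambdas, and the sample HTML are all
assumptions made for illustration; in a real run each lambda would call
one of the libraries under test on the same page.

```ruby
require 'benchmark'

ITERATIONS = 100

# Hypothetical stand-ins. In a real comparison each lambda would call one
# library (Hpricot, Rubyful Soup, libxml-ruby) on the same document.
PARSERS = {
  'regexp scan'  => ->(html) { html.scan(/<a\b[^>]*>/i) },
  'string split' => ->(html) { html.split('<').select { |t| t.match?(/\Aa\b/i) } }
}

# Made-up sample input: 50 copies of a fragment with two links each.
html = '<p>intro <a href="/one">one</a> and <A href="/two">two</a>' * 50

# Benchmark.bm prints a user/system/total/real table and returns the
# timing objects (Benchmark::Tms), one per report, in order.
results = Benchmark.bm(14) do |bm|
  PARSERS.each do |label, parser|
    bm.report(label) { ITERATIONS.times { parser.call(html) } }
  end
end
```

Checking that every candidate returns the same node set before timing
it helps make sure the comparison is measuring equivalent work.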

In terms of preserving the original HTML, I found the libxml2 and
Hpricot parsers to be fairly even, with both doing a pretty good job of
fixing up broken HTML. There were minor differences in the XML produced,
and from a (biased, nitpicking) spec point of view I think libxml2’s
output is slightly more ‘proper’ (self-closing tags, etc). RubyfulSoup,
on the other hand, seemed to have a few inconsistencies: it would
occasionally lose tag attributes, and sometimes return varying results
for the same query.

As for feature support, well, I don’t want to rain on anyone’s parade,
but the libxml HTML parser outputs an XML::Document with which you can
transparently use all of libxml2’s (many) features ;) I couldn’t get
XPath functions to work with Hpricot, but then I’m not sure how complete
an XPath implementation it’s aiming for, and apart from that it seems
pretty solid. OTOH RubyfulSoup has no XPath support at all :(

On Sat, Nov 25, 2006 at 04:50:07AM +0900, Ross B. wrote:

In terms of preserving the original HTML, I found the libxml2 and Hpricot
parsers to be fairly even, with both doing pretty good job of fixing up
broken HTML.

Thanks, Ross, that was great. Libxml2 has HTML fixup stuff? That’s
sensational. Are the bindings pretty stable?

_why

On Sat, 2006-11-25 at 09:00 +0900, _why wrote:

On Sat, Nov 25, 2006 at 04:50:07AM +0900, Ross B. wrote:

In terms of preserving the original HTML, I found the libxml2 and Hpricot
parsers to be fairly even, with both doing pretty good job of fixing up
broken HTML.

Thanks, Ross, that was great. Libxml2 has HTML fixup stuff? That’s
sensational. Are the bindings pretty stable?

It surely does: libxml2’s HTMLparser module is an interface for an HTML
4.0 non-verifying parser. It’s a new addition to the bindings (still in
CVS) but it’s really ‘just another parser’ and uses the same (reasonably
well tested) parser context / tree bindings as the regular XML parsers.