Performance comparison between screen scrapers

chuboy · January 11, 2007, 9:36am

Does anyone know how the following screen scrapers perform against one
another?

ScrAPI
RubyfulSoup
HTree
Hpricot

I’m trying to write a tool where a person enters a URL, and I use an
AJAX call to scrape the contents of that URL for title, description,
etc. So speed is really important. (I suppose regular expressions would
be the fastest, but I need something that is tree-based and supports
HTML tidying.)
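
For reference, the regular-expression baseline mentioned above might look roughly like the sketch below. It uses only core Ruby; the method name `quick_scrape` and the sample page are made up for illustration. It builds no tree and does no tidying, and it assumes the `name` attribute comes before `content` in the meta tag, so it is only a speed baseline to compare the real parsers against.

```ruby
# Pull the <title> text and the meta description out of raw HTML with
# regular expressions. Baseline only: no tree, no tidying, and badly
# malformed markup can defeat it.
def quick_scrape(html)
  # String#[] with a regexp and a capture index returns that group or nil.
  title = html[%r{<title[^>]*>(.*?)</title>}mi, 1]
  # Assumption: name="description" appears before content="..." in the tag.
  description = html[/<meta\s+name=["']description["']\s+content=["'](.*?)["']/mi, 1]
  {
    title: title && title.strip,
    description: description && description.strip
  }
end

# Hypothetical sample page used only to exercise the method.
SAMPLE_HTML = <<~HTML
  <html><head>
    <title> Example Page </title>
    <meta name="description" content="A sample page">
  </head><body><p>Hello</p></body></html>
HTML

RESULT = quick_scrape(SAMPLE_HTML)
puts RESULT[:title]
puts RESULT[:description]
```

Either value comes back as `nil` when the pattern does not match, so the caller can fall back to a tree-based parser for pages the regex cannot handle.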

Thanks
Conrad

chuboy · January 11, 2007, 10:52am

There was a comparison done on this list some time ago. Search the
archives for the library names.

chuboy · January 11, 2007, 11:26am

I don’t know about ScrAPI or HTree, but I recently blogged an informal
benchmark run between Rubyful Soup, Hpricot, and the (still
developmental) libxml2 HTML parser binding in Libxml-ruby. It’s at:

http://cloverhead.blogspot.com/2006/12/bit-of-benchmarking.html
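
If you want to repeat that kind of informal comparison on your own pages, a harness along these lines may be enough. This is a sketch, not the code from the blog post: `CANDIDATES`, `compare_scrapers`, and the sample page are hypothetical names, and only a regex callable is filled in. A callable wrapping Hpricot, Rubyful Soup, or another parser would be added to the hash the same way.

```ruby
require 'benchmark'

# Each candidate maps a label to a callable that takes HTML and returns
# the page title. Only a regex version is filled in here; parser-based
# callables would be added as further entries.
CANDIDATES = {
  'regex' => ->(html) { html[%r{<title[^>]*>(.*?)</title>}mi, 1].to_s.strip }
}

# Time `runs` calls of every candidate against the same document and
# return a hash of label => elapsed wall-clock seconds.
def compare_scrapers(candidates, html, runs = 1_000)
  candidates.each_with_object({}) do |(label, scraper), timings|
    timings[label] = Benchmark.realtime { runs.times { scraper.call(html) } }
  end
end

# Hypothetical test page; real measurements should use saved copies of
# the pages you actually expect to scrape.
PAGE = '<html><head><title>Benchmark me</title></head><body></body></html>'
TIMINGS = compare_scrapers(CANDIDATES, PAGE)
TIMINGS.each { |label, secs| puts format('%-10s %.4fs', label, secs) }
```

Timing against saved files rather than live URLs keeps network latency out of the numbers, which would otherwise swamp any difference between parsers.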

chuboy · January 19, 2007, 4:31pm

I haven’t used them all, but Hpricot is fast (its parser is written in
C with Ragel), error-tolerant, and a good fit for this task. Take a look
at its website for a guide on how to use it.