RubyfulSoup vs Mechanize - Surprising Performance

All,

If anyone is thinking about using either of these packages for
screen-scraping, I think you should consider Mechanize as an option
over RubyfulSoup.

I was using RubyfulSoup to scrape HTML pages in a batch process where
performance didn't matter very much. Then I needed to port the
functionality into a user-facing process, where performance did become
an issue. RubyfulSoup was taking about 30 seconds to initialize/load
the page before any processing was done on it, which was unacceptable
for the user process.
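For context, my batch code was doing roughly this (a from-memory
sketch, not my actual code; the URL is a placeholder and I may be
misremembering RubyfulSoup API details):

  require 'rubygems'
  require 'rubyful_soup'
  require 'open-uri'

  html = open('http://example.com/big-page.html').read

  # The constructor parses the whole document up front;
  # this is where the ~30 seconds went on my pages.
  soup = BeautifulSoup.new(html)

  soup.find_all('a').each { |tag| puts tag['href'] }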

I started looking into other options. scrAPI was one that seemed
really promising, but I couldn't find enough documentation on it to
make much headway. It may be a good option for others who are more
familiar with CSS selectors, but that person isn't me.

I then looked into WWW::Mechanize. Most of the reading I found on the
internet was about using it to fill out forms and post data, and it was
hard to find good examples of parsing out text values, etc., but it
turned out to be a great option. WWW::Mechanize uses Hpricot for
querying the HTML document with XPath or CSS selectors.

In my opinion, RubyfulSoup is much easier to learn and use initially.
However, WWW::Mechanize is MUCH faster - at least for my needs. The
page that was taking over 30 seconds to load into RubyfulSoup takes
just a few seconds to load into Mechanize, and that is essentially the
time it takes to pull the page down from the source URL.
Parsing/searching/extracting is extremely fast and solved my
performance problems. I already knew XPath query syntax, so it was
pretty easy.
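To give a flavor, my extraction code ended up looking roughly like this
(a sketch rather than my actual code; the URL and class name are made
up, and the require line may vary with the gem version):

  require 'rubygems'
  require 'mechanize'   # the Hpricot-backed WWW::Mechanize (0.6.x era)

  agent = WWW::Mechanize.new
  page  = agent.get('http://example.com/listing.html')  # a few seconds, mostly network

  # page.search delegates to Hpricot, so XPath queries work directly
  page.search("//td[@class='price']").each do |cell|
    puts cell.inner_text.strip
  end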

Hopefully someone else can benefit from this before investing a lot of
time in RubyfulSoup only to find that it may have performance issues.

Regards,

Michael

Michael wrote:

Hopefully someone else can benefit from this before investing a lot of
time in RubyfulSoup only to find that it may have performance issues.

I was using regular expressions for some page-scraping, then found out
about RubyfulSoup. It seemed like the “proper” way to do things, but I
had to abandon it because, for my application, it was intolerably slow.
I have to deal with hundreds or thousands of pages, and if the parsing
takes much longer than the fetching (over a 0.5Mbit/s connection) that’s
no good for me.

regards

Justin F.

Michael wrote:

If anyone is thinking about using either of these packages for
screen-scraping, I think you should consider Mechanize as an option
over RubyfulSoup.

At a guess, I would use…

wget to pull down the page
tidy to convert it to XHTML
XPath from libxml or similar high-end parser

All three tools are written in C, not our beloved Ruby.

And no Perl, either…
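Roughly like this, perhaps (an untested sketch; the URL and filenames
are placeholders, and the libxml-ruby calls are from memory):

  require 'xml/libxml'

  url = 'http://example.com/page.html'

  # 1. wget pulls the page down
  system("wget -q -O page.html #{url}")

  # 2. tidy turns the tag soup into well-formed XHTML
  system('tidy -asxhtml -numeric -quiet -o page.xhtml page.html')

  # 3. libxml parses it and answers XPath queries; tidy adds the
  #    default XHTML namespace, so match on local-name()
  doc = XML::Document.file('page.xhtml')
  doc.find('//*[local-name()="a"]/@href').each { |attr| puts attr.value }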


Phlip

Michael,

You may spend a little time evaluating hpricot on your data:
http://code.whytheluckystiff.net/hpricot/

It's easy to learn and faster than RubyfulSoup (judging from the
benchmarks I found through Google).
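Something like this gets you going (an untested sketch; the URL is a
placeholder):

  require 'rubygems'
  require 'hpricot'
  require 'open-uri'

  doc = Hpricot(open('http://example.com/'))

  # the / operator takes CSS selectors (and a subset of XPath)
  (doc/'a').each do |link|
    puts "#{link.inner_text} -> #{link.attributes['href']}"
  end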

Alain

Yup, hpricot rocks.

Vish

Alain R. wrote:

You may spend a little time evaluating hpricot on your data:
http://code.whytheluckystiff.net/hpricot/

Alain,

I guess you didn't read my post closely enough! I found that Mechanize
is way faster than RubyfulSoup, and I stated that Mechanize uses
Hpricot for parsing. ;-) So… I have already spent a little time
evaluating it; that was the purpose of my post - to save others from
going down the slower path.

Thanks,

Michael

Justin F. wrote:

I had to abandon it because, for my application, it was intolerably
slow. I have to deal with hundreds or thousands of pages, and if the
parsing takes much longer than the fetching (over a 0.5Mbit/s
connection) that's no good for me.

Justin,

The parsing with Mechanize is extremely fast!

Michael

Michael wrote:

The parsing with Mechanize is extremely fast!

Thanks, I’ll take a look.

Justin

Is there a Ruby solution for spidering and scraping JavaScript-generated
pages, e.g. when a form and its options are built with JavaScript? I
have a job where I have to spider and scrape JavaScript-built pages,
and I wish I could do it with a Ruby solution. Any suggestions?

On 12/3/06, Victor R. [email protected] wrote:

Is there a Ruby solution for spidering and scraping JavaScript-generated
pages? Any suggestions?

I’ve never used it, but I’ve seen a Ruby extension:
http://raa.ruby-lang.org/project/ruby-js/