Screen-scraping

The number of monthly instances where I have to announce to various people
that yes, I’m going to work 40 hours a week as I always do, and lo and
behold, even on the project I’m assigned to, just like all the months
before, has finally reached two digits.

Since that also involves interacting with a slew? horde? school? mob? of
webapps, at various levels of being enterprisey (read: laborious and
unhelpful), I finally broke down and decided to hack myself a set of
screenscrapers to invoke at the proper times to fill in cookie-cutter
values for me.

Eeexcept I never did any screenscraping before - so I’m on the lookout for
a toolkit recommendation. So far, WATIR seems the most comprehensive, but
I’m afraid it might be overkill or plain unsuitable for this task (being
an acceptance testing tool). My main criterion is being able to handle the
fact that most of these apps are clickfests of ASP / ASP.NET provenance,
and the HTML source code could probably scare small children, so I’d like
the toolkit to handle most of the text munging.

So. Discuss :stuck_out_tongue_winking_eye: (Thanks in advance for any advice.)

David V.
Suffering the Death of a Thousand Papercuts

On Feb 19, 2007, at 11:31 AM, David V. wrote:

fill in cookie-cutter values for me.
I wholeheartedly understand you.

Eeexcept I never did any screenscraping before - so I’m on the
lookout for a toolkit recommendation. So far, WATIR seems the most
comprehensive, but I’m afraid it might be overkill or plain
unsuitable for this task (being an acceptance testing tool). My
main criterion is being able to handle the fact that most of these
apps are clickfests of ASP / ASP.NET provenance, and the HTML source
code could probably scare small children, so I’d like the toolkit to
handle most of the text munging.

I’d recommend Watir. Forget that it can be applied to testing; it is the
easiest and most robust way to do screen-scraping, because it delegates
all the parsing, JavaScript, etc. to a real browser. The drawback is
that you use an actual instance of IE/Firefox/Safari, but for some
applications like this one that is not an issue, and the ease of use
weighs more. The code will be easy, and it will work with your
lovely Project Central-kind of enterprisey thing.
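
For illustration, something along these lines ought to do it - the URL and
field names below are made up, so treat this as a rough sketch of the
classic Watir::IE API rather than tested code:

require 'watir'

# Fire up a real IE instance and let it handle all the parsing and JavaScript.
browser = Watir::IE.start('http://intranet.example.com/timesheet')

# Fill in the cookie-cutter values and submit the form.
browser.text_field(:name, 'hours').set('40')
browser.select_list(:name, 'project').select('Project Central')
browser.button(:value, 'Submit').click

browser.close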

– fxn

So. Discuss :stuck_out_tongue_winking_eye: (Thanks in advance for any advice.)

Well, there is a new kid on the block, scRUBYt! (DISCLAIMER: I am the
author, so I may be biased a bit ;-)), a web scraping framework based on
Mechanize and Hpricot. I am planning to add (or replace? not sure yet)
Mechanize with WATIR, so that it can handle JavaScript, too. If you can
do without JavaScript for a moment, I think scRUBYt! is an interesting
choice, because:

  1. Mechanize and Hpricot are super great in themselves - now sum the
    power of the two and multiply it by n (you decide the value of n - for
    all the people I got feedback from so far, it was much greater than 1
    :slight_smile:) because of the added functionality …

  2. scRUBYt! is easy to learn and use, quite powerful, has tons of docs
    (check out http://scrubyt.org), is nicely documented
    (http://scrubyt.rubyforge.org), unit tested, blackbox tested, etc. The
    API and the whole thing are designed to be extendable by your own stuff -
    and I am usually available for support if this is still not enough.

  3. I am planning to invest a lot of time into scRUBYt! - I am just
    releasing the next version as I write this mail, my TODO list has
    200+ items, and the community seems to be very active, so I have already
    got tons of bug reports, feature requests and even patches (and the whole
    thing has only been out for about 2 weeks)

  4. I am planning to launch a community site where (hopefully) the users
    will upload, tag, rate etc. the extractors they create - so this can
    also be an interesting thing if it works out.

A quick example:

=====================================================================
amazon_stuff = Scrubyt::Extractor.define do

  fetch          'http://www.amazon.com'
  fill_textfield 'field-keywords', 'logitech keyboard'
  choose_option  'url', 'Computers & PC Hardware'
  submit

  stuff do
    item_name "Logitech diNovo Edge ( 967685-0403 )"
    price     "$169.98"
  end
end

amazon_stuff.to_xml.write($stdout, 1)
Scrubyt::ResultDumper.print_statistics(amazon_stuff)
=====================================================================

output:

[MODE] learning
[ACTION] fetching document: http://www.amazon.com
[ACTION] typing logitech keyboard into the textfield named
‘field-keywords’
[ACTION] selecting option Computers & PC Hardware from the option list
‘url’
[ACTION] submitting form…
[ACTION] fetched
Amazon.com : logitech keyboard

Logitech diNovo Edge ( 967685-0403 ) $169.98
Logitech G15 Gaming Keyboard $77.74
Logitech Media Keyboard Elite- Black ( 967559-0403 ) $27.43
Logitech Cordless Desktop S510 $52.79
Logitech Cordless Desktop MX 3000 Laser (967553-0403) $60.93
Logitech Classic Keyboard $11.99
Logitech Cordless Desktop LX 300 $38.74
Logitech diNovo Cordless Desktop $104.99
Logitech Cordless Desktop MX 5000 Laser (967558-0403) $116.99
Logitech Media Keyboard
Logitech Cordless Desktop MX3200 Laser $76.98
Logitech G11 Gaming Keyboard $61.73
Logitech Cordless Desktop S 530 Laser for Mac ( 967664-0403 ) $67.94
Logitech Cordless Desktop Comfort Laser $77.81
Logitech Cordless Desktop EX110 ( 967561-0403 ) Used & new from $24.97
Sony Playstation 2 USB Keyboard
 stuff extracted 16 instances.
     item_name extracted 16 instances.
     price extracted 14 instances.

I think you get the idea… scRUBYt! hides all the ugly stuff (HTML,
XPaths, form names, whatnot) and figures out everything based on your
examples.

BTW, don’t try to run this example with 0.2.0 (the currently released
version); it needs 0.2.3, which I am going to release in a few hours.

scRUBYt! has many more features than this example suggests - if you are
interested, check out http://scrubyt.org.

Cheers,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.

On Mon, 19 Feb 2007 11:49:51 +0100, Xavier N. [email protected]
wrote:

I’d recommend Watir. Forget that it can be applied to testing; it is the
easiest and most robust way to do screen-scraping, because it delegates
all the parsing, JavaScript, etc. to a real browser. The drawback is
that you use an actual instance of IE/Firefox/Safari, but for some
applications like this one that is not an issue, and the ease of use weighs
more. The code will be easy, and it will work with your lovely Project
Central-kind of enterprisey thing.

Right, so Watir -can- be applied to more gory screenscraping too. The fact
that it requires IE is more of a benefit now that I think about it (I had
forgotten about the WATIR architecture since I first read about it in the
author’s first “I might or might not release this” announcement) - I’m
fairly sure at least one of those apps won’t work in Opera at all.

Thanks!

David V.

Hmm.

On Mon, 19 Feb 2007 12:01:01 +0100, Peter S.
[email protected]
wrote:

If you can do without javascript for a moment,

Having since dared the HTML, it seems at least for some of the apps, not
really.

  1. Mechanize and Hpricot are super great in themselves

I never actually used either, so I can only vaguely guess at the scope -
Hpricot doing the low-level parsing and cleanup, Mechanize the
higher-level data extraction from the result of that.

  1. I am planning to invest a lot of time into scRUBYt! - I am just
    releasing the next version as I write this mail, my TODO list has
    200+ items, and the community seems to be very active, so I have already
    got tons of bug reports, feature requests and even patches (and the whole
    thing has only been out for about 2 weeks)

Woo, new anal retention sink? (COWER BRIEF MORTALS.) Who knows, I might
even get around to actually fixing other people’s bugs in my spare time
when I’m not suppressing homicidal tendencies from doing so at worktime.
(Although I expect my spare hacking time will be spent coding said
screenscrapers in the near future.)

amazon_stuff = Scrubyt::Extractor.define do

fetch          'http://www.amazon.com'
fill_textfield 'field-keywords', 'logitech keyboard'
choose_option  'url', 'Computers & PC Hardware'
submit

I like this API.

stuff do

Where’d the stuff variable come from?

 item_name "Logitech diNovo Edge ( 967685-0403 )"

I love and adore my dNE too :stuck_out_tongue_winking_eye:

 price "$169.98"

end
end

amazon_stuff.to_xml.write($stdout, 1)
Scrubyt::ResultDumper.print_statistics(amazon_stuff)

Right, I suppose it goes on the List of Things To Try on the saner of the
webapps. And after that, Excel automation for the paperwork done -that- way
(unsurprisingly the most laborious of them all).

David V.

Having since dared the HTML, it seems at least for some of the apps, not
really.
Well, it is more about the navigation part: if you would like to scrape a
page where you have to log in first, and the login uses JS (like Google
Analytics, for example), you cannot do it with Mechanize (OK, you can
work around the JS and log in through a plain old HTML page in the case of
Google pages, but let’s suppose there is no ye good olde HTML login
possibility).

I never actually used either, so I can only vaguely guess at the scope -
Hpricot doing the low-level parsing and cleanup, Mechanize the
higher-level data extraction from the result of that.
Not exactly. Mechanize is used to do the navigation (log in, click this,
fill that, don’t touch those, submit the form, etc.) - so it gets you to the
page where you would like to actually do the scraping (in scRUBYt!,
those are the fetch, fill_textfield etc. commands).

Once you arrive at the page of interest, you can forget about
Mechanize: Hpricot takes over from this point. scRUBYt! figures out what
you are up to, turns it into XPaths, regexps and that sort of stuff, then
hands it all over to Hpricot to evaluate.
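
If you did the same split by hand it would look roughly like this - a
sketch only, the URL, form field and selectors are made up, and the exact
require names depend on your gem versions:

require 'mechanize'
require 'hpricot'

# Navigation part: Mechanize fetches pages, fills forms and submits them.
agent = WWW::Mechanize.new
page  = agent.get('http://www.example.com/search')
form  = page.forms.first
form['q'] = 'logitech keyboard'
result = agent.submit(form)

# Extraction part: Hpricot evaluates selectors on the fetched page.
doc = Hpricot(result.body)
(doc / 'div.item').each do |item|
  name  = (item % 'h3').inner_text
  price = (item % 'span.price').inner_text
  puts "#{name}: #{price}"
end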

amazon_stuff = Scrubyt::Extractor.define do

fetch          'http://www.amazon.com'
fill_textfield 'field-keywords', 'logitech keyboard'
choose_option  'url', 'Computers & PC Hardware'
submit

I like this API.
Great!

Where’d the stuff variable come from?
From the scraper’s creator. I could have written funky_ooze and the
difference would be that in the XML output you would see <funky_ooze>
tags instead of <stuff> tags.

So these names can be arbitrary; they are just used to hold your results.
The structure is more important: the fact that the other two things
(actually called patterns in scRUBYt! terminology), ‘item_name’ and
‘price’, are passed as a block to it means that they are logically stuff’s
children. This means that item_name’s and price’s input is stuff’s
output.
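
To make that concrete, here is a sketch (mine, not real output) of the same
extractor with the parent pattern renamed - only the tag names in the XML
change, and the child patterns still receive the parent pattern’s output as
their input:

amazon_stuff = Scrubyt::Extractor.define do
  fetch          'http://www.amazon.com'
  fill_textfield 'field-keywords', 'logitech keyboard'
  submit

  funky_ooze do                 # arbitrary parent pattern name
    item_name "Logitech diNovo Edge ( 967685-0403 )"  # child: scoped to funky_ooze's output
    price     "$169.98"
  end
end

# The XML output now contains <funky_ooze> elements wrapping <item_name> and <price>.
amazon_stuff.to_xml.write($stdout, 1)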

Right, I suppose it goes on the List of Things To Try on the saner of
the webapps. And after that, Excel automation for the paperwork done
-that- way (unsurprisingly the most laborious of them all).
Great! Feedback is highly appreciated, so LMK how it goes or if you get
stuck with something, etc.

Cheers,
Peter

__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.