So. Discuss (Thanks in advance for any advice.)
Well, there is a new kid on the block: scRUBYt! (DISCLAIMER: I am the
author, so I may be a bit biased ;-)), a web scraping framework based on
Mechanize and Hpricot. I am planning to add WATIR alongside Mechanize (or
replace Mechanize with it, not sure yet) so that it can handle JavaScript,
too. If you can do without JavaScript for the moment, I think scRUBYt! is
an interesting choice, because:
- Mechanize and Hpricot are super great in themselves - now sum the
  power of the two and multiply it by n (you decide the value of n - for
  all the people I have got feedback from so far it was much greater than
  1 because of the added functionality …
- scRUBYt! is easy to learn and use, quite powerful, has tons of docs
  (check out http://scrubyt.org), is nicely documented
  (http://scrubyt.rubyforge.org), unit tested, black-box tested etc. The
  API and the whole thing are designed to be extensible with your own
  stuff - and I am usually available for support if this is still not
  enough.
- I am planning to invest a lot of time into scRUBYt! - I am just
  releasing the next version as I write this mail, my TODO list has 200+
  items and the community seems to be very active, so I have already got
  tons of bug reports, feature requests and even patches (and the whole
  thing has been out for only about 2 weeks).
- I am planning to launch a community site where (hopefully) the users
  will upload, tag, rate etc. the extractors they create - so this could
  also be an interesting thing if it works out.
A quick example:
=====================================================================
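require 'rubygems'
require 'scrubyt'

amazon_stuff = Scrubyt::Extractor.define do
  # Navigation: open the page, fill in the search form and submit it
  fetch 'http://www.amazon.com'
  fill_textfield 'field-keywords', 'logitech keyboard'
  choose_option 'url', 'Computers & PC Hardware'
  submit

  # Extraction: one example of each thing we want from the result page
  stuff do
    item_name "Logitech diNovo Edge ( 967685-0403 )"
    price "$169.98"
  end
end

# Print the result as XML, then the number of instances per pattern
amazon_stuff.to_xml.write($stdout, 1)
Scrubyt::ResultDumper.print_statistics(amazon_stuff)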
output:
[MODE] learning
[ACTION] fetching document: http://www.amazon.com
[ACTION] typing logitech keyboard into the textfield named 'field-keywords'
[ACTION] selecting option Computers & PC Hardware from the option list 'url'
[ACTION] submitting form…
[ACTION] fetched
Amazon.com : logitech keyboard
Logitech diNovo Edge ( 967685-0403 )
$169.98
Logitech G15 Gaming Keyboard
$77.74
Logitech Media Keyboard Elite- Black ( 967559-0403 )
$27.43
Logitech Cordless Desktop S510
$52.79
Logitech Cordless Desktop MX 3000 Laser (967553-0403)
$60.93
Logitech Classic Keyboard
$11.99
Logitech Cordless Desktop LX 300
$38.74
Logitech diNovo Cordless Desktop
$104.99
Logitech Cordless Desktop MX 5000 Laser (967558-0403)
$116.99
Logitech Media Keyboard
Logitech Cordless Desktop MX3200 Laser
$76.98
Logitech G11 Gaming Keyboard
$61.73
Logitech Cordless Desktop S 530 Laser for Mac ( 967664-0403 )
$67.94
Logitech Cordless Desktop Comfort Laser
$77.81
Logitech Cordless Desktop EX110 ( 967561-0403 )
Used & new from $24.97
Sony Playstation 2 USB Keyboard
stuff extracted 16 instances.
item_name extracted 16 instances.
price extracted 14 instances.
I think you get the idea… scRUBYt! hides all the ugly stuff (HTML,
XPaths, form names, whatnot) and figures out everything based on your
examples.
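To give a rough feel for what "figuring things out from examples" can mean, here is a toy sketch that uses only REXML. It is not how scRUBYt! is implemented; the sample markup, the class names and the helper names (generalized_path, extract_like) are all made up for illustration. The idea: find the element whose text equals the example, build an index-free path to it, then collect everything else that matches that path.

```ruby
require 'rexml/document'

# Made-up, simplified stand-in for a search result page (not real Amazon markup).
PAGE = <<~'XML'
  <html><body>
    <div class="result"><span class="name">Keyboard A</span><span class="price">$10.00</span></div>
    <div class="result"><span class="name">Keyboard B</span><span class="price">$20.00</span></div>
    <div class="result"><span class="name">Keyboard C</span></div>
  </body></html>
XML

# Build an index-free path for an element, e.g.
# /html/body/div[@class='result']/span[@class='name']
def generalized_path(element)
  steps = []
  node = element
  # Walk up to (but not including) the document node.
  while node && !node.is_a?(REXML::Document)
    step = node.name
    css_class = node.attributes['class']
    step += "[@class='#{css_class}']" if css_class
    steps.unshift(step)
    node = node.parent
  end
  '/' + steps.join('/')
end

# Find the element whose text equals the example, generalize its path,
# and return the text of every element matching the generalized path.
def extract_like(doc, example)
  sample = REXML::XPath.match(doc, '//*').find { |e| e.text == example }
  return [] unless sample
  REXML::XPath.match(doc, generalized_path(sample)).map(&:text)
end

doc = REXML::Document.new(PAGE)
names = extract_like(doc, 'Keyboard A')
prices = extract_like(doc, '$10.00')
puts "item_name extracted #{names.size} instances."
puts "price extracted #{prices.size} instances."
```

In this toy page one example name yields three names and one example price yields only two prices, because the third result has no price. The statistics of the real run above (16 item names, 14 prices) show the same kind of gap.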
BTW, don't try to run this example with 0.2.0 (the currently released
version); it needs 0.2.3, which I am going to release in a few hours.
scRUBYt! has many more features than this example suggests - if you are
interested, check out http://scrubyt.org.
Cheers,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.