Hello all,
I am investigating the possibilities of screen scraping/web extraction/
automated web navigation/wrapper generation in Ruby. I have been working
with these technologies for several years, but (unfortunately) only in
Java and partly in C/C++. I came to know Ruby a few months ago and I am
currently investigating the existing tools for the above tasks. Since I
have the feeling that I am not alone (this topic is brought up regularly
here, maybe not as often as the “how to create an Object from its
name” question, but it is close to that), I have summarized my findings
(the tools I have found, descriptions, examples, comparison etc.) in the
hope that they can help someone:
http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/
There you can find simple example solutions to the same problem
(scraping links from a Google result page) using regular expressions,
HTree+REXML, RubyfulSoup and WWW::Mechanize.
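To give a rough idea of the regular-expression variant, here is a
minimal sketch. It is not the code from the article: the sample HTML and
the extract_links method name are made up for illustration, and a real
result page is far messier than this. It only assumes that the links
appear as double-quoted href attributes inside <a> tags.

```ruby
# Hypothetical helper (not from the article): collect the targets of all
# <a ... href="..."> tags in a string of HTML using a regular expression.
def extract_links(html)
  # <a, whitespace, any other attributes (without leaving the tag),
  # then a double-quoted href whose value is captured.
  pattern = /<a\s[^>]*?href="([^"]*)"/i
  # scan returns one single-element array per match; flatten to strings.
  html.scan(pattern).flatten
end

# Made-up sample markup standing in for a downloaded result page.
sample_html = <<~HTML
  <div><a href="http://www.ruby-lang.org/">Ruby</a></div>
  <div><a class="l" href="http://www.rubyonrails.org/">Rails</a></div>
  <div><a name="top">an anchor without a link</a></div>
HTML

links = extract_links(sample_html)
puts links
```

A pattern like this breaks as soon as the markup uses single quotes or
unquoted attributes, which is one reason to prefer a real parser for
anything beyond a quick experiment.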
I am planning to write more entries on this topic, involving screen
scraping from Rails, Gecko to Ruby GTK widget embedding, wrapper
generation etc. Please note that I am new to Ruby, so it is possible
that my code snippets are not optimal yet (suggestions welcome), but
they are all tested and working.
Feedback/corrections/suggestions would be very much appreciated!
If you liked the story, you can digg it here:
http://www.digg.com/programming/Data_extraction_for_Web_2.0:_Screen_scraping_in_Ruby_Rails
Cheers,
Peter