Article on screen scraping w HTree+REXML, RubyfulSoup, WWW::


#1

Hello all,

I am investigating the possibilities of screen scraping/web extraction/
automated web navigation/wrapper generation in Ruby. I have been working
with these technologies for several years, but (unfortunately) only in
Java and partly C/C++. I came to know Ruby a few months ago and I am
currently investigating the existing tools for the above tasks. Since I
have the feeling that I am not alone (this topic is brought up regularly
here, maybe not as often as “how to create an Object from its
name”, but it is close to that :wink:), I have summarized my findings
(tools that I have found, descriptions, examples, comparison etc.);
maybe it can help someone.

http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/

You can find simple example solutions to the same problem (scraping
links from a Google result page) with regular expressions, HTree+REXML,
RubyfulSoup and WWW::Mechanize.
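
To give a flavour of the simplest of these approaches, here is a minimal sketch of the regular-expression variant. It is not the code from the article: the `SAMPLE_HTML` string and the `extract_links` helper are made up for illustration, and the pattern only handles quoted `href` attributes, so treat it as a starting point rather than a robust parser.

```ruby
# Hypothetical stand-in for a downloaded result page (no network access needed).
SAMPLE_HTML = <<~HTML
  <html><body>
    <a class="l" href="http://www.ruby-lang.org/">Ruby</a>
    <a href='http://www.rubyonrails.org/'>Rails</a>
    <a name="top">anchor without href</a>
  </body></html>
HTML

# Return the value of the href attribute of every <a> tag in the HTML string.
# Assumes the attribute value is wrapped in single or double quotes.
def extract_links(html)
  html.scan(/<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/i).flatten
end

puts extract_links(SAMPLE_HTML)
```

Regular expressions like this break easily on unusual markup, which is one reason to prefer a real parser for anything beyond a quick script.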

I am planning to write more entries on this topic, involving screen
scraping from Rails, Gecko to Ruby GTK widget embedding, wrapper
generation etc. Please note that I am new to Ruby, so it is possible that
my code snippets are not the most optimal yet (suggestions welcome), but
they are all tested and working.

Feedback/corrections/suggestions would be very much appreciated!

If you liked the story, you can digg it here:

http://www.digg.com/programming/Data_extraction_for_Web_2.0:_Screen_scraping_in_Ruby_Rails

Cheers,
Peter


#2

On Jun 14, 2006, at 5:07 AM, Peter S. wrote:

http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/

This was a very good article. Thank you for sharing it with us.

Please note that I am new to Ruby, so it is possible that
my code snippets are not the most optimal yet (suggestions welcome),

Well, you sometimes declare variables inThisStyle, but Rubyists use
this_style_here.
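
As a small illustration of the convention, the sketch below converts a camelCase name to snake_case. The `snake_case` helper is made up for this example and is not part of Ruby's core library; it assumes plain ASCII identifiers.

```ruby
# Convert a camelCase identifier into the snake_case style Rubyists use
# for local variables and method names.
def snake_case(name)
  # Insert an underscore between a lowercase letter (or digit) and the
  # uppercase letter that follows it, then lowercase the whole string.
  name.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
end

puts snake_case("inThisStyle")  # prints "in_this_style"
```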

James Edward G. II


#3

James Edward G. II wrote:

On Jun 14, 2006, at 5:07 AM, Peter S. wrote:

http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/

This was a very good article. Thank you for sharing it with us.

Thx!

Well, you sometimes declare variables inThisStyle, but Rubyists use
this_style_here.

Thanks for the suggestion, I’ll update it ASAP. (I come from the Java
camp, which is why the camelsAreStillHaunting :wink:)

Peter