Simple crawling/HTML extraction

Hello all,

I would like to accomplish the following task:

  1. I have a bunch of links. Visit every link one by one.

  2. For every visited link, extract a string from the page with a regexp.

     This can be recursive: if one or more of the extracted strings is
     itself a link, visit it, extract strings from it, and so on - of
     course this should stop after a specified depth to prevent endless
     recursion.

  3. At the end, merge every extracted string into one list.

The question is: do I have to do everything by hand, or are there
higher-level APIs for this kind of thing? What should I look into (if I
would rather not reinvent the wheel)?
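
(By "by hand" I mean roughly the sketch below, using only open-uri from
the standard library; the regexp, the seed links and the depth limit are
placeholders to fill in.)

require 'open-uri'
require 'uri'

# Placeholder: the regexp that pulls the interesting strings out of a page.
PATTERN = /href="([^"]+)"/

# Visit each link, collect every match, and recurse into matches that are
# themselves http(s) links, stopping at max_depth.
def crawl(links, max_depth, depth = 0, results = [])
  return results if depth > max_depth
  links.each do |link|
    begin
      html = URI.parse(link).read   # open-uri adds #read to http URIs
    rescue StandardError
      next                          # skip unreachable pages
    end
    matches = html.scan(PATTERN).flatten
    results.concat(matches)
    sublinks = matches.grep(%r{\Ahttps?://})
    crawl(sublinks, max_depth, depth + 1, results)
  end
  results.uniq
end

puts crawl(['http://example.com/'], 2)   # placeholder seed link and depth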

Cheers,
Peter

On Mon, May 15, 2006 at 05:21:27PM +0900, Peter S. wrote:

  3. At the end, merge every extracted string into one list.

The question is: do I have to do everything by hand, or are there
higher-level APIs for this kind of thing? What should I look into (if I
would rather not reinvent the wheel)?

I suggest checking out WWW::Mechanize:

http://rubyforge.org/projects/mechanize/
http://mechanize.rubyforge.org/

Or just 'gem install mechanize'.

Make sure to check out the examples in the RDoc.
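
A rough sketch of a single fetch-and-extract step (the regexp and the
starting URL are placeholders, not part of the library):

require 'rubygems'
require 'mechanize'

PATTERN = /href="([^"]+)"/       # placeholder regexp for the strings you want

agent = WWW::Mechanize.new
page  = agent.get('http://example.com/')   # placeholder starting URL

# page.body is the raw HTML, so your regexp works directly on it:
strings = page.body.scan(PATTERN).flatten

# Mechanize also parses the links for you, which helps with the recursion:
page.links.each do |link|
  puts link.href
  # agent.get(link.href) would follow it; track the depth yourself
end

p strings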

--Aaron

Peter S. wrote:

  3. At the end, merge every extracted string into one list.

The question is: do I have to do everything by hand, or are there
higher-level APIs for this kind of thing? What should I look into (if I
would rather not reinvent the wheel)?

Cheers,
Peter

I think Rubyful Soup [1] will be of interest to you if you don't want to
search the pages by hand.
Also, you'll probably want to look into the URI [2] and OpenURI [3]
standard libraries for fetching the webpages.
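
A rough sketch of how the two fit together; the Rubyful Soup names below
(the require name, BeautifulSoup, find_all, tag['href']) are from memory
of its Beautiful Soup-style API, so double-check them against its docs:

require 'open-uri'
require 'rubygems'
require 'rubyful_soup'   # assumed require name - check the project's docs

# OpenURI adds #read to http URIs; the URL is a placeholder.
html = URI.parse('http://example.com/').read

# Rubyful Soup mirrors Python's Beautiful Soup, so parsing and searching
# should look roughly like this:
soup = BeautifulSoup.new(html)
soup.find_all('a').each do |tag|
  puts tag['href']       # attribute access on a parsed tag
end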

HTH

-Justin

[1] Rubyful Soup: http://www.crummy.com/software/RubyfulSoup/
[2] http://ruby-doc.org/stdlib/libdoc/uri/rdoc/index.html
[3] http://ruby-doc.org/stdlib/libdoc/open-uri/rdoc/index.html