Simple crawling/HTML extraction


#1

Hello all,

I would like to accomplish the following task:

  1. I have a bunch of links. Visit every link one by one

  2. For every visited link, extract a string with a regexp from the page

    1. can be recursive: If one/more of the extracted string(s) is
      a link, visit it, extract strings etc. etc. - of course this should
      stop after a specified level to prevent endless recursion
  3. At the end, merge every extracted string into one list

The question is: do I have to do everything by hand, or are there
higher-level APIs for this kind of thing? What should I look into (if I
would rather not reinvent the wheel)?

Cheers,
Peter


#2

On Mon, May 15, 2006 at 05:21:27PM +0900, Peter S. wrote:

  3. At the end, merge every extracted string into one list

The question is: do I have to do everything by hand, or are there
higher-level APIs for this kind of thing? What should I look into (if I
would rather not reinvent the wheel)?

I suggest checking out WWW::Mechanize:

http://rubyforge.org/projects/mechanize/
http://mechanize.rubyforge.org/

Or just ‘gem install mechanize’

Make sure to check out the examples in the RDoc.

–Aaron


#3

Peter S. wrote:

  3. At the end, merge every extracted string into one list

The question is: do I have to do everything by hand, or are there
higher-level APIs for this kind of thing? What should I look into (if I
would rather not reinvent the wheel)?

Cheers,
Peter

I think Rubyful Soup [1] will be of interest to you if you don’t want to
search the pages by hand.
Also, you’ll probably want to look into the URI [2] and OpenURI [3]
standard libraries for fetching the web pages.
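
To give a rough idea of the by-hand route with only the standard library, here is a minimal sketch of the visit/extract/recurse/merge steps. The names crawl, DEFAULT_FETCHER, max_depth, fetcher and seen are made up for illustration, and it assumes that an extracted string counts as a link when it starts with http:// or https://. Treat it as a starting point, not a finished crawler.

```ruby
require "open-uri"
require "set"
require "uri"

# Hypothetical default fetcher: returns the page body as a String,
# or nil if the URL cannot be parsed or fetched.
DEFAULT_FETCHER = lambda do |url|
  begin
    URI.parse(url).read
  rescue StandardError
    nil
  end
end

# Visits each URL, collects every regexp match, and follows matches that
# look like http(s) links until max_depth levels have been followed.
# Returns one merged list of extracted strings without duplicates.
def crawl(urls, pattern, max_depth: 2, fetcher: DEFAULT_FETCHER, seen: Set.new)
  results = []
  urls.each do |url|
    next if seen.include?(url) # never visit the same page twice
    seen << url

    body = fetcher.call(url)
    next if body.nil?

    # scan returns nested arrays when the pattern has capture groups
    matches = body.scan(pattern).flatten.compact
    results.concat(matches)
    next if max_depth <= 0 # depth limit reached: extract but do not follow

    # Assumption: an extracted string is a link if it starts with http(s)://
    links = matches.select { |m| m =~ %r{\Ahttps?://}i }
    results.concat(
      crawl(links, pattern, max_depth: max_depth - 1, fetcher: fetcher, seen: seen)
    )
  end
  results.uniq
end
```

Something like crawl(["http://example.com/"], /href="([^"]+)"/, max_depth: 1) would then return the merged list. The fetcher is passed in as a parameter so the recursion can be tried against canned pages without touching the network, and the seen set keeps pages that link to each other from being visited in a loop.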

HTH

-Justin

[1] http://www.crummy.com/software/RubyfulSoup/
[2] http://ruby-doc.org/stdlib/libdoc/uri/rdoc/index.html
[3] http://ruby-doc.org/stdlib/libdoc/open-uri/rdoc/index.html