Hello all,
I would like to accomplish the following task:
- I have a bunch of links; visit every link one by one.
- For every visited link, extract a string with a regexp from the page.
- This can be recursive: if one or more of the extracted strings is a
  link, visit it, extract strings, and so on. This should of course
  stop after a specified level to prevent endless recursion.
- At the end, merge every extracted string into one list.
The question is: do I have to do everything by hand, or are there some
higher-level APIs for this kind of thing? What should I look into (if I
would rather not reinvent the wheel)?
Cheers,
Peter
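The steps above can be sketched in plain Ruby with the stdlib alone. This is a minimal sketch, not a finished crawler: PATTERN and the depth limit are illustrative, and the fetcher lambda is injectable so the traversal logic can be tried without a network.

```ruby
require 'open-uri'
require 'set'

# Illustrative regexp: extract anything that looks like an http(s) URL.
PATTERN = %r{https?://[^\s"'<>]+}

# Visit each URL, scan the page with a regexp, recurse into extracted
# strings that are themselves links, and stop after max_depth levels.
def crawl(urls, pattern: PATTERN, max_depth: 2, depth: 0,
          seen: Set.new, results: [],
          fetcher: ->(url) { URI.open(url).read })
  return results if depth > max_depth
  urls.each do |url|
    next unless seen.add?(url)     # Set#add? returns nil if already seen
    page = begin
      fetcher.call(url)
    rescue StandardError
      next                         # skip pages that fail to load
    end
    found = page.scan(pattern)
    results.concat(found)
    crawl(found.grep(%r{\Ahttps?://}),
          pattern: pattern, max_depth: max_depth, depth: depth + 1,
          seen: seen, results: results, fetcher: fetcher)
  end
  results.uniq                     # merge everything into one list
end
```

The `seen` set doubles as loop protection, so revisiting the same URL via two paths doesn't cause extra fetches even within the depth limit.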
On Mon, May 15, 2006 at 05:21:27PM +0900, Peter S. wrote:
> - At the end, merge every extracted string into one list
> The question is: do I have to do everything by hand or are there some
> higher level APIs for such stuff? What should I look into (if I would
> not like to reinvent the wheel)?
I suggest checking out WWW::Mechanize:
http://rubyforge.org/projects/mechanize/
http://mechanize.rubyforge.org/
Or just `gem install mechanize`
Make sure to check out the examples in the RDoc.
--Aaron
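For reference, the Mechanize version of the fetch-and-scan step is only a few lines. This is a sketch under assumptions: `pattern` stands in for whatever regexp you are extracting with, and newer releases of the gem use `Mechanize` as the class name where older ones used `WWW::Mechanize`.

```ruby
begin
  require 'mechanize'        # gem install mechanize
rescue LoadError
  warn 'mechanize gem not installed; the sketch below will not run'
end

# Fetch a page and scan its body with a regexp. If you would rather
# follow the page's anchors, page.links gives them already parsed
# (each link responds to #href and #text).
def scrape(url, pattern)
  agent = Mechanize.new      # WWW::Mechanize.new in older releases
  page  = agent.get(url)
  page.body.scan(pattern)
end
```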
Peter S. wrote:
> - At the end, merge every extracted string into one list
> The question is: do I have to do everything by hand or are there some
> higher level APIs for such stuff? What should I look into (if I
> would not like to reinvent the wheel)?
> Cheers,
> Peter
I think Rubyful Soup [1] will be of interest to you if you don’t want to
search the pages by hand.
Also, you’ll probably want to look into the URI [2] and OpenURI [3]
standard libraries for fetching the web pages.
HTH
-Justin
[1] Rubyful Soup: http://www.crummy.com/software/RubyfulSoup/ ("The brush has got entangled in it!")
[2] http://ruby-doc.org/stdlib/libdoc/uri/rdoc/index.html
[3] http://ruby-doc.org/stdlib/libdoc/open-uri/rdoc/index.html
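Justin's stdlib route can be sketched as below. open-uri handles the fetching, and the stdlib's `URI.extract` pulls every http/https link out of a string without a hand-rolled regexp; the HTML snippet here stands in for a fetched page body.

```ruby
require 'uri'
require 'open-uri'

# Extract all http/https links from a chunk of text with the stdlib.
def links_in(text)
  URI.extract(text, ['http', 'https'])
end

# With a network, the body would come from open-uri (URL illustrative):
#   body = URI.open('http://example.com/').read
body = '<a href="http://ruby-doc.org/">docs</a> plus http://rubyforge.org/'
links_in(body)
```

Note `URI.extract` stops at characters that are not legal in a URI (quotes, angle brackets), which is why it comes out cleanly even from raw HTML attributes.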