Forum: Ruby Simple crawling/HTML extraction

Peter S. (Guest)
on 2006-05-15 12:22
(Received via mailing list)
Hello all,

I would like to accomplish the following task:

1) I have a bunch of links. Visit every link one by one
2) For every visited link, extract a string with a regexp from the page

    2) can be recursive: if one or more of the extracted strings is
    itself a link, visit it, extract strings, and so on - of course this
    should stop after a specified depth to prevent endless recursion

3) At the end, merge every extracted string into one list

The question is: do I have to do everything by hand, or are there some
higher-level APIs for such stuff? What should I look into (if I would
rather not reinvent the wheel)?

Cheers,
Peter
Aaron P. (Guest)
on 2006-05-15 22:19
(Received via mailing list)
On Mon, May 15, 2006 at 05:21:27PM +0900, Peter S. wrote:
>
> 3) At the end, merge every extracted string into one list
>
> The question is: do I have to do everything by hand, or are there some
> higher-level APIs for such stuff? What should I look into (if I would
> rather not reinvent the wheel)?

I suggest checking out WWW::Mechanize:

http://rubyforge.org/projects/mechanize/
http://mechanize.rubyforge.org/

Or just 'gem install mechanize'

Make sure to check out the examples in the RDoc.
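
A rough sketch of how the whole task might look with Mechanize (the
start URLs, the regexp, and the depth limit are placeholders taken from
the original question, not anything Mechanize prescribes; recent
releases use the class name Mechanize instead of WWW::Mechanize):

  require 'rubygems'
  require 'mechanize'

  PATTERN   = %r{http://example\.com/\S+}  # whatever you want to extract
  MAX_DEPTH = 2                            # stop recursing after this level

  agent   = WWW::Mechanize.new             # plain Mechanize.new in newer versions
  results = []
  visited = {}

  # Visit a page, collect every match of PATTERN, and recurse into
  # matches that are themselves links, up to MAX_DEPTH levels deep.
  crawl = lambda do |url, depth|
    return if depth > MAX_DEPTH || visited[url]
    visited[url] = true
    begin
      page = agent.get(url)
    rescue StandardError => e
      warn "skipping #{url}: #{e.message}"
      return
    end
    page.body.scan(PATTERN) do |match|
      results << match
      crawl.call(match, depth + 1) if match =~ %r{\Ahttp://}
    end
  end

  start_urls = %w[http://example.com/a http://example.com/b]
  start_urls.each { |url| crawl.call(url, 1) }
  puts results.uniq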

--Aaron
Justin C. (Guest)
on 2006-05-15 23:37
(Received via mailing list)
Peter S. wrote:
>
> 3) At the end, merge every extracted string into one list
>
> The question is: do I have to do everything by hand, or are there some
> higher-level APIs for such stuff? What should I look into (if I
> would rather not reinvent the wheel)?
>
> Cheers,
> Peter
>

I think Rubyful Soup [1] will be of interest to you if you don't want to
search the pages by hand.
Also, you'll probably want to look into the URI [2] and OpenURI [3]
standard libraries for fetching the webpages.
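
A minimal sketch of that combination, assuming the BeautifulSoup-style
API that Rubyful Soup provides (find_all and [] for attribute access)
and a placeholder URL:

  require 'open-uri'
  require 'rubygems'
  require 'rubyful_soup'

  url  = 'http://example.com/'        # placeholder start page
  html = open(url) { |f| f.read }     # open-uri lets Kernel#open fetch URLs

  soup  = BeautifulSoup.new(html)
  # pull the href out of every anchor tag instead of regexp-matching raw HTML
  links = soup.find_all('a').map { |tag| tag['href'] }.compact
  puts links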

HTH

-Justin


[1] http://www.crummy.com/software/RubyfulSoup/
[2] http://ruby-doc.org/stdlib/libdoc/uri/rdoc/index.html
[3] http://ruby-doc.org/stdlib/libdoc/open-uri/rdoc/index.html