Forum: Ruby page reader

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
Thilankka M. (Guest)
on 2007-07-26 09:00
(Received via mailing list)
I was wondering if there's anything out there that can take a web page
full of text and search it for particular phrases I give it. Say, for
example, I have a web page or several linked web pages: I could say find
"how to initiate a class", or something like that. This might seem like
a silly question. Also, if I were to attempt something like that, would
I have to use Ruby on Rails?


thanks thilankka
Dan Z. (Guest)
on 2007-07-26 09:48
(Received via mailing list)
Thilankka M. wrote:
> I was wondering if there's anything out there that can take a web page
> full of text and search it for particular phrases I give it. Say, for
> example, I have a web page or several linked web pages: I could say find
> "how to initiate a class", or something like that. This might seem like
> a silly question. Also, if I were to attempt something like that, would
> I have to use Ruby on Rails?
>
>
> thanks thilankka
>

Well, there are two parts to that question. One is how to get the body
of a webpage in a string,

require 'net/http'
# Fetch the page body as a String (note: Net::HTTP.get does not follow redirects)
str = Net::HTTP.get URI.parse('http://www.ruby-lang.org/en/')

and the second is how to search within that string.

str.scan(/.*ruby.*/) do |match|
  puts "This line contained the word \"ruby\" in lower case:"
  puts match
end
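
If you want to search for a whole phrase rather than a single word, a
rough sketch along these lines might help. It assumes that crudely
stripping tags with a regex is good enough for your pages (it is not a
real HTML parser), and lines_containing is just a name I made up for
illustration:

```ruby
# Hypothetical helper: returns the lines of +html+ that contain +phrase+,
# ignoring case, HTML tags, and runs of whitespace.
def lines_containing(html, phrase)
  text = html.gsub(/<[^>]*>/, ' ')            # crude tag stripping, not a real parser
  pattern = /#{Regexp.escape(phrase)}/i       # escape the phrase, match case-insensitively
  text.each_line
      .map { |line| line.gsub(/\s+/, ' ').strip }  # collapse whitespace left by the tags
      .select { |line| line =~ pattern }
end

# Made-up sample input standing in for a fetched page.
sample = "<h1>Classes</h1>\n" \
         "<p>How to <b>initiate</b> a class in Ruby</p>\n" \
         "<p>Unrelated line</p>\n"

lines_containing(sample, 'how to initiate a class').each { |line| puts line }
```

Escaping the phrase with Regexp.escape means characters like "?" or "."
in what you type are matched literally instead of being treated as
regex syntax.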

This type of program is called a "screen scraper", and there are some
tools/libraries that help with writing one, if you want something
fancier than what I outlined above. WWW::Mechanize is the classic
(ported from Perl, I think), and the one I hear the most about is
Hpricot, an HTML parser (Mechanize uses it for its parsing). Mechanize
supports things like using cookies and automatically following
redirects, and Hpricot probably lets you search a page and strip the
HTML out of the result (though I may be misremembering--I have only
read about these programs). If you still want to write your own stuff
but get annoyed at how clunky Net::HTTP can be and how obscure its
errors are, have a look at http-access2 here:
http://raa.ruby-lang.org/project/http-access2/ .
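
For what it's worth, if following redirects is the only thing you are
missing, doing it by hand with plain Net::HTTP is not much code. This is
only a sketch: fetch_following_redirects is a name I made up, and the
limit of 5 hops is an arbitrary assumption.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch +url+ and return the body as a String,
# following at most +limit+ redirects (5 is an arbitrary default).
def fetch_following_redirects(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit <= 0

  response = Net::HTTP.get_response(URI.parse(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URL.
    next_url = URI.join(url, response['location']).to_s
    fetch_following_redirects(next_url, limit - 1)
  else
    raise "request failed: #{response.code} #{response.message}"
  end
end

# Usage (needs network access, so it is left commented out here):
# str = fetch_following_redirects('http://www.ruby-lang.org/en/')
```

The limit is there so that two pages redirecting to each other cannot
keep the script looping forever.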

There are plenty of other screen scraping libraries that I didn't
mention. Just use Google. Your real problem is how to actually do the
searching. For that, you will need to read up on regular expressions.
One simple test might be

matches = 0
[/how/, /build/, /initiate|instantiate/, /class/].each do |regex|
  matches += 1 if str =~ regex
end
if matches > 2
  puts str  # placeholder: do something with str here
end

The point is, you will need to be creative and research and probably
also ask for help. (The above does something with the string if it
matches 3 or 4 of the regular expressions in the list.)
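
Since you mentioned several linked pages, the same test can be wrapped
in a method and run over a handful of strings. This is a sketch with
made-up page names and contents standing in for real fetched pages; I
also added the /i flag so that "How" and "how" both count, which the
snippet above does not do.

```ruby
# Same list of patterns as above, made case-insensitive with /i.
PATTERNS = [/how/i, /build/i, /initiate|instantiate/i, /class/i]

# Hypothetical helper: how many of the patterns match +text+ at least once.
def match_count(text, patterns = PATTERNS)
  patterns.count { |regex| text =~ regex }
end

# Made-up stand-ins for page bodies you would normally fetch first.
pages = {
  'classes.html' => 'How to instantiate a class in Ruby',
  'install.html' => 'How to build Ruby from source',
  'news.html'    => 'Release announcement'
}

# Keep only the pages that match more than 2 of the patterns.
relevant = pages.select { |_name, text| match_count(text) > 2 }
relevant.each_key { |name| puts name }
```

Counting matches like this is a blunt tool, but it lets you rank pages
by score later instead of only keeping or dropping them.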

Good luck,
Dan
Thilankka M. (Guest)
on 2007-07-26 10:21
(Received via mailing list)
Thanks a lot. And yeah, I've got to do more research on the topic. I
will probably be periodically shouting for help when I start coding.
Thanks again.