Extract information from a large text

I'm new to Ruby. I want to extract some useful information from a web
page to generate an RSS feed. My first instinct is to use a regular
expression like /sometext(.+?)sometext/. The problem is that I can only
get the first match for this regex. How can I iterate over multiple
matches?
Furthermore, is this kind of naive solution too slow for searching a
very large text? The performance requirements are high. Is there a
better way to do this?

On Wed, 5 Mar 2008 00:04:52 +0900, phoenix wrote:

I'm new to Ruby. I want to extract some useful information from a web
page to generate an RSS feed. My first instinct is to use a regular
expression like /sometext(.+?)sometext/. The problem is that I can only
get the first match for this regex. How can I iterate over multiple
matches?
Furthermore, is this kind of naive solution too slow for searching a
very large text? The performance requirements are high. Is there a
better way to do this?

I recommend that you use the hpricot gem, which lets you use XPath
expressions on HTML.

Install by typing this into a command line:

gem install hpricot

Here’s a little example that extracts the link texts from a Google
search:

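require 'rubygems'
require 'open-uri'
require 'hpricot'

# NOTE: the argument to open (from open-uri) must be the URL of the
# Google results page for "hpricot xpath".
g = Hpricot(open("hpricot xpath - Google Search"))
# Select every <a class="l"> element and print its text.
(g/"a[@class='l']").each { |hit|
  puts "#{(hit/"text()")}"
}
nil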

    Kristian

On 04.03.2008 16:04, phoenix wrote:

I'm new to Ruby. I want to extract some useful information from a web
page to generate an RSS feed. My first instinct is to use a regular
expression like /sometext(.+?)sometext/. The problem is that I can only
get the first match for this regex. How can I iterate over multiple
matches?

String#scan.
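
A minimal sketch of how String#scan walks over every match. The sample string and the <b> delimiters are made up for illustration and stand in for the "sometext" markers in the question:

```ruby
# Hypothetical sample input standing in for a downloaded page.
html = "<li><b>First item</b></li>\n<li><b>Second item</b></li>"

# With one capture group, scan returns an array of one-element arrays,
# one per match, so take the first capture of each.
titles = html.scan(/<b>(.+?)<\/b>/).map { |captures| captures.first }

# Given a block, scan yields each match's captures instead of
# building the array.
html.scan(/<b>(.+?)<\/b>/) { |(title)| puts title }
```

The block form avoids holding all matches in memory at once, which may matter on a large page.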

Furthermore, is this kind of naive solution too slow for searching a
very large text? The performance requirements are high. Is there a
better way to do this?

Try it out. Also consider Hpricot, as Kristian suggested.
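
One rough way to try it out is the Benchmark module from Ruby's standard library. This is only a sketch: the input is synthetic (a made-up snippet repeated many times), so real pages may behave differently:

```ruby
require 'benchmark'

# Synthetic input (an assumption): a made-up snippet repeated 100,000 times.
big_text = "<li><b>item</b></li>\n" * 100_000

matches = nil
# Benchmark.realtime returns the elapsed wall-clock time in seconds.
elapsed = Benchmark.realtime do
  matches = big_text.scan(/<b>(.+?)<\/b>/)
end

puts "#{matches.size} matches in #{'%.3f' % elapsed} seconds"
```

Timing the same text with both the regex and the parser approach gives a fairer comparison than guessing.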

Cheers

robert