I'm new to Ruby. I want to extract some useful information from a web
page to generate an RSS feed. My first instinct is to use a regular
expression like /sometext(.+?)sometext/. The problem is that I can only
get the first match for this regex. How can I iterate over multiple
matches?
Furthermore, is this kind of naive solution too slow for searching a
very large text? The performance requirements are high. Is there a
better way to do this?
On Wed, 5 Mar 2008 00:04:52 +0900, phoenix wrote:
I'm new to Ruby. I want to extract some useful information from a web
page to generate an RSS feed. My first instinct is to use a regular
expression like /sometext(.+?)sometext/. The problem is that I can only
get the first match for this regex. How can I iterate over multiple
matches?
Furthermore, is this kind of naive solution too slow for searching a
very large text? The performance requirements are high. Is there a
better way to do this?
–
I recommend that you use the hpricot gem, which lets you use XPath
expressions on HTML.
Install by typing this into a command line:
gem install hpricot
Here’s a little example that extracts the link texts from a Google
search:
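require 'rubygems'
require 'open-uri'
require 'hpricot'
# Replace the string below with the URL of the search results page.
g = Hpricot(open("hpricot xpath - Google Search"))
# Select every <a class="l"> element and print its text content.
(g/"a[@class='l']").each { |hit|
  puts "#{(hit/"text()")}"
}
nil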
Kristian
On 04.03.2008 16:04, phoenix wrote:
I'm new to Ruby. I want to extract some useful information from a web
page to generate an RSS feed. My first instinct is to use a regular
expression like /sometext(.+?)sometext/. The problem is that I can only
get the first match for this regex. How can I iterate over multiple
matches?
String#scan.
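A minimal sketch of how String#scan might be applied to this kind of pattern. The sample text and the START/END delimiters are made up for illustration and stand in for whatever surrounds the data on the real page:

```ruby
# Hypothetical sample text; "START" and "END" are placeholder delimiters.
text = "START first END junk START second END"

# With one capture group, scan returns an array of one-element arrays,
# one per match, instead of stopping at the first match.
matches = text.scan(/START(.+?)END/)

# Flatten to plain strings and trim the surrounding whitespace.
items = matches.flatten.map { |s| s.strip }

# scan also takes a block, which is called once per match with the
# array of captured groups.
text.scan(/START(.+?)END/) { |groups| puts groups[0].strip }
```

The block form avoids building the full result array, which may matter when the input is large.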
Furthermore, is this kind of naive solution too slow for searching a
very large text? The performance requirements are high. Is there a
better way to do this?
Try it out and measure. Also consider Hpricot, as Kristian suggested.
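One way to try it out is to time the scan with the standard Benchmark module. This is only a rough sketch: the input below is synthetic (a small made-up snippet repeated many times), so the numbers say nothing about a real page until the real text is substituted in.

```ruby
require 'benchmark'

# Hypothetical input: a small delimited snippet repeated 100,000 times.
big_text = "START item END filler " * 100_000

found = nil
# Benchmark.realtime returns the elapsed wall-clock time in seconds.
elapsed = Benchmark.realtime do
  found = big_text.scan(/START(.+?)END/)
end

puts "#{found.size} matches in #{elapsed} seconds"
```

Running the same measurement against an Hpricot-based version would show which approach is fast enough for the actual data.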
Cheers
robert