String scan on html

bruno.bazzani · December 2, 2005, 12:37am

Please consider the following code:

require 'net/http'
Net::HTTP.start('weather.gmdss.org') do |http|
  response = http.get('/III.html')
  response.body.scan(/<a.*a>/) { |link| puts "#{link}\n\n" }
end

I expected each HTML link to be printed separately, but that is true only
for the first three. The others are lumped together into two groups.

This is what I get.

HOME PAGE
METAREA I
METAREA II
METAREA III
[cut]

HOME PAGE - III

[cut]

Any help will be really appreciated.

Bruno

bruno.bazzani · December 2, 2005, 12:41am

response.body.scan(/<a.*?a>/)

(Note the question mark.)

Read up on greedy versus non-greedy matching in regular expressions.
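
A minimal sketch of the difference, assuming a made-up one-line HTML string (the html value and its link targets are hypothetical, not taken from the page above):

```ruby
# Hypothetical sample: two links on a single line.
html = '<a href="/I.html">METAREA I</a> | <a href="/II.html">METAREA II</a>'

# Greedy: .* grabs as much as it can, so the match runs from the first <a
# to the last a> on the line and both links come back as one string.
greedy = html.scan(/<a.*a>/)

# Non-greedy: .*? stops at the first a> it can reach, so each link
# becomes its own match.
lazy = html.scan(/<a.*?a>/)

puts greedy.length  # 1
puts lazy.length    # 2
```

The only change between the two patterns is the question mark after the star.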

bruno.bazzani · December 2, 2005, 1:13am

Read up on greedy versus non-greedy matching in regular expressions.

Thanks!
Needless to say, I'm a newbie. I'm coding with the Pickaxe manual at my
side and yes … I missed the point: sorry.

But why did it work for the first three occurrences?

Bruno

bruno.bazzani · December 2, 2005, 2:18am

On Dec 2, 2005, at 0:10, Bruno Bazzani wrote:

Read up on greedy versus non-greedy matching in regular expressions.

Thanks!
Needless to say, I'm a newbie. I'm coding with the Pickaxe
manual at my side and yes … I missed the point: sorry.

But why did it work for the first three occurrences?

Because they're each on their own line to begin with, and by default
the dot (.) does not match a newline, so even a greedy .* cannot run
past the end of a line.

You will probably also want to add the multiline option (/m) to your
expression, otherwise it will fail in situations like this as well:

un-necessary whitespace
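
A rough sketch of what the multiline option changes, assuming a hypothetical link whose text is split across lines (the html string here is invented for illustration):

```ruby
# Hypothetical sample: one link whose text spans several lines.
html = "<a href=\"/III.html\">\n  METAREA III\n</a>"

# Without /m, . does not match "\n", so the pattern can never get from
# the opening <a to the closing a> and nothing is found.
single_line = html.scan(/<a.*?a>/)

# With /m, . also matches "\n", so the whole link is one match.
multi_line = html.scan(/<a.*?a>/m)

puts single_line.length  # 0
puts multi_line.length   # 1
```

Keeping the non-greedy .*? matters even more with /m, since a greedy .* could then run from the first link on the page to the last.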

matthew smillie.