Forum: Ruby String scan on html

bruno.bazzani (Guest)
on 2005-12-02 01:37
(Received via mailing list)
Please consider the following code:

require 'net/http'
Net::HTTP.start('weather.gmdss.org') do |http|
  response = http.get('/III.html')
  response.body.scan(/<a.*a>/) {|link| puts "#{link}\n\n"}
end

I expected each HTML link to be printed separately, but this is true only
for the first three. The others are grouped together in two groups.

This is what I get.
---------------------------------------
<a href="http://www.wmo.ch/index-fr.html" target="_blank"><img
src="image/WMO.jpg" alt="WMO" border="0"></a>

<a href="http://www.meteo.fr" target="_blank"><img src="image/mf.gif"
alt="MF" border="0"></a>

<a href="http://www.jcommweb.net/" target="_blank"><img
src="image/jcomm90.jpg" alt="JCOMM" border="0"></a>

<a class="local" href="index.html">HOME PAGE</a><br><a class="local"
href="I.html">METAREA I</a><br><a class="local" href="II.html">METAREA
II</a><br><a class="local" href="III.html"><b>METAREA III</b></a>
[cut]

<a href="index.html">HOME PAGE</a> - <a
href="III.html"><b>III</b></a></p>
[cut]
-----------------------------------------

Any help will be really appreciated.

Bruno
gavin (Guest)
on 2005-12-02 01:41
(Received via mailing list)
response.body.scan(/<a.*?a>/)

(Note the question mark.)

Read up on greedy versus non-greedy matching in regular expressions.
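
As a minimal sketch of the difference (the sample string below is made up for illustration and is not taken from the page in question), a greedy .* runs to the last "a>" it can reach, while the non-greedy .*? stops at the first one:

```ruby
# Hypothetical sample input: two links on a single line.
html = '<a href="I.html">METAREA I</a><br><a href="II.html">METAREA II</a>'

# Greedy: .* consumes as much as possible, so the match ends at the LAST "a>".
greedy = html.scan(/<a.*a>/)

# Non-greedy: .*? consumes as little as possible, so each match ends at the
# FIRST "a>" that completes it.
lazy = html.scan(/<a.*?a>/)

puts greedy.length  # 1 (both links in one match)
puts lazy.length    # 2 (one match per link)
```

Note that this pattern is still only an approximation of "an HTML link"; it is shown here purely to illustrate greedy versus non-greedy quantifiers.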
bruno.bazzani (Guest)
on 2005-12-02 02:13
(Received via mailing list)
> Read up on greedy versus non-greedy matching in regular expressions.

Thanks!
No need to say that I'm a newbie. I'm coding with the pickaxe manual at
my side and yes ... I missed the point: sorry.

But why did it work for the first three occurrences?

Bruno
M.B.Smillie (Guest)
on 2005-12-02 03:18
(Received via mailing list)
On Dec 2, 2005, at 0:10, Bruno Bazzani wrote:

>> Read up on greedy versus non-greedy matching in regular expressions.
>
> Thanks!
> No need to say that I'm a newbie. I'm coding with the pickaxe
> manual at my side and yes ... I missed the point: sorry.
>
> But why did it work for the first three occurrences?

Because they're on their own line to begin with, and by default the
dot (.) in a regular expression does not match a newline, so a greedy
match cannot run past the end of the line it started on.
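
A small sketch of that behaviour, using an invented three-line input rather than the real page: the greedy pattern from the original post still yields one match per link when each link sits on its own line, and only lumps links together when they share a line.

```ruby
# Hypothetical input: one link per line, then two links sharing a line.
html = "<a href=\"x.html\">X</a>\n" \
       "<a href=\"y.html\">Y</a>\n" \
       "<a href=\"p.html\">P</a><br><a href=\"q.html\">Q</a>\n"

# Without the /m option, . does not match "\n", so the greedy .* can
# never extend beyond the end of the current line.
matches = html.scan(/<a.*a>/)

matches.each { |m| puts m, "" }
puts matches.length  # 3 (the last match contains both P and Q)
```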

You will probably also want to include the multiline option (the /m
modifier, which lets . match newlines) in your expression, otherwise
you'll fail in situations like this as well:

<a href="foo.html">

un-necessary whitespace

</a>
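
A rough sketch of that case, assuming the non-greedy pattern suggested earlier in the thread and using the snippet above as the input string. In Ruby, /m is the option that makes . match newlines:

```ruby
# The link from the example above, spread over several lines.
html = "<a href=\"foo.html\">\n\nun-necessary whitespace\n\n</a>"

# Without /m the dot stops at the first newline, so no "a>" is ever reached.
without_m = html.scan(/<a.*?a>/)

# With /m the dot also matches "\n", so the match can span lines.
with_m = html.scan(/<a.*?a>/m)

puts without_m.length  # 0
puts with_m.length     # 1
```

With /m enabled the non-greedy quantifier matters even more, since a greedy .* would then be free to run from the first link on the page to the last.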


matthew smillie.