Forum: Ruby html stringScanner regexp

Tomas F. (Guest)
on 2006-05-04 03:12
Hi,

I want to parse an HTML string like this:

...
<div align=left><a href=# class=title> title1 </a></div>
...
<div align=left><a href=# class=title> title2 </a></div>
...
<div align=left><a href=# class=title> title3 </a></div>
...
<div align=left><a href=# class=title> title4 </a></div>
...

"..." means other stuff.
I need to extract title1 through title4, so I tried

ScannerScan.scan(/.*class=title>(.*)<\/a><\/div>\n/)

but I get only the last title, title4. Why? Is the regex wrong, or am I
missing the point of the scan method?

Best regards
Tomas
Mike F. (Guest)
on 2006-05-04 03:34
Tomas F. wrote:
[...]
>
> ScannerScan.scan(/.*class=title>(.*)<\/a><\/div>\n/) but I get
> only the last title, title4. Why? Is the regex wrong, or am I missing the
> point of the scan method?

The .* is greedy: it gobbles up as much of the source as it can (up to
the end of the string, if . is allowed to match newlines) and then the
regex engine backtracks just enough to match the last occurrence. You
might try this instead:

%r{ class=title>(.*?)</a></div>\n}

But remember that unless you can guarantee the formatting of your input
won't vary much, you're probably better off using a proper HTML parser to
handle HTML rather than regexen.
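
To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample input is made up (the "other stuff" lines stand in for the "..." in the question), and it assumes the original pattern ran with . matching newlines (the /m flag), which the thread does not state. The last part assumes the object in the question was a StringScanner, as the thread title suggests; StringScanner#scan only matches at the current scan pointer, so scan_until is used to skip ahead.

```ruby
require 'strscan'

# Made-up sample input; not the poster's real data.
html = <<~HTML
  <p>other stuff</p>
  <div align=left><a href=# class=title> title1 </a></div>
  <p>other stuff</p>
  <div align=left><a href=# class=title> title2 </a></div>
  <p>other stuff</p>
HTML

# Greedy pattern from the question, with /m assumed so that . also matches
# newlines. The leading .* swallows the whole string, then backtracks only
# far enough to find the LAST "class=title>", so one title comes back.
greedy = html.scan(/.*class=title>(.*)<\/a><\/div>\n/m).flatten.map(&:strip)

# Suggested pattern: no leading .*, and a non-greedy (.*?) capture that
# stops at the first closing </a></div> it meets.
lazy = html.scan(%r{ class=title>(.*?)</a></div>\n}).flatten.map(&:strip)

# Same pattern driven by a StringScanner: scan_until advances the scan
# pointer past each match and returns nil once nothing further matches.
scanner = StringScanner.new(html)
titles = []
while scanner.scan_until(%r{ class=title>(.*?)</a></div>\n})
  titles << scanner[1].strip # scanner[1] is the first capture group
end

p greedy
p lazy
p titles
```

With this input the greedy version returns only the last title, while the other two collect every title in order.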