HTML StringScanner regexp


#1

Hi,

I want to parse an HTML string like this:

title1
...
title2
...
title3
...
title4
...

“…” means other stuff.
I need to extract title1 to title4, so I tried

ScannerScan.scan(/.*class=title>(.*)<\/a><\/div>NEWLINE/) But I get
only the last title – title4. Why? Is the regex wrong, or am I missing the
point of the scan method?

Best regards
Tomas


#2

Tomas F. wrote:
[…]

ScannerScan.scan(/.*class=title>(.*)<\/a><\/div>NEWLINE/) But I get
only the last title – title4. Why? Is the regex wrong, or am I missing the
point of the scan method?

The .* is greedy: it gobbles up as much of the source as it can (up to
the end of the string), and then the regex engine backtracks just enough
to match the last occurrence. You might try this instead:

%r{ class=title>(.*?)\n}
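
To make the difference concrete, here is a minimal sketch comparing the greedy and non-greedy patterns. The markup in `html` is hypothetical (the original input did not survive in the post), and the variable names are made up for illustration, so treat this as a demonstration of the behaviour rather than a drop-in fix:

```ruby
require 'strscan'

# Assumed input: the real markup was not preserved, so this structure
# (unquoted class attribute, everything on one line) is only a guess.
html = '<div><a class=title>title1</a></div> ... ' \
       '<div><a class=title>title2</a></div> ... ' \
       '<div><a class=title>title3</a></div>'

# Greedy: the leading .* runs to the end of the string, then backtracks
# only as far as the LAST "class=title>", so one match is found.
greedy = html.scan(%r{.*class=title>(.*)</a></div>}).flatten
p greedy  # prints ["title3"]

# Non-greedy: (.*?) stops at the first closing tags after each title,
# so every title is captured in turn.
lazy = html.scan(%r{class=title>(.*?)</a></div>}).flatten
p lazy    # prints ["title1", "title2", "title3"]

# The same idea with StringScanner: scan_until skips ahead to the next
# match, and scanner[1] returns the first capture group of that match.
scanner = StringScanner.new(html)
titles = []
titles << scanner[1] while scanner.scan_until(%r{class=title>(.*?)</a></div>})
p titles  # prints ["title1", "title2", "title3"]
```

Note that StringScanner#scan only matches at the current scan position, which is why the sketch uses scan_until to jump from one title to the next.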

But remember that unless you can guarantee the formatting of your input
won’t vary much, you’re probably better off using a proper HTML parser to
handle HTML rather than regexen.