Regexp help: Matching HTML, having trouble w/greediness

weyus · May 23, 2006, 8:05pm

All,

I am attempting to do some matching on some HTML.

Here’s what I want:

I want to be able to match any tag which is of the form <area
…> that does NOT contain a “mailto:” href. The grouping is for some
substitution that I’m doing.

Here’s my pattern:

/(<area .*?href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi

What I find is that this pattern successfully handles most area
tags. However, when an area tag is followed by an <a> tag (which also has
an href attribute), it matches everything from <area through the end of
the <a> tag. So, for example,

<area href="mailto: [email protected]>other stuff, including tags

this pattern matches all the way through the end of the <a> tag.

I tried to stick a negative lookahead (?!<) at the end of the pattern
but that doesn’t seem to help.

How do I get this pattern to STOP matching?

Thanks for any help,
Wes

weyus · May 23, 2006, 9:49pm

This appears to work better:

/(<area [^>]*?)(href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi

The “any character but ‘>’” seems to stop the evaluation of the match
from making it any further than the end of the tag.
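
A minimal sketch of the difference between the two patterns, assuming a made-up snippet (the address and file names are hypothetical, not from the real page). The character classes are written ['"] here, since a | inside square brackets is a literal pipe rather than alternation:

```ruby
# Hypothetical sample: a mailto <area> followed by an <a> tag with its own href.
SAMPLE = %q{<area shape="rect" href="mailto:someone@example.com">other stuff<a href="page.html">link</a>}

# First attempt: the lazy .*? before href= is still free to run past the '>'
# that closes <area ...> when that is the only way to satisfy (?!mailto:).
LOOSE = /(<area .*?href=['"])(?!mailto:)(.*?)(['"].*?>)/mi

# Revised: [^>]*? can never leave the <area ...> tag it started in.
STRICT = /(<area [^>]*?)(href=['"])(?!mailto:)(.*?)(['"].*?>)/mi

puts SAMPLE[LOOSE].inspect   # runs on into the <a> tag
puts SAMPLE[STRICT].inspect  # nil: the mailto area is skipped
```

Lazy quantifiers only prefer the shortest match; they still expand as far as needed for the rest of the pattern to succeed, which is why the negated class is what actually fences the match in.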

Wes

weyus · May 23, 2006, 9:55pm

On 5/23/06, Wes G. [email protected] wrote:

Here’s my pattern:

/(<area .*?href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi

Let’s try
<area [^>]*?etc.
You will find that this solves this particular problem but creates
other ones, e.g. with this HTML

unless, as you may tell me, that is not legal HTML.
Anyway, I feel that maybe regexen are not the right tool for your task
anymore, dunno.
But maybe you can work with the above regex anyway.
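
A rough sketch of one kind of input that trips up [^>]*?, assuming a quoted attribute value that contains a literal > (browsers generally accept that). The tag below is a hypothetical example, and QUOTE_AWARE is just one possible workaround that steps over quoted values as whole units:

```ruby
# Hypothetical tag with a '>' inside a quoted attribute value.
TRICKY = %q{<area alt="a > b" href="page.html">}

# [^>]*? stops at the '>' inside alt="...", so the href is never reached.
SIMPLE = /(<area [^>]*?)(href=['"])(?!mailto:)(.*?)(['"].*?>)/mi

# Treat "..." and '...' as single units so a '>' inside quotes is skipped.
# \3 is the quote character that opened the href value.
QUOTE_AWARE = /(<area\s(?:[^>"']|"[^"]*"|'[^']*')*?)(href=(["']))(?!mailto:)(.*?)(\3(?:[^>"']|"[^"]*"|'[^']*')*>)/mi

puts TRICKY[SIMPLE].inspect       # nil: the tag is missed
puts TRICKY[QUOTE_AWARE].inspect  # the whole tag
```

This only holds for tags whose quotes are balanced; with an unterminated quote the quote-aware pattern simply fails to match.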
Cheers
Robert

–
Posted via http://www.ruby-forum.com/.

–
Two things are infinite: the universe and human stupidity; as for the
universe, I have not yet acquired absolute certainty.

Albert Einstein

weyus · May 23, 2006, 10:06pm

Robert D. wrote:

So I was just typing and typing and typing…
Glad you're happy with it.
Robert

Robert,

I am pretty sure that it’s illegal to put a “>” in an HTML attribute
(you would achieve the desired effect with &gt;).

I agree that perhaps regexen are not the best way to manipulate this -
however, I’ve already been down the path of using an HTML parser
(RubyfulSoup/htmltools) and trying to output the resulting parse tree.
Unfortunately, that parser attempts to “fix” the HTML for me and when I
attempt to render the fixed HTML in a browser, it is “too fixed” for the
browser to handle.

So unfortunately, I’m forced to accept what I get and only change it so
that I don’t break any existing stuff.

Hence, I’m using regexen to do my write - manipulation.
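
Roughly what that looks like, as a sketch: gsub with a block, splicing the captured groups back together so only the href value changes and every other byte of the tag stays as it was. The helper name rewrite_area_hrefs and the "/go/" prefix are made up for illustration:

```ruby
AREA_HREF = /(<area [^>]*?)(href=['"])(?!mailto:)(.*?)(['"].*?>)/mi

# Hypothetical helper: yields each non-mailto href and puts the block's
# return value in its place, leaving the rest of the tag untouched.
def rewrite_area_hrefs(html)
  html.gsub(AREA_HREF) do
    m = Regexp.last_match
    "#{m[1]}#{m[2]}#{yield m[3]}#{m[4]}"
  end
end

page = %q{<map><area href="a.html" alt="A"><area href="mailto:x@example.com"></map>}
puts rewrite_area_hrefs(page) { |href| "/go/" + href }
```

Anything the pattern does not match, including the mailto areas, passes through unchanged.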

Thanks for the help though!

Wes

weyus · May 23, 2006, 10:21pm

On 5/23/06, Wes G. [email protected] wrote:

Robert,

I am pretty sure that it’s illegal to put a “>” in an HTML attribute

Well, then you might be flying with your solution.

(you would achieve the desired effect with &gt;).

I agree that perhaps regexen are not the best way to manipulate this -
however, I’ve already been down the path of using an HTML parser
(RubyfulSoup/htmltools) and trying to output the resulting parse tree.
Unfortunately, that parser attempts to “fix” the HTML for me and when I
attempt to render the fixed HTML in a browser, it is “too fixed” for the
browser to handle.

Not surprised

So unfortunately, I’m forced to accept what I get and only change it so
that I don’t break any existing stuff.

Hence, I’m using regexen to do my write - manipulation.

I was not thinking about throwing them away altogether, but maybe
scanning from one attribute to another inside a tag.
But thinking about my personal experience, sometimes I got through with
grep!
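
Something like this is what I mean by scanning attribute by attribute, as a rough sketch: first pick out each <area ...> tag, then walk its attributes with String#scan and decide per tag. The helper names and the sample markup are made up, and only quoted attribute values are handled:

```ruby
# A whole <area ...> tag; quoted values are stepped over as units.
AREA_TAG = /<area\b(?:[^>"']|"[^"]*"|'[^']*')*>/mi
# One name="value" or name='value' pair.
ATTRIBUTE = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/m

# Hypothetical helper: one Hash of attributes per <area> tag.
def area_attributes(html)
  html.scan(AREA_TAG).map do |tag|
    tag.scan(ATTRIBUTE).each_with_object({}) do |(name, dq, sq), attrs|
      attrs[name.downcase] = dq || sq
    end
  end
end

# Hypothetical helper: hrefs of the <area> tags that are not mailto links.
def non_mailto_hrefs(html)
  area_attributes(html).map { |attrs| attrs["href"] }.compact
                       .reject { |href| href =~ /\Amailto:/i }
end

page = %q{<area alt="a > b" href="page.html"><area href='mailto:x@example.com'><a href="other.html">}
p non_mailto_hrefs(page)
```

Two small patterns are easier to reason about than one big one, at the cost of an extra pass over each tag.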

Thanks for the help though!

Well you found out for yourself

Cheers
Robert
