Regexp help: matching HTML, having trouble with greediness


#1

All,

I am attempting to do some matching on some HTML.

Here’s what I want:

I want to be able to match any tag of the form <area …> that does NOT
contain a “mailto:” href. The grouping is for some substitution that
I’m doing.

Here’s my pattern:

/(<area .*?href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi

What I find is that this pattern successfully handles most area tags.
However, when an area tag is followed by another tag that also has an
href attribute, it will match everything between <area and the end of
that second tag. So, for example,

<area href="mailto: removed_email_address@domain.invalid>other stuff, including tags

this pattern matches all the way through the end of the tag.

I tried to stick a negative lookahead (?!<) at the end of the pattern
but that doesn’t seem to help.

How do I get this pattern to STOP matching?

Thanks for any help,
Wes


#2

This appears to work better:

/(<area [^>]*?)(href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi

The “any character but ‘>’” seems to stop the evaluation of the match
from making it any further than the end of the tag.
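For anyone following along, here is a minimal Ruby sketch of the difference between the two patterns. The sample HTML, the constant names, and the /redirect?to= rewrite are made up for illustration, and the character classes are written as ['"] without the literal | from the patterns above.

```ruby
# First attempt: `.*?` is lazy, but it may still run past ">" into the
# next tag when that is the only way to find a non-mailto href.
LOOSE = /(<area .*?href=['"])(?!mailto:)(.*?)(['"].*?>)/mi

# Revised attempt: `[^>]*?` can never cross the ">" that closes <area ...>.
TIGHT = /(<area [^>]*?)(href=['"])(?!mailto:)(.*?)(['"].*?>)/mi

# Made-up sample: a mailto area tag followed by an ordinary link.
SAMPLE = %q{<area href="mailto:someone@example.com"><a href="page.html">link</a>}

loose_match = LOOSE.match(SAMPLE)
tight_match = TIGHT.match(SAMPLE)

puts loose_match[0]        # runs from <area through the end of the <a ...> tag
puts tight_match.inspect   # nil: the mailto tag is skipped entirely

# Hypothetical substitution using the four groups of the revised pattern.
def rewrite_area_hrefs(html)
  html.gsub(TIGHT) { "#{$1}#{$2}/redirect?to=#{$3}#{$4}" }
end

puts rewrite_area_hrefs(%q{<area shape="rect" href="page.html" alt="x">})
```

The lazy quantifier only prefers the shortest match; it does not forbid a longer one, which is why the negated character class is what actually keeps the match inside the tag.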

Wes


#3

On 5/23/06, Wes G. removed_email_address@domain.invalid wrote:

Here’s my pattern:

/(<area .*?href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi

Let’s try
<area [^>]*?etc.
You will find that this solves this particular problem but creates
other ones, e.g. with this HTML

unless, and you tell me, that is not legal HTML.
Anyway, I feel that maybe regexen are not the right tool for your task
anymore, I don’t know.
But maybe you can work with the above regex anyway.
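One kind of input where [^>]*? gives up early is an attribute value that contains a literal ">". The Ruby sketch below uses a made-up tag to show that case, plus one possible workaround that treats each quoted value as a single unit. The names NEGATED_CLASS, TAG_UNIT and QUOTE_AWARE are hypothetical, and this is only a sketch, not a full HTML tokenizer.

```ruby
# Made-up tag: the alt value contains a literal ">".
TAG_WITH_GT = %q{<area alt="a > b" href="page.html">}

# The pattern with the plain negated class stops at the ">" inside alt.
NEGATED_CLASS = /(<area [^>]*?)(href=['"])(?!mailto:)(.*?)(['"].*?>)/mi

# One "unit" inside a tag: a whole quoted value, or a single character
# that is neither a quote nor ">".
TAG_UNIT = /"[^"]*"|'[^']*'|[^>"']/

# Same pattern, but the gap before href is built from those units.
QUOTE_AWARE = /(<area (?:#{TAG_UNIT})*?)(href=['"])(?!mailto:)(.*?)(['"].*?>)/mi

plain_result = NEGATED_CLASS.match(TAG_WITH_GT)
aware_result = QUOTE_AWARE.match(TAG_WITH_GT)

puts plain_result.inspect  # nil: the href is never reached
puts aware_result[3]       # the href value

# A mailto href is still rejected by the quote-aware version.
MAILTO_TAG = %q{<area alt="a > b" href="mailto:x@example.com"><a href="p.html">}
mailto_result = QUOTE_AWARE.match(MAILTO_TAG)
puts mailto_result.inspect
```

If the input never has ">" inside attribute values, the simpler [^>]*? version is enough.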
Cheers
Robert


Posted via http://www.ruby-forum.com/.


Two things are infinite: the universe and human stupidity; as for
the universe, I have not yet acquired absolute certainty.

  • Albert Einstein

#4

Robert,

I am pretty sure that it’s illegal to put a “>” in an HTML attribute
(you would achieve the desired effect with &gt;).

I agree that perhaps regexen are not the best way to manipulate this -
however, I’ve already been down the path of using an HTML parser
(RubyfulSoup/htmltools) and trying to output the resulting parse tree.
Unfortunately, that parser attempts to “fix” the HTML for me and when I
attempt to render the fixed HTML in a browser, it is “too fixed” for the
browser to handle.

So unfortunately, I’m forced to accept what I get and only change it so
that I don’t break any existing stuff.

Hence, I’m using regexen to do my manipulation.

Thanks for the help though!

Wes


#5

So I was just typing and typing and typing…
Glad you’re happy with it.
Robert



#6

On 5/23/06, Wes G. removed_email_address@domain.invalid wrote:

Robert,

I am pretty sure that it’s illegal to put a “>” in an HTML attribute

Well, then you might be flying with your solution.

(you would achieve the desired effect with &gt;).

I agree that perhaps regexen are not the best way to manipulate this -
however, I’ve already been down the path of using an HTML parser
(RubyfulSoup/htmltools) and trying to output the resulting parse tree.
Unfortunately, that parser attempts to “fix” the HTML for me and when I
attempt to render the fixed HTML in a browser, it is “too fixed” for the
browser to handle.

Not surprised :(

So unfortunately, I’m forced to accept what I get and only change it so
that I don’t break any existing stuff.

Hence, I’m using regexen to do my manipulation.

I was not thinking about throwing them away altogether ;) but maybe
scanning from one attribute to another inside a tag.
But thinking about my personal experience, sometimes I got through with
grep!
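Scanning from one attribute to another could look roughly like the Ruby sketch below: first pull out each area tag, then walk its attributes and decide based on the href. The names AREA_TAG, ATTRIBUTE, attributes_of and non_mailto_area_tags are hypothetical, the sample HTML is made up, and the sketch assumes quoted attribute values with no ">" inside them.

```ruby
# A whole <area ...> tag, assuming no ">" inside attribute values.
AREA_TAG = /<area\b[^>]*>/i

# One name="value" or name='value' pair inside a tag.
ATTRIBUTE = /([a-z][\w-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')/i

# Collect a tag's attributes into a Hash keyed by lowercased name.
def attributes_of(tag)
  tag.scan(ATTRIBUTE).map { |name, dq, sq| [name.downcase, dq || sq] }.to_h
end

# Keep only area tags that have an href which is not a mailto: link.
def non_mailto_area_tags(html)
  html.scan(AREA_TAG).select do |tag|
    href = attributes_of(tag)["href"]
    href && href !~ /\Amailto:/i
  end
end

# Made-up sample input.
PAGE = %q{<area href="mailto:a@example.com"><area shape="rect" href="a.html"><a href="b.html">x</a>}

puts non_mailto_area_tags(PAGE)
```

Splitting the job in two keeps each regex small, and the mailto check becomes an ordinary string test instead of a lookahead.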

Thanks for the help though!

Well, you found out for yourself ;)

Cheers
Robert

