Forum: Ruby Regexp help: Matching HTML having trouble w/greediness

Wes Gamble (weyus)
on 2006-05-23 20:05
All,

I am attempting to do some matching on some HTML.

Here's what I want:

I want to be able to match any <area> tag which is of the form <area
...> that does NOT contain a "mailto:" href.  The grouping is for some
substitution that I'm doing.

Here's my pattern:

/(<area .*?href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi

What I find is that this pattern will successfully handle most area
tags. However, when an area tag is followed by an <a> tag (which also has
an href attribute), it will match everything between <area and the end of
the <a> tag.  So, for example,

<area href="mailto: xyz@abc.com>other stuff, including tags<a
href="blah">

this pattern matches all the way through the end of the <a> tag.

I tried to stick a negative lookahead (?!<) at the end of the pattern
but that doesn't seem to help.

How do I get this pattern to STOP matching?

Thanks for any help,
Wes
Wes Gamble (weyus)
on 2006-05-23 21:49
This appears to work better:

/(<area [^>]*?)(href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi

The [^>]*? ("any character but '>'") seems to stop the match from
running any further than the end of the <area> tag.

Wes
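
To see the difference concretely, here is a minimal Ruby sketch. The sample string and variable names are made up for illustration, and the mailto value has its closing quote restored. With the original pattern, once (?!mailto:) rejects the first href, the lazy .*? is free to keep expanding until it reaches the next href, which belongs to the <a> tag. [^>]*? cannot cross the closing > of the <area> tag, so the revised pattern simply fails there instead.

```ruby
# Hypothetical sample: a mailto <area> followed by an unrelated <a> tag.
html = %q{<area href="mailto:xyz@abc.com">other stuff<a href="blah">}

original = /(<area .*?href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi
revised  = /(<area [^>]*?)(href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi

# Original: the lookahead rejects the mailto href, so .*? stretches on
# to the href of the <a> tag and the match swallows both tags.
over_match = original.match(html)
p over_match[0]
p over_match[2]

# Revised: [^>]*? cannot leave the <area> tag, so there is no match.
revised_match = revised.match(html)
p revised_match

# A non-mailto <area> tag still matches; group 3 holds the href value.
ok_tag = %q{<area shape="rect" href="page.html" alt="x">}
ok_match = revised.match(ok_tag)
p ok_match[3]
```

As a side note, | is a literal character inside a character class, so ['|"] also accepts a pipe as the "quote"; ['"] is probably what was intended.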
Robert Dober (Guest)
on 2006-05-23 21:55
(Received via mailing list)
On 5/23/06, Wes Gamble <weyus@att.net> wrote:
>
> Here's my pattern:
>
> /(<area .*?href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi


Let's try
 <area [^>]*?etc.
You will find that this solves this particular problem but creates
other ones, e.g. with this HTML
 <area some_att=">" href="mine:home">
unless, and you tell me, that is not legal HTML.
Anyway, I feel that maybe regexen are not the right tool for your task
anymore, dunno.
But maybe you can work with the above regex anyway.
Cheers
Robert
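
A small Ruby sketch of the problem case above, plus one possible workaround. The quote_aware pattern is only an illustration, not something proposed in this thread: it assumes attribute values are always quoted and consumes each quoted string as a unit, so a ">" inside a value does not end the tag. All variable names are hypothetical.

```ruby
tricky = %q{<area some_att=">" href="mine:home">}

revised = /(<area [^>]*?)(href=['|"])(?!mailto:)(.*?)(['|"].*?>)/mi

# [^>]*? stops at the ">" inside some_att, so the href is never reached.
revised_match = revised.match(tricky)
p revised_match

# Assumption: every attribute value is quoted. Quoted strings are skipped
# whole; any other character except ">" and quotes is skipped one at a time.
# Group 3 remembers the opening quote so \3 can require the same closing one.
quote_aware = /(<area\s(?:"[^"]*"|'[^']*'|[^>"'])*?)(href=(["']))(?!mailto:)(.*?)(\3[^>]*>)/mi

aware_match = quote_aware.match(tricky)
p aware_match[4]

# The mailto case from the original question is still rejected.
mailto = %q{<area href="mailto:xyz@abc.com">other stuff<a href="blah">}
mailto_match = quote_aware.match(mailto)
p mailto_match
```

This shifts the group numbers (the href value is now group 4), so any substitution that relies on the groups would need adjusting.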

--
Two things are infinite: the universe and human stupidity; and as far as
the universe is concerned, I have not yet acquired absolute certainty.

- Albert Einstein
Robert Dober (Guest)
on 2006-05-23 21:55
(Received via mailing list)
So I was just typing and typing and typing...
Glad you're happy with it.
Robert


Wes Gamble (weyus)
on 2006-05-23 22:06
Robert,

I am pretty sure that it's illegal to put a ">" in an HTML attribute
(you would achieve the desired effect with &gt;).

I agree that perhaps regexen are not the best way to manipulate this -
however, I've already been down the path of using an HTML parser
(RubyfulSoup/htmltools) and trying to output the resulting parse tree.
Unfortunately, that parser attempts to "fix" the HTML for me and when I
attempt to render the fixed HTML in a browser, it is "too fixed" for the
browser to handle.

So unfortunately, I'm forced to accept what I get and only change it so
that I don't break any existing stuff.

Hence, I'm using regexen to do my write - manipulation.

Thanks for the help though!

Wes
Robert Dober (Guest)
on 2006-05-23 22:21
(Received via mailing list)
On 5/23/06, Wes Gamble <weyus@att.net> wrote:
> Robert,
>
> I am pretty sure that it's illegal to put a ">" in an HTML attribute
> (you would achieve the desired effect with &gt;).

Well, then you might be flying with your solution.

> I agree that perhaps regexen are not the best way to manipulate this -
> however, I've already been down the path of using an HTML parser
> (RubyfulSoup/htmltools) and trying to output the resulting parse tree.
> Unfortunately, that parser attempts to "fix" the HTML for me and when I
> attempt to render the fixed HTML in a browser, it is "too fixed" for the
> browser to handle.

Not surprised :(

> So unfortunately, I'm forced to accept what I get and only change it so
> that I don't break any existing stuff.
>
> Hence, I'm using regexen to do my write - manipulation.

I was not thinking about throwing them away altogether ;) but maybe
scanning from one attribute to another inside a tag.
But thinking about my personal experience, sometimes I got through with
grep!
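
One way to read "scanning from one attribute to another inside a tag" is a two-step substitution: first isolate each <area> tag, then rewrite the href within that one tag only. The Ruby sketch below is just an illustration under that reading. The method name rewrite_area_hrefs, the sample HTML and the redirect URL are all hypothetical, and it assumes no literal ">" appears inside attribute values.

```ruby
# Hypothetical helper: rewrite the href of every non-mailto <area> tag.
# The caller's block receives the old href value and returns the new one.
def rewrite_area_hrefs(html)
  # Step 1: isolate each <area ...> tag (assumes no ">" inside attribute values).
  html.gsub(/<area\s[^>]*>/mi) do |tag|
    # Step 2: within that single tag, rewrite the href unless it is mailto:.
    tag.sub(/(href=(['"]))(?!mailto:)(.*?)\2/mi) do
      m = Regexp.last_match
      "#{m[1]}#{yield m[3]}#{m[2]}"
    end
  end
end

html = %q{<area href="mailto:a@b.c"><area shape="rect" href="page.html"><a href="other.html">}
result = rewrite_area_hrefs(html) { |url| "/redirect?to=#{url}" }
puts result
```

Because the inner pattern only ever sees one tag at a time, it cannot run on into a following <a> tag, and everything outside the <area> tags is left untouched.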

> Thanks for the help though!

Well, you found out for yourself ;)

Cheers
Robert