Is there a way to abandon a gsub if you're using a block?

weyus · June 25, 2009, 11:59pm

I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can’t quite get the regex working, but I will be
able to detect that it matched incorrectly, once I can inspect the
backreferences in the block.

So: if I determine via code in the substitution block that I don’t want
to do the substitution, is there a way to simply abandon the gsub call
on that particular iteration?

Wes

weyus · June 26, 2009, 12:10am

Wes G. writes:

I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can’t quite get the regex working, but I will be
able to detect that it matched incorrectly, once I can inspect the
backreferences in the block.

So: if I determine via code in the substitution block that I don’t want
to do the substitution, is there a way to simply abandon the gsub call
on that particular iteration?

Can’t you simply return the match as a no-op?

weyus · June 26, 2009, 12:34am

Hi –

On Fri, 26 Jun 2009, Wes G. wrote:

I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can’t quite get the regex working, but I will be
able to detect that it matched incorrectly, once I can inspect the
backreferences in the block.

So: if I determine via code in the substitution block that I don’t want
to do the substitution, is there a way to simply abandon the gsub call
on that particular iteration?

Just return the original:

"abcdef".gsub(/./) {|s| if s == 'e' then s else 'z' end }
=> "zzzzez"

David
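
For reference, the "return the match" idea combines naturally with inspecting backreferences inside the block. Here is a minimal sketch, assuming a made-up input string and pattern (neither comes from this thread): the block reads $1 and $2 for the current match and hands back the matched text itself whenever it decides not to substitute.

```ruby
# Hypothetical input and pattern, chosen only for illustration.
text = "10 USD, 20 kg, 30 USD"

result = text.gsub(/(\d+) (\w+)/) do |match|
  amount, unit = $1, $2      # backreferences for the current match
  if unit == "USD"
    "#{amount} dollars"      # substitute
  else
    match                    # return the match itself: text is unchanged
  end
end

puts result
```

Strictly speaking gsub still replaces those matches, just with identical text. In a longer block, `next match` is one way to bail out early for a single iteration.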

weyus · June 26, 2009, 10:27am

2009/6/25 Wes G.:

I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can’t quite get the regex working, but I will be

If you provide more detail about the input and the text that you want
to match we might be able to help fix the regular expression. IMHO
that approach is superior to simply returning the match effectively
replacing it with itself (which does work of course).

Kind regards

robert

weyus · June 26, 2009, 6:14pm

Robert K. wrote:

If you provide more detail about the input and the text that you want
to match we might be able to help fix the regular expression. IMHO
that approach is superior to simply returning the match effectively
replacing it with itself (which does work of course).

self.html.gsub!(/<a\s+?[^>]*?href=(['"])  #<a up to and including href=' or href="
                (?!mailto:)(.*?)          #Contents of any non-mailto: href attribute
                \1.*?>                    #End of href attribute (same quote) + arbitrary text to end of opening tag
                (.*?)                     #Contents of <a> - the "link display"
                <\?/a>/mix) {             #Closing tag, allowing for optional , e.g. or </a>

So, this regex is attempting to pull out the contents of an href in a
tag, as well as the content enclosed by the tag.

The problem comes when it encounters a particularly nefarious kind of
HTML which looks like this:

……

and there is no closing </a> for the first anchor. What I want to pull
is the valid <a> tag “on the inside”, but what I get is the first <a>
tag up to the closing </a> tag, which is not correct. The problem is
that the first <a> tag just shouldn’t be there at all.

So I need to modify my regex to not match if there is an <a> tag inside
of another one. I tried for about 30 minutes yesterday using a (?!)
assertion, but couldn’t quite get it.

Thanks,
Wes

weyus · June 26, 2009, 8:58pm

Robert,

Many thanks,

Wes

weyus · June 26, 2009, 8:51pm

On 26.06.2009 18:14, Wes G. wrote:

The problem comes when it encounters a particularly nefarious kind of
HTML which looks like this:

……

and there is no closing </a> for the first anchor. What I want to pull
is the valid <a> tag “on the inside”, but what I get is the first <a>
tag up to the closing </a> tag, which is not correct. The problem is
that the first <a> tag just shouldn’t be there at all.

Another way to put it is that you want to match <a>…</a> without any
intermediate <a>.

So I need to modify my regex to not match if there is an <a> tag inside
of another one. I tried for about 30 minutes yesterday using a (?!)
assertion, but couldn’t quite get it.

So the basic pattern here is that you want to match a combination A…B
without any A in between.

We try with a simple example:

irb(main):005:0> s = '…A;;A+++B'
=> "…A;;A+++B"
irb(main):006:0> s.scan %r{A(?:.(?!A))+B}
=> ["A+++B"]

Now with an HTML-like string:

irb(main):008:0> t = s.gsub(/A/, '<a href="foo">').gsub(/B/, '</a>')
=> "…<a href=\"foo\">;;<a href=\"foo\">+++</a>"
irb(main):017:0> t.scan %r{<a(?:\s+\w+=["'][^"']*["'])*>(?:.(?!<a))*?</a>}i
=> ["<a href=\"foo\">+++</a>"]

A bit more readable:

irb(main):024:0> t.scan %r{
irb(main):025:0/ <a(?:\s+\w+=["'][^"']*["'])*> # opening tag
irb(main):026:0/ (?:.(?!<a))*? # between <a> and </a>
irb(main):027:0/ </a> # closing tag
irb(main):028:0/ }mix
=> ["<a href=\"foo\">+++</a>"]

The trick is to have a negative lookahead assertion on each character
between the beginning and ending sequence, thus avoiding a match if the
opening sequence appears anywhere in between.

Kind regards

robert
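
The "A…B without any A in between" idea can be tried outside irb with a short script. This is a sketch only: the sample strings are made up, and the second pattern is a variant that does not appear in this thread. It moves the lookahead in front of the dot, so the very first character after the opening tag is checked as well.

```ruby
# Pattern as posted: the lookahead runs after each consumed character.
after_each  = %r{<a(?:\s+\w+=["'][^"']*["'])*>(?:.(?!<a))*?</a>}mi
# Variant (an assumption, not from the thread): look ahead before consuming.
before_each = %r{<a(?:\s+\w+=["'][^"']*["'])*>(?:(?!<a).)*?</a>}mi

# Hypothetical inputs: an unclosed anchor followed by a valid one.
spaced   = '...<a href="foo">;;<a href="foo">+++</a>'   # text between the two tags
adjacent = '<a href="x"><a href="y">link</a>'           # second tag starts immediately

p spaced.scan(after_each)     # ["<a href=\"foo\">+++</a>"]
p spaced.scan(before_each)    # ["<a href=\"foo\">+++</a>"]
p adjacent.scan(after_each)   # ["<a href=\"x\"><a href=\"y\">link</a>"]
p adjacent.scan(before_each)  # ["<a href=\"y\">link</a>"]
```

Both patterns agree when something sits between the two opening tags. They differ only when the second <a starts right after the first tag's >, because in the posted form the first character after the opening tag is consumed without being tested.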

weyus · June 26, 2009, 9:01pm

So the way to read this:

(?:.(?!<a))*?

would be

“match on any character as long as it isn’t followed by a ‘<a’”

Why do you need the positive lookahead assertion though - to ensure that
the characters aren’t consumed in case of a bad match?

Wes

weyus · June 27, 2009, 2:41pm

On 26.06.2009 21:01, Wes G. wrote:

So the way to read this:

(?:.(?!<a))*?

would be

“match on any character as long as it isn’t followed by a ‘<a’”

Exactly.

Why do you need the positive lookahead assertion though - to ensure that
the characters aren’t consumed in case of a bad match?

What positive lookahead?

Kind regards

robert

weyus · June 27, 2009, 6:24pm

My mistake - (?:...) is just a group that doesn’t generate
backreferences; I thought it was a positive lookahead.
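
The three constructs that got mixed up here can be compared side by side. This is a small sketch on a made-up string, not code from the thread: (?:...) groups without capturing, (?=...) is the positive lookahead, and (?!...) is the negative one.

```ruby
s = "foobar"   # hypothetical sample string

non_capturing = s.match(/foo(?:bar)/)  # consumes "bar" but creates no group
positive      = s.match(/foo(?=bar)/)  # requires "bar" next, consumes nothing
negative      = s.match(/foo(?!bar)/)  # requires that "bar" does NOT follow

p non_capturing[0]        # "foobar"
p non_capturing.captures  # []
p positive[0]             # "foo"
p negative                # nil
```

The lookaheads never become part of the matched text, which is why (?:.(?!<a)) consumes exactly one character per repetition.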