Regex negative look-behind bug?

okkezSS · November 23, 2010, 3:36pm

irb, Ruby 1.9.1

What am I missing here?

"b T T W b".match(/(?<!t t|a b) w/i)
=> nil

# The second look-behind alternative is now just a
"b T T W b".match(/(?<!t t|a) w/i)
=> #<MatchData " W">

# Regex stays the same, the T T are now in lower case
"b t t W b".match(/(?<!t t|a) w/i)
=> nil

# Look-behind only contains the t t condition now, and T T are back to upper case
"b T T W b".match(/(?<!t t) w/i)
=> nil

rubynut · November 23, 2010, 4:39pm

On Tue, Nov 23, 2010 at 4:36 PM, Ruby N. [email protected] wrote:

# Regex stays the same, the T T are now in lower case
"b t t W b".match(/(?<!t t|a) w/i)
=> nil

# Look-behind only contains the t t condition now, and T T are back to upper case
"b T T W b".match(/(?<!t t) w/i)
=> nil

No bug here. It is doing exactly what you asked: only match a w if it is
not preceded by ‘t t’. In all cases the w is preceded by ‘t t’, and in the
case that did match (?<!t t|a), the w was preceded by a ‘t t’ but not an
‘a’, as you asked, so it did match.

"b Y T W b".match( /(?<!t t) w/i )
=> #<MatchData " W">

Regards,
Ammar

rubynut · November 23, 2010, 4:55pm

On Tue, Nov 23, 2010 at 4:36 PM, Ammar A. [email protected] wrote:

"b T T W b".match(/(?<!t t|a) w/i)

No bug here. It is doing exactly what you asked: only match a w if it is not
preceded by ‘t t’. In all cases the w is preceded by ‘t t’, and in the case
that did match (?<!t t|a), the w was preceded by a ‘t t’ but not an ‘a’, as
you asked, so it did match.

That was an alternative! If the RX in the lookbehind can match, the
negative lookbehind must fail IMHO.
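
A minimal sketch of that reading of the alternation, kept case sensitive on purpose so the /i question stays out of the picture. The constant name and the second and third sample strings are made up for illustration.

```ruby
# Negative look-behind with alternation: the assertion fails as soon as
# ANY alternative matches the text immediately before the current position.
NOT_AFTER_TT_OR_A = /(?<!t t|a) w/

samples = ["b t t w b", "b a w b", "b x w b"]
samples.each do |s|
  m = NOT_AFTER_TT_OR_A.match(s)
  # Prints the input and the matched text, or nil when the assertion blocks it.
  puts format("%-12p %p", s, m && m[0])
end
```

Only the last string should produce a match, because it is the only one where neither "t t" nor "a" sits directly in front of " w".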

There is a problem with the match though. I suspect there is an issue
with case sensitivity propagation.

irb(main):009:0> "b T T W b".match(/(?<!t t|a) w/i)
=> #<MatchData " W">
irb(main):010:0> "b T T W b".match(/(?i:<!t t|a) w/i)
=> nil

irb(main):013:0> RUBY_VERSION
=> "1.9.1"
irb(main):014:0> RUBY_PATCHLEVEL
=> 430

Kind regards

robert

rubynut · November 23, 2010, 5:13pm

On Tue, Nov 23, 2010 at 5:55 PM, Robert K. [email protected] wrote:

That was an alternative! If the RX in the lookbehind can match, the
negative lookbehind must fail IMHO.

The thing is, what’s in the lookbehind (and in all assertions, for that
matter) is not really a regular expression. It is a fixed length literal.
The only exception, AFAIK, is character sets, because they are also fixed
length. The engine needs to know how many characters to step back and
examine.

Also the first alternative that matches wins. Here it is in lower case
and without ignoring case:

"b t t w b".match( /(?<!t t|a) w/ )
=> nil

There is a problem with the match though. I suspect there is an issue
with case sensitivity propagation

irb(main):009:0> "b T T W b".match(/(?<!t t|a) w/i)
=> #<MatchData " W">
irb(main):010:0> "b T T W b".match(/(?i:<!t t|a) w/i)
=> nil

That’s not a valid assertion any more; it is now an options
specification.

"b <!t t w b".match( /(?i:<!t t|a) w/ )
=> #<MatchData "<!t t w">
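
To make the difference concrete, here is a small sketch that puts the two spellings side by side. The constant names are mine; the patterns and the sample string are the ones from this exchange.

```ruby
# (?i:...) is a non-capturing group with the i option switched on,
# so "<!t t" inside it is plain literal text, not a look-behind.
OPTIONS_GROUP = /(?i:<!t t|a) w/
# (?<!...) is the actual negative look-behind.
LOOK_BEHIND   = /(?<!t t|a) w/

text = "b <!t t w b"
p OPTIONS_GROUP.match(text)  # consumes the literal characters "<!t t w"
p LOOK_BEHIND.match(text)    # " w" is preceded by "t t", so nothing matches
```

Moving the "?" and the "<" by one position changes the construct entirely, which is easy to miss when reading the pattern.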

irb(main):013:0> RUBY_VERSION
=> "1.9.1"
irb(main):014:0> RUBY_PATCHLEVEL
=> 430

I initially tried the cases with 1.9.2, but I tried the above with the
latest 1.9.1 on my system (a bit older).

RUBY_VERSION
=> "1.9.1"
RUBY_PATCHLEVEL
=> 378

Regards,
Ammar

rubynut · November 24, 2010, 11:57am

On Tue, Nov 23, 2010 at 5:12 PM, Ammar A. [email protected] wrote:

“b T T W b”.match(/(?<!t t|a b) w/i)
#Look-behind only contains the t t condition now and, T T are back to
you asked, so it did match.

That was an alternative! If the RX in the lookbehind can match, the
negative lookbehind must fail IMHO.

The thing is what’s in the lookbehind, and all assertions for that matter,
is not really a regular expression. It is a fixed length literal. The only
exception, AFAIK, is character sets because they are also fixed length. The
engine needs to know how many characters to step back and examine.

Docs say that the regexp cannot be unlimited. But it is by far not
only a fixed length literal. “|” is certainly meta in an assertion -
the second line would not match if the lookbehind assertion was a
literal.

str = ["bc", "abc", "a|bc", "a\|bc"]
rxs = [/(?<=ab)c/, /(?<=a|b)c/, /(?<=a|b)c/]

str.each do |s|
  rxs.each do |r|
    printf "%-10s %-15p %p\n", s, r, s.scan(r)
  end
end
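
A reduced sketch of the same point, contrasting an unescaped "|" with an escaped one inside a positive lookbehind. The escaped pattern and the constant names are my own additions for illustration, not something taken from the script above.

```ruby
# "|" inside a look-behind is alternation unless it is escaped.
ALTERNATION = /(?<=a|b)c/    # "c" preceded by "a" OR by "b"
LITERAL_BAR = /(?<=a\|b)c/   # "c" preceded by the three characters "a|b"

["bc", "abc", "a|bc"].each do |s|
  # Columns: input, result with alternation, result with the literal bar.
  printf "%-8p %-8p %p\n", s, s.scan(ALTERNATION), s.scan(LITERAL_BAR)
end
```

If the lookbehind were treated as a plain literal, "bc" and "abc" could not match the first pattern at all, since neither contains a "|" character.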


Docs even say “In negative-look-behind, captured group isn’t allowed,
but shy group(?:) is allowed.” So it’s a regexp, albeit a limited one.

http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

irb(main):009:0> "b T T W b".match(/(?<!t t|a) w/i)
=> #<MatchData " W">
irb(main):010:0> "b T T W b".match(/(?i:<!t t|a) w/i)
=> nil

That’s not a valid assertion any more, it is now an options specification.

"b <!t t w b".match( /(?i:<!t t|a) w/ )
=> #<MatchData "<!t t w">

Right, apparently we cannot have options in assertions.

=> “1.9.1”

RUBY_PATCHLEVEL
=> 378

The root issue still exists

irb(main):014:0> "a ac".scan /(?<!a a|b)c/i
=> []
irb(main):015:0> "A Ac".scan /(?<!a a|b)c/i
=> ["c"]
irb(main):016:0> "ac".scan /(?<!a|b)c/i
=> []
irb(main):017:0> "Ac".scan /(?<!a|b)c/i
=> []

Statement 15 should not yield any results, just as statement 17 does not.
Apparently /i breaks if there is an alternative (“|”) in conjunction with
more than one character in one alternative:

Fails (more than 1 char AND alternative):

irb(main):018:0> "aac".scan /(?<!aa|b)c/i
=> []
irb(main):019:0> "AAc".scan /(?<!aa|b)c/i
=> ["c"]
irb(main):020:0> "Aac".scan /(?<!aa|b)c/i
=> ["c"]
irb(main):021:0> "aAc".scan /(?<!aa|b)c/i
=> ["c"]

Works (more than 1 char OR alternative):

irb(main):022:0> "aac".scan /(?<!aa)c/i
=> []
irb(main):023:0> "aAc".scan /(?<!aa)c/i
=> []
irb(main):024:0> "Aac".scan /(?<!aa)c/i
=> []
irb(main):025:0> "AAc".scan /(?<!aa)c/i
=> []
irb(main):026:0> "ac".scan /(?<!a)c/i
=> []
irb(main):027:0> "Ac".scan /(?<!a)c/i
=> []
irb(main):028:0> "ac".scan /(?<!a|b)c/i
=> []
irb(main):029:0> "Ac".scan /(?<!a|b)c/i
=> []

IMHO this is a bug.
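
The failing cases above can be folded into a small check script, sketched here for anyone who wants to try it on their own interpreter. The constant and method names are hypothetical. On an interpreter without the problem described above I would expect every line to say "ok", but treat that as an assumption and read the output rather than trusting it.

```ruby
# Every input ends in "c" preceded by some casing of "aa", so a
# case-insensitive (?<!aa|b) should reject the "c" each time.
PATTERN = /(?<!aa|b)c/i
INPUTS  = %w[aac AAc Aac aAc]

# Maps each input to whatever scan finds for it.
def look_behind_results
  INPUTS.to_h { |s| [s, s.scan(PATTERN)] }
end

look_behind_results.each do |input, found|
  status = found.empty? ? "ok" : "unexpected match"
  puts format("%-6p %-8p %s", input, found, status)
end
puts "Ruby #{RUBY_VERSION}"
```

Printing RUBY_VERSION alongside the results makes it easier to compare runs across installations, as was done by hand earlier in the thread.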

Kind regards

robert

rubynut · November 24, 2010, 1:37pm

On Wed, Nov 24, 2010 at 12:57 PM, Robert K. [email protected] wrote:

Docs say that the regexp cannot be unlimited. But it is by far not
only a fixed length literal. “|” is certainly meta in an assertion -
the second line would not match if the lookbehind assertion was a
literal.

Yes, please excuse the terseness of my last response. I wrote it as I was
rushing out the door.

What I meant, but did not properly clarify, is that the contents of
assertions are not full expressions. They cannot contain quantifiers, they
cannot contain captures, and they cannot include backreferences or
anything else that can complicate determining the length of the contents.
Obviously alternation is allowed, since that’s what we were discussing,
but only as long as the alternatives abide by these limitations.

Ruby’s regular expression engine is quite flexible in this regard, as it
allows the alternatives to be of different lengths, unlike some other
engines that require them to be of the same length.
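
A rough way to probe these limits is to ask the engine whether a pattern compiles at all. The helper name compiles? and the candidate patterns below are hypothetical, chosen only to illustrate the point. I would expect the unbounded quantifier to be rejected on the Ruby versions discussed here, but that is an assumption, so check the output on your own interpreter.

```ruby
# Returns true if +source+ compiles as a Ruby regexp, false if the engine
# rejects it. Regexp.new raises RegexpError for patterns it cannot compile.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

candidates = {
  "alternatives of different lengths" => '(?<!aa|b)c',
  "character set"                     => '(?<![ab])c',
  "unbounded quantifier"              => '(?<!a+)c',
}

candidates.each do |label, source|
  puts format("%-35s %-14s compiles: %p", label, source, compiles?(source))
end
```

Building the patterns from strings with Regexp.new, instead of writing literals, keeps a rejected pattern from stopping the whole script with a syntax error.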

----8<----

----8<----

IMHO this is a bug.

OK, now that we’ve eliminated the syntax and the double-negative
confusion, I see the issue clearly. Thank you for your patience.

It might be a bug, but since the contents of assertions do not go through
the full eval/exec cycle of “regular” regular expressions, this could be
just another limitation of assertions. It might be difficult to figure out
the last options in effect because they can be inserted multiple times in
an expression, on their own (applying from that point on), and they can be
nested. Which of these would be used? Maybe just use the top level
options? That can potentially introduce more confusion.

Anyway, it’s definitely worth reporting. Worst case, we’ll find out it’s a
limitation, and best case, it will end up being a feature request, if not
a bug.

Is the OP able/willing to report this?

http://redmine.ruby-lang.org/

Regards,
Ammar

rubynut · November 24, 2010, 5:37pm

Ammar, Robert,

Thank you both for your healthy discussions. I’m glad that I’m not crazy
and you guys agree that it’s probably a bug or a very very special
feature

You guys understand the underlying issue and implications much better
than I do. I think it’d be better if one of you reported this instead of
me. Please don’t fight over it.

Thanks again.

rubynut · November 25, 2010, 9:41am

On Wed, Nov 24, 2010 at 5:37 PM, Ruby N. [email protected] wrote:

Thank you both for your healthy discussions. I’m glad that I’m not crazy
and you guys agree that it’s probably a bug or a very very special
feature

You’re welcome.

You guys understand the underlying issue and implications much better
than I do. I think it’d be better if one of you reported this instead of
me. Please don’t fight over it.

Done.

http://redmine.ruby-lang.org/issues/show/4088

Cheers

robert

rubynut · November 24, 2010, 3:00pm

On Wed, Nov 24, 2010 at 1:32 PM, Ammar A. [email protected] wrote:

The engine needs to know how many characters to step back and examine.

Docs say that the regexp cannot be unlimited. But it is by far not
only a fixed length literal. “|” is certainly meta in an assertion -
the second line would not match if the lookbehind assertion was a
literal.

Yes, please excuse the terseness of my last response. I wrote it as I was
rushing out the door.

Probably not the best thing to do. I know. It has happened to me as well.

I see the issue clearly. Thank you for your patience

YWC.

It might be a bug, but since the contents of assertions do not go through
the full eval/exec cycle of “regular” regular expressions, this could be
just another limitation of assertions. It might be difficult to figure out
the last options in effect because they can be inserted multiple times in an
expression, on their own (from here on) and they can be nested. Which of
these would be used? Maybe just use the top level options? That can
potentially introduce more confusion.

I don’t see any difference from how options are determined for other
grouping constructs: the innermost surrounding flags should be used. Any
other rule would be extremely confusing.

irb(main):002:0> "aBc".scan /(?i:a(?:b)c)/
=> ["aBc"]

irb(main):005:0> "abCde".scan /(?-i:a(?i:b(?:c)d)e)/i
=> ["abCde"]
irb(main):008:0> "abCde".scan /a(?i:b(?:c)d)e/
=> ["abCde"]
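
The innermost-flag rule also works in the other direction, where an inner group switches case folding off again. A short sketch with made-up constant names and inputs:

```ruby
# The innermost enclosing option group decides case sensitivity.
OUTER_I_INNER_OFF = /a(?-i:b)c/i   # only the "b" is case sensitive
OUTER_OFF_INNER_I = /a(?i:b)c/     # only the "b" ignores case

%w[abc AbC ABC aBc].each do |s|
  # Columns: input, match with inner (?-i:), match with inner (?i:).
  puts format("%-6p %-6p %p", s, OUTER_I_INNER_OFF.match?(s), OUTER_OFF_INNER_I.match?(s))
end
```

An assertion that followed the same rule would simply inherit whatever option group or trailing flag encloses it most closely.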

Anyway, it’s definitely worth reporting. Worst case, we’ll find out it’s a
limitation, and best case, it will end being a feature request, if not a
bug.

I vote for “bug”.

Is the OP able/willing to report this?

http://redmine.ruby-lang.org/

Please, do.

Cheers

robert

rubynut · November 25, 2010, 1:46pm

On Wed, Nov 24, 2010 at 6:37 PM, Ruby N. [email protected] wrote:

Ammar, Robert,

Thank you both for your healthy discussions. I’m glad that I’m not crazy
and you guys agree that it’s probably a bug or a very very special
feature

You’re welcome. I’m fascinated by Oniguruma (Ruby’s regex engine), so
this is much fun for me.

Regards,
Ammar

rubynut · November 29, 2010, 9:31am

On Thu, Nov 25, 2010 at 1:50 PM, Ammar A. [email protected] wrote:

On Thu, Nov 25, 2010 at 10:41 AM, Robert K. [email protected] wrote:

Done.

http://redmine.ruby-lang.org/issues/show/4088

I’m glad you reported it because I’m still on the fence about it being a
bug. I think the negative match (when it matches it doesn’t) and alternation
are confusing in this case.

It’s fixed already.

http://redmine.ruby-lang.org/issues/show/4088

Cheers

robert

rubynut · November 25, 2010, 1:51pm

On Thu, Nov 25, 2010 at 10:41 AM, Robert K. [email protected] wrote:

Done.

http://redmine.ruby-lang.org/issues/show/4088

Thanks. I was too busy to follow up yesterday.

I’m glad you reported it because I’m still on the fence about it being a
bug. I think the negative match (when it matches it doesn’t) and
alternation are confusing in this case.

IMHO, the following examples prove that ignoring case works as expected,
but it’s difficult to verify this when alternation is added to the mix.

Expected, not ignoring case

"abcd" =~ /(?<!bc)d/
=> nil

Expected, case differs, and it’s not being ignored

"aBcd" =~ /(?<!bc)d/
=> 3

Expected, case differs, but it’s being ignored

"aBcd" =~ /(?<!bc)d/i
=> nil

Adding alternation is playing a part in either making it hard to tell
which part is matching, or not respecting the i option.
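
One way to take the alternation out of the picture, offered here as my own suggestion and not as something proposed in the thread, is to split it into separate assertions. A negative lookbehind over "X|Y" and two consecutive negative lookbehinds over "X" and "Y" state the same condition, so each part can be checked on its own. The constant names and the extra inputs "bc" and "xxc" are made up.

```ruby
# (?<!X|Y) and (?<!X)(?<!Y) state the same condition: neither X nor Y may
# end at the current position. Splitting them lets each part be tested alone.
COMBINED = /(?<!aa|b)c/i
SPLIT    = /(?<!aa)(?<!b)c/i

%w[aac AAc Aac aAc bc xxc].each do |s|
  puts format("%-6p combined=%-8p split=%p", s, s.scan(COMBINED), s.scan(SPLIT))
end
```

If the two columns ever disagree for the same input, the alternation handling is the part to look at, because the split form relies only on the single-alternative behaviour shown above.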

Thanks again,
Ammar

rubynut · November 29, 2010, 10:04am

On Mon, Nov 29, 2010 at 10:22 AM, Robert K. [email protected] wrote:

are confusing in this case.

It’s fixed already.

http://redmine.ruby-lang.org/issues/show/4088

Hallelujah!

Cheers,
Ammar