Possible regular expression

Ruby’s regular expression engine appears to act incorrectly when given a
non-greedy match-range of the form {m,n}?

Take this example:

"Age: 21" =~ /Age.{0,60}: ([\w]+)/

This returns 0, as expected, and $1 is set to "21".

However:

"Age: 21" =~ /Age.{0,60}?: ([\w]+)/

This returns nil and $1 is set to nil.

I believe the greedy and non-greedy cases should be equivalent in this
case, but are not.
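
A minimal sketch of why the two forms should agree here, assuming a correctly behaving engine: the only way either pattern can match "Age: 21" is for .{0,60} to consume zero characters, since ": " follows "Age" immediately. The greedy form reaches that by backtracking, while the non-greedy form tries it first. The variable names are illustrative only.

```ruby
s = "Age: 21"

greedy = /Age.{0,60}: (\w+)/    # tries the longest run first, then backtracks
lazy   = /Age.{0,60}?: (\w+)/   # tries the shortest run (zero chars) first

greedy_match = greedy.match(s)
lazy_match   = lazy.match(s)

# On an engine without the bug, both should report the same position and capture.
[["greedy", greedy_match], ["lazy", lazy_match]].each do |label, m|
  if m
    puts "#{label}: matched at #{m.begin(0)}, capture #{m[1].inspect}"
  else
    puts "#{label}: no match"
  end
end
```

If the two lines of output differ, the engine is treating the non-greedy range differently from the greedy one.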

I’ve included a tarball with two files, one written in perl and the
other in ruby, performing this match. The perl script acts as expected.

Apologies if this is a known bug that I have been unable to find on
RubyForge, or if this is expected behavior. If it is the former I would
appreciate anyone who could point me at the bug listing, and if it is
the latter, I would appreciate enlightenment on the reason for this
behavior. Otherwise, I would appreciate it if someone could
verify that this behavior is a bug, and I will file it.

Thanks

On May 5, 7:42 pm, James S. wrote:

"Age: 21" =~ /Age.{0,60}?: ([\w]+)/

This returns nil and $1 is set to nil.

This seems like a bug, given:
s = "Age: 21"
s =~ /Age.*: (\w+)/ #=> 0
s =~ /Age.*?: (\w+)/ #=> 0
s =~ /Age.{0,60}: (\w+)/ #=> 0
s =~ /Age.{0,60}?: (\w+)/ #=> nil

(Perhaps you were paring down a real-world testcase; did you know
that you can simply use \w+ instead of [\w]+ to match one-or-more-word-
characters? And that \d may be more appropriate, matching only digit
characters?)

My simple experiments make me believe this is an edge case
specifically when:
a) a non-greedy range
b) that is matching any-char
c) has a lower-limit of 0
d) and must match 0 times to succeed.

Here’s my test data, with analysis following.

s = "abbc"
%w|
  ab{1,9}c ab{1,9}?c
  abb{1,9}c abb{1,9}?c
  abbb{1,9}c abbb{1,9}?c
  ab{0,9}c ab{0,9}?c
  abb{0,9}c abb{0,9}?c
  abbb{0,9}c abbb{0,9}?c
  a.{1,9}c a.{1,9}?c
  ab.{1,9}c ab.{1,9}?c
  abb.{1,9}c abb.{1,9}?c
  a.{0,9}c a.{0,9}?c
  ab.{0,9}c ab.{0,9}?c
  abb.{0,9}c abb.{0,9}?c
|.each_with_index{ |pattern,i|
  regex = Regexp.new( pattern )
  puts "%2i %-15s %s" % [
    i, regex.inspect, (s =~ regex).inspect
  ]
}

#=> 0 /ab{1,9}c/ 0
#=> 1 /ab{1,9}?c/ 0
#=> 2 /abb{1,9}c/ 0
#=> 3 /abb{1,9}?c/ 0
#=> 4 /abbb{1,9}c/ nil
#=> 5 /abbb{1,9}?c/ nil
#=> 6 /ab{0,9}c/ 0
#=> 7 /ab{0,9}?c/ 0
#=> 8 /abb{0,9}c/ 0
#=> 9 /abb{0,9}?c/ 0
#=> 10 /abbb{0,9}c/ 0
#=> 11 /abbb{0,9}?c/ 0
#=> 12 /a.{1,9}c/ 0
#=> 13 /a.{1,9}?c/ 0
#=> 14 /ab.{1,9}c/ 0
#=> 15 /ab.{1,9}?c/ 0
#=> 16 /abb.{1,9}c/ nil
#=> 17 /abb.{1,9}?c/ nil
#=> 18 /a.{0,9}c/ 0
#=> 19 /a.{0,9}?c/ 0
#=> 20 /ab.{0,9}c/ 0
#=> 21 /ab.{0,9}?c/ 0
#=> 22 /abb.{0,9}c/ 0
#=> 23 /abb.{0,9}?c/ nil

In the above, we would expect patterns 4, 5, 16 and 17 to fail, but
not 23.

Notable is that pattern #15 succeeds (showing that a non-greedy range
matching any-char can match a lower-limit number of times) and that
pattern #11 succeeds (showing that a non-greedy range matching a
specific char can match zero times).
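
As a rough way to spot this kind of discrepancy automatically, the sketch below pairs each greedy pattern from the table with its non-greedy twin and reports any pair that disagrees on whether "abbc" matches at all. It assumes that, for simple patterns like these, greediness should only change which match is preferred and never whether one exists. The names bases and mismatches are made up for illustration.

```ruby
s = "abbc"

# Greedy forms only; the non-greedy twin is derived by appending "?" to the range.
bases = %w[
  ab{1,9}c  abb{1,9}c  abbb{1,9}c
  ab{0,9}c  abb{0,9}c  abbb{0,9}c
  a.{1,9}c  ab.{1,9}c  abb.{1,9}c
  a.{0,9}c  ab.{0,9}c  abb.{0,9}c
]

# Keep the patterns whose greedy and non-greedy forms disagree on match success.
mismatches = bases.select do |greedy|
  lazy = greedy.sub("}", "}?")
  greedy_hit = !(s =~ Regexp.new(greedy)).nil?
  lazy_hit   = !(s =~ Regexp.new(lazy)).nil?
  greedy_hit != lazy_hit
end

if mismatches.empty?
  puts "greedy and non-greedy forms agree for every pattern"
else
  mismatches.each { |p| puts "disagreement: /#{p}/ vs /#{p.sub('}', '}?')}/" }
end
```

Going by the table above, the only disagreement it should report on the affected interpreter is the pair /abb.{0,9}c/ and /abb.{0,9}?c/ (patterns 22 and 23).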

On May 5, 7:42 pm, James S. wrote:

Ruby’s regular expression engine appears to act incorrectly when given a
non-greedy match-range of the form {m,n}?

I forgot to note, in my previous reply, that my test results are
against 1.8.6:
ruby 1.8.6 (2007-09-24 patchlevel 111) [i686-darwin9.1.0]

Ruby v1.9 (using a different regexp engine, “Oniguruma”) does not
suffer from the same problem.

On May 5, 8:59 pm, Phrogz wrote:

suffer from the same problem.
Rubinius and JRuby don’t seem to suffer from it either.

Chris

Thank you, Gavin and Chris, for your verification. Gavin, you are right
that it is pared down from a real problem where a character class and
alphanumerics were necessary; thank you for your much better examples.
I’ll file a bug report against 1.8.6.

-James