Regex ^ beginning not strong?

Hi,

I have some more regex questions. I wrote a pattern to check for valid
regexes and inspect their parts (we all have our reasons for the things we
do :). It wasn’t working, so I reduced it to simpler and simpler patterns,
but I’m a bit surprised at the way Ruby 1.9 handles them. I
tested the same pattern in Perl and it gave the answers I’d
expect.

Is this down to me using perl regexes for so long, or is there something
I’m missing about Ruby’s implementation? It appears ^ at the beginning
of a string doesn’t bind as strongly as I’d expect.
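
As an aside on what ^ anchors to: in Ruby, ^ matches at the start of any line, while \A matches only at the start of the string (Perl's ^ without /m behaves like \A). The sketch below shows that difference on a made-up two-line string. It is not claimed to be the cause of the results that follow.

```ruby
# Hypothetical two-line input, chosen only to show the two anchors.
text = "first\n/second/"

caret_md  = /^\/(\w+)\//.match(text)   # ^ may also match right after a newline
string_md = /\A\/(\w+)\//.match(text)  # \A only matches at offset 0

puts caret_md.inspect
puts string_md.inspect
```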

I believe this test should fail, as <delim> should be bound to the
beginning of the string by the ^, and the match result is a little bit
crazy - shouldn’t the main capture be “d\d” if it’s following the
logical route it’s chosen?
$ ruby -e '
md =
/^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/.match( %q!/\d\d\d! )
puts md.inspect
'

#<MatchData "/\\d" mors:nil delim:"d" pat:"\\">

Here I add on a trailing slash to the string, and (I believe) it should
bring me back what’s between the / / :
$ ruby -e '
md =
/^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/.match( %q!/\d\d\d/! )
puts md.inspect
'

#<MatchData "/\\d" mors:nil delim:"d" pat:"\\">
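
One thing that may explain these results: in Ruby's (Oniguruma) syntax, \g<delim> is a subexpression call, which re-runs the group's pattern (here a single `.`), whereas the named backreference is written \k<delim>. Perl's \g{delim} is a backreference. The sketch below compares the two spellings on the same strings as above. The variable names are only for illustration.

```ruby
s_open   = %q!/\d\d\d!    # no closing delimiter
s_closed = %q!/\d\d\d/!   # closing delimiter present

# \g<delim> is a subexpression call: it re-runs the group's pattern `.`.
call_re = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/
# \k<delim> is a backreference: it must match the text delim captured.
backref_re = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\k<delim>/

call_md        = call_re.match(s_open)
backref_open   = backref_re.match(s_open)
backref_closed = backref_re.match(s_closed)

puts call_md.inspect         # matches "/\d", as in the output above
puts backref_open.inspect    # nil: there is no second "/"
puts backref_closed.inspect  # delim is "/", pat is "\d\d\d"
```

If the backreference was what was intended, \k<delim> would be the Ruby counterpart of the Perl pattern.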

Here’s the first string in perl 5.12 :
$ perl -e '
if ( q(/\d\d\d) =~ /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
  while ( my ($key, $value) = each(%+) ) {
    print "$key => $value\n";
  }
}
'

<nothing here, what I’d expect>

And here it is with the “valid” string:
$ perl -e '
if ( q(/\d\d\d/) =~ /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
  while ( my ($key, $value) = each(%+) ) {
    print "$key => $value\n";
  }
}
'

pat => \d\d\d
delim => /

These are the answers I’d expect.

Even this seems unexpected to me: if I remove <mors>, then surely ^
should bind to the beginning?
$ ruby -e '
md = /^(?<delim>.)(?<pat>.+?)\g<delim>/.match(
%q!/\d\d\d/! )
puts md.inspect
'

#<MatchData "/\\d" delim:"d" pat:"\\">

These work as I’d expect by using the end of line $ :
$ ruby -e '
md = /^(?<delim>.)(?<pat>.+?)\g<delim>$/.match(
%q!/\d\d\d/! )
puts md.inspect
'

#<MatchData "/\\d\\d\\d/" delim:"/" pat:"\\d\\d\\d">

$ ruby -e '
md =
/^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>$/.match( %q!/\d\d\d/! )
puts md.inspect
'

#<MatchData "/\\d\\d\\d/" mors:nil delim:"/" pat:"\\d\\d\\d">

And finally, if I remove the caret but leave the $ I get the answer I’d
expect (or that I’m looking for) :
$ ruby -e '
md =
/(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>$/.match( %q!/\d\d\d/! )
puts md.inspect
'

#<MatchData "/\\d\\d\\d/" mors:nil delim:"/" pat:"\\d\\d\\d">
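
A possible reading of why adding $ appears to fix things: if \g<delim> is a subexpression call (re-running `.`) rather than a backreference, then with $ the lazy .+? simply grows until one arbitrary character remains before the end of the line. The sketch below tests that reading with a made-up string whose last character is not the opening delimiter. Under backreference semantics (\k<delim>) it should not match.

```ruby
# Hypothetical input: opens with "/" but ends with "X".
s = %q!/\d\d\dX!

call_md    = /^(?<delim>.)(?<pat>.+?)\g<delim>$/.match(s)
backref_md = /^(?<delim>.)(?<pat>.+?)\k<delim>$/.match(s)

puts call_md.inspect      # the whole string matches even without a closing "/"
puts backref_md.inspect   # nil
```

So the $-anchored patterns above may give the hoped-for captures only because the test string happens to end with the delimiter.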

Regards,
Iain

On 26 Jul 2010, at 21:40, Robert K. wrote:

Thanks for checking that. While searching for more information on the
Oniguruma engine I noticed that there is a CPAN module for running it
under Perl, so I installed it and ran the same regexes through Oniguruma
from Perl, and it gave the same results as Ruby. This indicates that it’s a
problem with the engine and not something Ruby is doing along the way,
so I’ll file a report with the Oniguruma team, include all your tests
too, and see what happens.

With Oniguruma:

$ perl -Mre::engine::Oniguruma -e '
if ( q(/\d\d\d/) =~ /^(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
  while ( my ($key, $value) = each(%+) ) {
    print "$key => $value\n";
  }
}
'

Usual Perl engine:

$ perl -e '
if ( q(/\d\d\d/) =~ /^(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
  while ( my ($key, $value) = each(%+) ) {
    print "$key => $value\n";
  }
}
'

pat => \d\d\d
delim => /

Regards,
Iain

On 07/26/2010 06:01 PM, Iain B. wrote:

something I’m missing about Ruby’s implementation? It appears ^ at
the beginning of a string doesn’t bind as strongly as I’d expect.

I believe this test should fail as <delim> should be bound to the
beginning of the string by the ^ , and the match result is a little
bit crazy - shouldn’t the main capture be “d\d” if it’s following
the logical route it’s chosen?
$ ruby -e '
md =
/^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/.match( %q!/\d\d\d! )
puts md.inspect
'
#<MatchData "/\\d" mors:nil delim:"d" pat:"\\">

I think you found a bug - probably related to back references to
named capturing groups:

irb(main):013:0> s = %q!/\d\d\d!
=> "/\\d\\d\\d"

irb(main):027:0> r = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)/
=> /^(?<mors>m)?(?<delim>.)(?<pat>.+?)/
irb(main):028:0> md = r.match s
=> #<MatchData "/\\" mors:nil delim:"/" pat:"\\">

This must not match at all:

irb(main):029:0> r = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/
=> /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/
irb(main):030:0> md = r.match s
=> #<MatchData "/\\d" mors:nil delim:"d" pat:"\\">

It seems to work better with numbered capturing groups

Normal greediness:

irb(main):035:0> r = /^(m)?(.)(.+)\2/
=> /^(m)?(.)(.+)\2/
irb(main):036:0> md = r.match s
=> nil

This works:

irb(main):038:0> /^(m)?(.)(.+)\2/.match 'abbba'
=> #<MatchData “abbba” 1:nil 2:“a” 3:“bbb”>

Maybe the numbering gets out of order if we try to mix:

irb(main):039:0> /^(?<delim>m)?(.)(.+)\2/.match 'abbba'
SyntaxError: (irb):39: numbered backref/call is not allowed. (use name):
/^(?<delim>m)?(.)(.+)\2/
        from /usr/local/bin/irb19:12:in `<main>'
irb(main):040:0> /^(?<delim>m)?(.)(.+)\k<2>/.match 'abbba'
SyntaxError: (irb):40: numbered backref/call is not allowed. (use name):
/^(?<delim>m)?(.)(.+)\k<2>/
        from /usr/local/bin/irb19:12:in `<main>'
irb(main):041:0>
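
The error above reflects a rule of Ruby's regex engine: once a pattern contains a named group, numbered backreferences are rejected and plain parentheses no longer capture. A minimal sketch follows. It uses Regexp.new so that the failure can be rescued at run time as a RegexpError instead of being raised as a SyntaxError when a literal is parsed. The group name `first` is made up for the illustration.

```ruby
# Build the mixed pattern from a string so the failure is a RegexpError.
error_message =
  begin
    Regexp.new('^(?<delim>m)?(.)(.+)\2')
    nil
  rescue RegexpError => e
    e.message
  end
puts error_message

# With a named group present, unnamed parentheses do not capture.
md = /(?<first>.)(.)/.match('xy')
puts md.names.inspect
puts md.captures.inspect
```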

irb(main):047:0> RUBY_VERSION
=> "1.9.1"
irb(main):048:0> RUBY_PATCHLEVEL
=> 376

Frankly, I have never used named capturing groups so far (simply out of
habit and for compatibility). That was probably a good choice.

Kind regards

robert