On Jun 22, 2007, at 6:55 AM, Mariusz Pękala wrote:
“”]
string after empty string.
increased, skipping one character, to prevent infinite loop of matching
nothing again.
This behaviour may be considered weird, or buggy, and probably the results
are not what was expected.
A great example, and one which I do consider to be buggy. A similar
example from Perl is something like:
$ perl -e '$h = "hello"; $h =~ s/.*?/[$&]/g; print "$h\n";'
[][h][][e][][l][][l][][o][]
It matches the empty string at the beginning, between each character,
and at the end, but it does consume the actual characters of the
string. Even if not what one would anticipate, it’s not too hard to
justify the result. (Something that can’t be said for Ruby’s
["", "", "", "", "", ""].)
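For comparison, a rough Ruby analogue of the Perl substitution above is String#gsub with a block. This is only a sketch and not something from the thread; the output noted in the comment is what a current Ruby gives, and I am assuming 1.8.5 behaves the same way.

```ruby
# Rough Ruby analogue of:
#   perl -e '$h = "hello"; $h =~ s/.*?/[$&]/g; print "$h\n";'
h = "hello"

# The block receives each matched substring; with /.*?/ every match is empty.
bracketed = h.gsub(/.*?/) { |m| "[#{m}]" }

puts bracketed  # []h[]e[]l[]l[]o[]
```

The characters survive in the result, but unlike Perl's [h], [e], ... they are never reported as matches; they are just copied through between the empty ones.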
The other versions from perl are enlightening:
$ perl -e '$h = "hello"; $h =~ s/.?/[$&]/g; print "$h\n";'
[h][e][l][l][o][]
$ perl -e '$h = "hello"; $h =~ s/.*/[$&]/g; print "$h\n";'
[hello][]
Both succeed in a zero-character match at the end. The equivalents in
Ruby (1.8.5) give the same results:
$ ruby -e 'puts "hello".scan(/.?/).inspect'
["h", "e", "l", "l", "o", ""]
$ ruby -e 'puts "hello".scan(/.*/).inspect'
["hello", ""]
I thought I’d see what Oniguruma (5.8.0; with 1.1.0 gem) had to say:
irb> require 'oniguruma'
=> true
irb> reluctant = Oniguruma::ORegexp.new('.*?')
=> /.*?/
irb> greedy = Oniguruma::ORegexp.new('.*')
=> /.*/
irb> greedyq = Oniguruma::ORegexp.new('.?')
=> /.?/
irb> reluctant.scan("hello")
=> [#<MatchData:0x10b9aa4>, #<MatchData:0x10b9a7c>, #<MatchData:0x10b9a68>,
#<MatchData:0x10b9a40>, #<MatchData:0x10b9a18>, #<MatchData:0x10b99f0>]
irb> reluctant.scan("hello").map{|md|md[0]}
=> ["", "", "", "", "", ""]
irb> greedy.scan("hello").map{|md|md[0]}
=> ["hello", ""]
irb> greedyq.scan("hello").map{|md|md[0]}
=> ["h", "e", "l", "l", "o", ""]
OK, the same results as the Ruby Regexp, including that .*? produces
[""]*6, which are the "before each character and at the end" locations
of the zero-length matches from Perl, but the individual single-byte
matches are missing.
I presume that there’s some justification for these behaviors, but I
can’t figure out what it might be.
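The skip-one-character rule described in the quoted text at the top can be sketched in plain Ruby. This is a hypothetical re-implementation for illustration (scan_sketch is my own name, not a library method), it ignores capture groups, and it assumes a Ruby new enough for Regexp#match to accept a start position (1.9 or later); it is not how String#scan is actually written.

```ruby
# Hypothetical sketch of a scan-style loop; not the real String#scan source.
def scan_sketch(str, re)
  results = []
  pos = 0
  while pos <= str.length
    md = re.match(str, pos)      # search from pos onward
    break unless md
    results << md[0]
    if md.end(0) == md.begin(0)
      # Empty match: step past one character so the loop cannot
      # match nothing at the same position forever.
      pos = md.end(0) + 1
    else
      pos = md.end(0)
    end
  end
  results
end

p scan_sketch("hello", /.*?/)  # six empty strings
p scan_sketch("hello", /.?/)   # each character, then one empty string
p scan_sketch("hello", /.*/)   # "hello", then one empty string
```

Under this rule the skipped character is never offered to the pattern again, which is where the single-character matches that Perl reports get lost.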
-Rob
But look at:
irb(main):038:0> "hello".scan /h(.*?)e/
=> [[""]]
irb(main):039:0> "hello".scan /h(.*?)(.*?)(.*?)(.*?)(.*?)e/
=> [["", "", "", "", ""]]
Here 'nothing' matches many times, and definitely this is the expected
behaviour.
I agree that those results are exactly what I’d expect.
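A small sketch of why the empty captures are unsurprising here: a reluctant group takes only as much as the rest of the pattern needs. The second pattern, ending in l, is my own addition for contrast and does not come from the thread.

```ruby
# Reluctant groups grow only as far as the remainder of the pattern requires.

# "e" follows "h" directly, so the group can stay empty.
empty_group = "hello".scan(/h(.*?)e/)

# Here the group has to grow to "e" before "l" can match.
one_char = "hello".scan(/h(.*?)l/)

p empty_group  # [[""]]
p one_char     # [["e"]]
```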
--
No virus found in this outgoing message.
Checked by 'grep -i virus $MESSAGE'
Trust me.
Rob B. http://agileconsultingllc.com