Bug in regex engine ? Must be

dxk3355 · March 3, 2008, 8:24pm

Hi,

I’m using Ruby 1.8.6, and I just discovered something rather
interesting, here is a test:

require 'test/unit'

class TestRegexBug < Test::Unit::TestCase

  def test_bug
    hours = "pon-čet"

    assert(hours =~ /[č]et/i)
    assert(hours =~ /čet/i)
    assert(hours =~ /-čet/i)
    assert(hours =~ /[cč]et/i)
    assert(hours =~ /-[č]et/i)   # the only assertion that fails
  end

end

As you can see, this only happens with Unicode letters (the last
assertion fails). I’m used to the fact that //i doesn’t work for
Unicode chars, and I already know that you need two dots to match one
of these. But this problem is different and weirder, because what
triggers it is a minus sign before the square brackets: if you remove
either the ‘-’ or the ‘[]’ from the regex, it works.

Can you comment?

thank you,
david

dxk3355 · March 3, 2008, 11:47pm

On Mar 3, 2008, at 2:24 PM, D. Krmpotic wrote:

Hi,

I’m using Ruby 1.8.6, and I just discovered something rather
interesting, here is a test:

$KCODE = 'UTF8'
require 'jcode'

Ruby is not natively aware of Unicode, but you can get all these to
pass if you give it the $KCODE hint.

-Rob

Rob B. http://agileconsultingllc.com
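
For readers on Ruby 1.9 or later (an assumption that goes beyond this thread, which is about 1.8.6): strings and regexes there carry an encoding, so the same five patterns should match without $KCODE or jcode. A minimal sketch, with č written as \u010D so it does not depend on the file's encoding; the names `patterns` and `results` are made up for illustration:

```ruby
# Sketch for Ruby 1.9+ only; on 1.8.6 use the $KCODE hint instead.
hours = "pon-\u010Det"   # "pon-čet"; \u010D is the single character č

# The five patterns from the original test, with č spelled as \u010D.
patterns = [
  /[\u010D]et/i,
  /\u010Det/i,
  /-\u010Det/i,
  /[c\u010D]et/i,
  /-[\u010D]et/i   # the one that failed on 1.8.6 without $KCODE
]

# =~ returns the character index of the match, or nil if there is none.
results = patterns.map { |re| hours =~ re }
p results
```

Every entry in `results` should be an index rather than nil, because the character class now holds one character instead of two separate bytes.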

dxk3355 · March 7, 2008, 10:30pm

Great info… completely forgot that this is available…
thank you
david

$KCODE = 'UTF8'
require 'jcode'

dxk3355 · March 8, 2008, 1:42am

2008/3/3, D. Krmpotic:

As you can see, this only happens with Unicode letters (the last
assertion fails). I’m used to the fact that //i doesn’t work for
Unicode chars, and I already know that you need two dots to match one
of these. But this problem is different and weirder, because what
triggers it is a minus sign before the square brackets: if you remove
either the ‘-’ or the ‘[]’ from the regex, it works.

In the regex, [č] is a character class containing two bytes, because
č is encoded in UTF-8 as \304\215. So Ruby tries to match a minus
followed by one of the bytes of “č”, followed by “et”. The regex
would therefore match “pon-\304et” or “pon-\215et”, but not
“pon-\304\215et”.

Stefan
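
The byte-level reading above can be reproduced on a modern Ruby by forcing byte-oriented matching. This is only a sketch under assumptions: it uses String#b and the /n regex flag (Ruby 2.0+) to imitate how 1.8.6 treats strings without $KCODE, it drops the /i flag for simplicity, and the variable names are illustrative:

```ruby
# č is U+010D; in UTF-8 it is the two bytes 0xC4 0x8D (\304\215 in octal).
p "\u010D".bytes

# Byte-oriented class: matches ONE byte, either 0xC4 or 0x8D.
byte_class = /-[\xC4\x8D]et/n

one_byte_a = "pon-\xC4et".b     # only the first byte of č
one_byte_b = "pon-\x8Det".b     # only the second byte of č
full_char  = "pon-\u010Det".b   # both bytes, i.e. the real string

# Index of the match for each candidate, or nil when there is no match.
matches = [one_byte_a, one_byte_b, full_char].map { |s| s =~ byte_class }
p matches

# Without the brackets the two bytes are matched in sequence, so this works.
p(full_char =~ /-\xC4\x8Det/n)

# "Two dots for one character": each dot consumes one byte.
p(full_char =~ /-..et/n)
```

The class matches the two one-byte strings but not the real string, which is the behaviour described for the failing assertion.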