Bug in regex engine? Must be

Hi,

I’m using Ruby 1.8.6, and I just discovered something rather
interesting, here is a test:

require 'test/unit'

class TestRegexBug < Test::Unit::TestCase

  def test_bug

    hours = "pon-čet"

    assert(hours =~ /[č]et/i)
    assert(hours =~ /čet/i)
    assert(hours =~ /-čet/i)
    assert(hours =~ /[cč]et/i)
    assert(hours =~ /-[č]et/i)

  end

end

As you can see, this only happens with Unicode letters (the last test
fails). I’m used to the fact that //i doesn’t work for Unicode chars,
and I already know that you need two dots to match one of these. But
this problem is different and weirder, because what triggers it is a
minus sign before the square brackets: if you remove either the ‘-’ or
the ‘[]’ from the regex, it works.

Can you comment?

thank you,
david

On Mar 3, 2008, at 2:24 PM, D. Krmpotic wrote:

Hi,

I’m using Ruby 1.8.6, and I just discovered something rather
interesting, here is a test:

$KCODE = 'UTF8'
require 'jcode'

Ruby is not natively aware of Unicode, but you can get all of these to
pass if you give it the $KCODE hint.

-Rob

Rob B. http://agileconsultingllc.com
[email protected]

Great info… completely forgot that this is available…
thank you
david


2008/3/3, D. Krmpotic [email protected]:

As you can see, this only happens with unicode letters… (the last test
fails)… I’m used to the fact that //i doesn’t work for unicode chars
and I already know that you need two dots to match one of these… But
this problem is different and weirder, because what triggers it is a
minus sign before the square brackets… if you remove either the ‘-’ or
‘[]’ from the regex, it works…

In the regex, [č] is a character class containing two bytes. So
Ruby tries to match a minus followed by one of the bytes
out of “č” followed by “et”. So the regex would match
“pon-\304et” or “pon-\215et”, but not “pon-\304\215et”.

Stefan
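
The byte-level behaviour described above can be reproduced as a rough sketch. It assumes Ruby 1.9 or later rather than the 1.8.6 setup from the thread: there, $KCODE no longer has any effect and strings carry their own encoding, so the byte-wise view is emulated here with binary strings (String#b) and the /n regex flag. The variable names are made up for illustration.

```ruby
# Sketch for Ruby 1.9+ (an assumption; the thread itself is about 1.8.6).
# "\xC4\x8D" is the UTF-8 encoding of "č", i.e. octal \304\215.
hours_bytes = "pon-\xC4\x8Det".b   # the string viewed as raw bytes

# With /n the class [\xC4\x8D] matches exactly ONE byte, as in 1.8 without $KCODE.
byte_class = /-[\xC4\x8D]et/n

full_string = hours_bytes =~ byte_class        # two bytes sit between "-" and "et"
first_byte  = "pon-\xC4et".b =~ byte_class     # only the first byte of "č" present
second_byte = "pon-\x8Det".b =~ byte_class     # only the second byte of "č" present

# Without the minus sign, the class can match the second byte on its own,
# which is why /[č]et/ appeared to work in the original test.
no_minus = hours_bytes =~ /[\xC4\x8D]et/n

# Character-aware matching: the class now holds one character, U+010D ("č").
hours = "pon-\u010Det"
char_class = hours =~ /-[\u010D]et/i

p full_string, first_byte, second_byte, no_minus, char_class
```

The last match corresponds to what the $KCODE hint achieves on 1.8: the regex engine reads the two bytes as a single character, so the minus sign in front of the class no longer matters.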