I’ve found some strange and unexpected behaviour to do with pattern
matching when I use Unicode. My example code follows and contains
comments to suggest what I think should happen:
$KCODE = 'u'
require 'jcode'
text = "\xa3A\nB\n\xa3C\nxD\nE"
# This pattern finds all lines that intuitively should match it.
puts "Pattern includes \"(?:x|\xa3)?\":"
text.scan(/^(?:x|\xa3)?[A-Z]$/).each {|s| puts s }
# This pattern finds all lines except the one containing the C, which is
# contrary to my intuition. I'd expect it to match all lines or, if I were
# really paranoid about Unicode, I might expect it to match all but the
# lines containing A and C.
puts "Pattern includes \"[x\xa3]\":"
text.scan(/^[x\xa3]?[A-Z]$/).each {|s| puts s }
The output of this is:
Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
xD
E
Without the first two (Unicode-specifying) lines, the output is what I
expect:
Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
úC
xD
E
(Notice the extra line in the second half.) The thing I think is
bizarre is that if Unicode is being used, the ú matches ONLY where it’s
the very first thing in the string.
Is there something funny about Unicode characters when using character
classes? Is this a known issue, or is it something weird and/or
ignorant that I’m doing?
It appears that you were spot-on with your guess about wonky things
happening in character classes. Seemingly hex escape codes aren’t
allowed there. You’ll have to either use a literal character, or if
that isn’t possible, do something ugly like this:
/^[x#{"\xa3"}]?[A-Z]$/ # note the interpolation
There might be another solution, hopefully so, but this should at least
work if nothing else turns up.
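For anyone reading this on a newer Ruby, here is a minimal sketch of a byte-oriented variant. It assumes Ruby 2.0 or later, where $KCODE has no effect and jcode no longer exists, so it does not reproduce the 1.8 behaviour discussed above. The variable names are just for illustration. The idea is to keep both the text and the pattern as raw bytes (String#b and the /n regexp flag), so the alternation and the character class can be compared directly:

```ruby
# Sketch only: assumes Ruby 2.0+; results on Ruby 1.8 with $KCODE = 'u' may differ.

# .b returns a copy of the string tagged as raw bytes (ASCII-8BIT).
text = "\xa3A\nB\n\xa3C\nxD\nE".b

# The /n flag makes the pattern byte-oriented as well, so \xa3 is one byte.
alternation = /^(?:x|\xa3)?[A-Z]$/n
char_class  = /^[x\xa3]?[A-Z]$/n

alt_lines   = text.scan(alternation)
class_lines = text.scan(char_class)

puts alt_lines.length            # number of lines matched by the alternation
puts class_lines.length          # number of lines matched by the character class
puts alt_lines == class_lines    # do both patterns agree?
```

If both patterns agree here, the text and the pattern share one byte-level view of the data, which is the situation the interpolation trick above is trying to recreate.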
I hadn’t thought of that one - thanks for the suggestion! The simplest
(working) alternative I could think of was the parenthesised list of
individual characters as shown in the first half of the example code.
That is very weird indeed. It’s normal that your example doesn’t work,
because \xa3 on its own is NOT valid UTF-8. But I would’ve expected it to
work if you used the correct UTF-8 sequence for “ú” ("\xc3\xba"), except
it doesn’t!
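To make the validity point concrete, here is a small sketch. It assumes Ruby 2.0 or later with a UTF-8 source file, which is an assumption on my part since the thread is about 1.8 with $KCODE. The variable names are illustrative. It checks the lone byte against the two-byte sequence, then uses a \u escape inside the character class:

```ruby
# Sketch only: assumes Ruby 2.0+ and a UTF-8 source file.

lone   = "\xa3"       # a single continuation byte, not a complete UTF-8 character
proper = "\xc3\xba"   # the two-byte UTF-8 sequence for U+00FA (ú)

puts lone.valid_encoding?     # false
puts proper.valid_encoding?   # true
puts proper == "\u00fa"       # true

# With valid UTF-8 text, a \u escape inside the character class is matched
# as one whole character.
text    = "\u00faA\nB\n\u00faC\nxD\nE"
matches = text.scan(/^[x\u00fa]?[A-Z]$/)
puts matches.length           # 5
```

This only shows how a current Ruby treats the two forms. It says nothing about why the 1.8 engine behaves differently inside a character class.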
Not really, because I don’t understand Oniguruma (the regexp engine);
I’m barely smart enough to use regexps. But seemingly, you can’t
use hex escapes in character classes, so you have to use the literal or
do other things to work around it (see last two posts above).
Regards,
Jordan
Just a pointer to some examples of how to parse UTF-8 encoded strings in
Ruby:
That shouldn’t matter. He was matching the same hex escape he used in
his string (viz., \xa3). It shouldn’t matter whether it’s unicode or
just random data; the match should go through (or fail) in either case.
WTF? Can anyone explain this?
I played with the u option regex hack quite a while back (it seemed to
work pretty well even with some Japanese characters, if I remember
correctly), so I just thought I’d throw it in as a tip.
Thanks, again, for the update to Nikolai W.'s extension!
Cheers,
Verno