Unicode and Character Classes -- a bug?

rrwhite · September 19, 2006, 12:10pm

Hi,

I’ve found some strange and unexpected behaviour to do with pattern
matching when I use Unicode. My example code follows and contains
comments to suggest what I think should happen:

$KCODE = 'u'
require 'jcode'

text = "\xa3A\nB\n\xa3C\nxD\nE"

# This pattern finds all lines that intuitively should match it.
puts "Pattern includes \"(?:x|\xa3)?\":"
text.scan(/^(?:x|\xa3)?[A-Z]$/).each {|s| puts s }

# This pattern finds all lines except the one containing the C, which is
# contrary to my intuition. I'd expect it to match all lines or, if I were
# really paranoid about Unicode, I might expect it to match all but the
# lines containing A and C.
puts "Pattern includes \"[x\xa3]\":"
text.scan(/^[x\xa3]?[A-Z]$/).each {|s| puts s }

The output of this is:

Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
xD
E

Without the first two (Unicode-specifying) lines, the output is what I
expect:

Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
úC
xD
E

(Notice the extra line in the second half.) The thing I think is
bizarre is that if Unicode is being used, the ú matches ONLY where it’s
the very first thing in the string.

Is there something funny about Unicode characters when using character
classes? Is this a known issue, or is it something weird and/or
ignorant that I’m doing?

Thanks!

Richard

rrwhite · September 19, 2006, 2:46pm

Hi Richard,

It appears that you were spot-on with your guess about wonky things
happening in character classes. Seemingly hex escape codes aren’t
allowed there. You’ll have to either use a literal character, or if
that isn’t possible, do something ugly like this:

/^[x#{"\xa3"}]?[A-Z]$/ # note the interpolation

There might be another solution, hopefully so, but this should at least
work if nothing else turns up.

Regards,
Jordan
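
For anyone reading this thread on Ruby 1.9 or later, where `$KCODE` and `jcode` no longer exist and every string carries its own encoding, the interpolation idea above can be sketched roughly as follows. This assumes the intended character is ú (U+00FA) stored as UTF-8, which is a guess about the data; `prefix_chars` and `pattern` are illustrative names, not anything from the original posts.

```ruby
# Sketch for Ruby 1.9+ (assumption: the text is valid UTF-8).
# prefix_chars and pattern are hypothetical names, not from the thread.
prefix_chars = "x\u00fa"   # characters allowed before the capital letter

# Regexp.escape protects characters such as ], ^ and - that are special
# inside a character class; the escaped string is then interpolated.
pattern = /^[#{Regexp.escape(prefix_chars)}]?[A-Z]$/

text = "\u00faA\nB\n\u00faC\nxD\nE"
matches = text.scan(pattern)
matches.each { |s| puts s }
```

Building the class from a string keeps the set of allowed prefix characters in one place, at the cost of a slightly less readable pattern.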

rrwhite · September 19, 2006, 3:55pm

Jordan Callicoat wrote:

/^[x#{"\xa3"}]?[A-Z]$/ # note the interpolation

There might be another solution, hopefully so, but this should at least
work if nothing else turns up.

I hadn’t thought of that one - thanks for the suggestion! The simplest
(working) alternative I could think of was the parenthesised list of
individual characters as shown in the first half of the example code.
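
The alternation approach can also be generated rather than typed by hand. The following is only a sketch for Ruby 1.9 or later, again assuming UTF-8 text and ú (U+00FA) as the intended character; `alternatives` and `pattern` are made-up names.

```ruby
# Sketch for Ruby 1.9+; alternatives and pattern are hypothetical names.
# Regexp.union builds an alternation from its arguments (here, x or U+00FA).
alternatives = Regexp.union("x", "\u00fa")

# Interpolating a Regexp embeds it as a non-capturing group.
pattern = /^(?:#{alternatives})?[A-Z]$/

text = "\u00faA\nB\n\u00faC\nxD\nE"
matches = text.scan(pattern)
matches.each { |s| puts s }
```

Because an alternation compares whole strings rather than members of a class, it also extends to multi-character prefixes, which a character class cannot express.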

rrwhite · September 21, 2006, 2:02am

Richard Wiseman wrote:

puts "Pattern includes \"[x\xa3]\":"
text.scan(/^[x\xa3]?[A-Z]$/).each {|s| puts s }

That is very weird indeed. It’s normal that your example doesn’t work, because
\xa3 is NOT valid utf8. But I would’ve expected it to work if you used the
correct utf8 sequence for “ú” ("\xc3\xba"), except it doesn’t!

$KCODE='u'
=> "u"
text = "\xc3\xbaA\nB\n\xc3\xbaC\nxD\nE"
=> "úA\nB\núC\nxD\nE"
text.scan(/^[xú]?[A-Z]$/)
=> ["úA", "B", "úC", "xD", "E"]
text.scan(/^[x\xc3\xba]?[A-Z]$/)
=> ["B", "xD", "E"]

WTF? Can anyone explain this?
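
The point about \xa3 not being valid UTF-8 can be checked directly on Ruby 1.9 or later, where strings know their encoding. The sketch below is not a reproduction of the 1.8 behaviour discussed here; it only shows the byte-level facts and one way to name the character in a class by code point. The variable names are illustrative.

```ruby
# Sketch for Ruby 1.9+, where every String carries an encoding.
lone_byte = "\xa3".dup.force_encoding(Encoding::UTF_8)
two_bytes = "\xc3\xba".dup.force_encoding(Encoding::UTF_8)

puts lone_byte.valid_encoding?   # false: 0xA3 alone is a stray continuation byte
puts two_bytes.valid_encoding?   # true
puts two_bytes == "\u00fa"       # true: C3 BA is the UTF-8 encoding of U+00FA

# A \u escape inside a character class names the whole code point,
# so the class does not depend on how individual byte escapes are handled.
text = "\u00faA\nB\n\u00faC\nxD\nE"
matches = text.scan(/^[x\u00fa]?[A-Z]$/)
p matches.length                 # 5
```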

rrwhite · September 21, 2006, 10:25am

Jordan Callicoat wrote:
Daniel DeLorme wrote:

…

WTF? Can anyone explain this?

Not really, because I don’t understand Oniguruma (the regexp engine);
I’m barely smart enough to use regexps. But seemingly, you can’t
use hex escapes in character classes, so you have to use the literal or
do other things to work around it (see last two posts above).

Regards,
Jordan

Just a pointer to some examples how to parse UTF-8 encoded strings in
Ruby:

rrwhite · September 21, 2006, 5:16am

Daniel DeLorme wrote:

That is very weird indeed. It’s normal that your example doesn’t work, because
\xa3 is NOT valid utf8. But I would’ve expected it to work if you used the
correct utf8 sequence for “ú” ("\xc3\xba"), except it doesn’t!

That shouldn’t matter. He was matching the same hex escape he used in
his string (viz., \xa3). It shouldn’t matter whether it’s unicode or
just random data; the match should go through (or fail) in either case.

WTF? Can anyone explain this?

Not really, because I don’t understand Oniguruma (the regexp engine);
I’m barely smart enough to use regexps. But seemingly, you can’t
use hex escapes in character classes, so you have to use the literal or
do other things to work around it (see last two posts above).

Regards,
Jordan
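
The "it's just bytes, so the same escape should match" view can be expressed explicitly on newer Rubies by treating both the text and the pattern as binary. This is a sketch assuming Ruby 2.0 or later (for `String#b`); it says nothing about why 1.8 with `$KCODE = 'u'` behaves as reported above.

```ruby
# Sketch for Ruby 2.0+: match raw bytes, ignoring any character encoding.
text = "\xa3A\nB\n\xa3C\nxD\nE".b   # String#b returns an ASCII-8BIT (binary) copy
pattern = /^[x\xa3]?[A-Z]$/n        # the n flag makes the regexp ASCII-8BIT as well

matches = text.scan(pattern)
matches.each { |s| p s }            # p shows the raw byte as an escape
```

With both sides binary, 0xA3 is simply one byte in the class, so the lines beginning with it match like any other.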

rrwhite · September 21, 2006, 10:57am

Verno Miller wrote:

Just a pointer to some examples how to parse UTF-8 encoded strings in
Ruby:

Hi Verno,

I used to have a class that used that technique to fake UTF-8 support.
I now use Nikolai W.'s extension
(http://rubyforge.org/projects/char-encodings).

Regards,
Jordan

rrwhite · September 21, 2006, 12:01pm

Verno Miller wrote:

Thanks for this one, Jordan! I seem to have missed some stuff on
redhanded as of late, esp.

NP

For some info on Oniguruma btw I’ve run across this page:

http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

And thank YOU for this Verno! Oniguruma cheat sheet. That’s sweet!!

Regards,
Jordan

rrwhite · September 21, 2006, 11:37am

Jordan Callicoat wrote:
Verno Miller wrote:

Just a pointer to some examples how to parse UTF-8 encoded strings in
Ruby:

Hi Verno,

I used to have a class that used that technique to fake UTF-8 support.
I now use Nikolai W.'s extension
(http://rubyforge.org/projects/char-encodings).

Regards,
Jordan

Thanks for this one, Jordan! I seem to have missed some stuff on
redhanded as of late, esp.

http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAllReady.html

For some info on Oniguruma btw I’ve run across this page:

http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

I’ve played with the u option regex hack quite a while back (it seemed to
be working pretty well even with some Japanese chars, if I remember
correctly), so I just thought I’d throw it in as a tip.

Thanks, again, for the update to Nikolai W.'s extension!

Cheers,
Verno