StringScanner and UTF-8 in Ruby 1.9

crocco · September 16, 2009, 10:28am

I’ve recently switched to Ruby 1.9 and I’m having problems using its
multilingual features. Now I’m having a problem with StringScanner. It
seems that, when working with multi-byte encodings such as UTF-8, its
pos method returns a position in bytes rather than in characters. For
instance, the following code (when the source encoding is set to UTF-8)
outputs 2:

require 'strscan'

s = StringScanner.new "èa"
s.scan(/./)
puts s.pos

(If you can’t see it correctly, the string passed to StringScanner.new
is made of two characters: the first is an “e” with a grave accent and
the second is an “a”.) If I replace the first character with an ASCII
character, the output is 1.
This clearly hints that StringScanner#pos gives a position in terms of
bytes rather than characters. Does anyone know whether there’s a way to
have it return the position in characters rather than in bytes, or to
convert it from bytes to characters?
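
To make the conversion being asked about concrete, here is a minimal
sketch of one possible workaround. It is not a StringScanner feature:
char_pos is a hypothetical helper name, and the sketch assumes that
String#byteslice is available (it was not in the earliest 1.9 releases)
and that the scan pointer sits on a character boundary. The accented
character is written as an escape so the example does not depend on the
source encoding.

```ruby
require 'strscan'

# Hypothetical helper, not part of StringScanner: converts the scanner's
# byte offset into a character offset by counting the characters in the
# portion of the string that has already been consumed.
# Assumes String#byteslice exists and that scanner.pos falls on a
# character boundary.
def char_pos(scanner)
  scanner.string.byteslice(0, scanner.pos).length
end

s = StringScanner.new "\u00E8a"  # "e" with a grave accent, then "a"
s.scan(/./)
puts s.pos        # byte offset of the scan pointer
puts char_pos(s)  # character offset of the scan pointer
```

Counting characters this way rescans the consumed prefix on every call,
so it is only meant to illustrate the idea, not to be efficient on large
inputs.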

I noticed that StringScanner has a get_byte method and a getch method,
which return the next byte and the next character respectively, so I
can’t help wondering why something similar hasn’t been provided for pos.
Do you think there’s a reason for this, or should it be reported as a
bug? (Or am I missing something obvious?)

Thanks in advance

Stefano