StringScanner and UTF-8 in Ruby 1.9

I’ve recently switched to Ruby 1.9 and have been running into problems
with its multilingual features, most recently with StringScanner. It seems
that, when working with a multi-byte encoding such as UTF-8, its pos method
returns a position in bytes rather than in characters. For instance, the
following code (with the source encoding set to UTF-8) outputs 2:

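# encoding: utf-8
require 'strscan'
s = StringScanner.new("èa")
s.scan(/./)   # matches the single character "è"
puts s.pos    # prints 2: a byte offset, not a character offset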

(If you can’t see it correctly, the string passed to StringScanner.new is
made of two characters: the first is an “e” with a grave accent and the
second is an “a”.) If I replace the first character with an ASCII
character, the output is 1. This clearly suggests that StringScanner#pos
gives a position in bytes rather than in characters. Does anyone know
whether there’s a way to have it return the position in characters, or to
convert it from bytes to characters?
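
To make the conversion I’m asking about concrete, here is a minimal sketch
of one possible workaround. It assumes String#byteslice is available (as far
as I know it only appeared in 1.9.3), and char_pos is a hypothetical helper
name of my own, not part of StringScanner. Newer Ruby releases also appear to
ship a StringScanner#charpos method, which would make the helper unnecessary
there.

```ruby
# encoding: utf-8
require 'strscan'

# Hypothetical helper (not part of StringScanner): converts the scanner's
# byte offset into a character offset by counting the characters contained
# in the bytes consumed so far.
# Assumes String#byteslice exists (Ruby 1.9.3 or later).
def char_pos(scanner)
  scanner.string.byteslice(0, scanner.pos).length
end

s = StringScanner.new("èa")
s.scan(/./)
puts s.pos        # byte offset: 2
puts char_pos(s)  # character offset: 1
```

This re-counts the consumed prefix on every call, so it is probably only
reasonable for short strings or occasional lookups.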

I noticed that StringScanner has a get_byte method and a getch method,
which return the next byte and the next character respectively, so I can’t
help wondering why something similar hasn’t been provided for pos. Do you
think there’s a reason for this, or should it be reported as a bug? (Or am
I missing something obvious?)
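
For reference, a small sketch of the get_byte/getch difference I mean,
using the same two-character string as above. The variable names are only
for illustration.

```ruby
# encoding: utf-8
require 'strscan'

s = StringScanner.new("èa")

ch = s.getch              # consumes the whole character "è"
pos_after_char = s.pos    # 2, because "è" takes two bytes in UTF-8

s.reset                   # rewind to the start of the string

b = s.get_byte            # consumes only the first byte of "è"
pos_after_byte = s.pos    # 1

puts pos_after_char
puts pos_after_byte
```

Both methods advance pos by a number of bytes; only the amount consumed
differs.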

Thanks in advance

Stefano
