$KCODE, -KU and CLR strings

murena · March 15, 2010, 9:48pm

Hi everyone,
please consider this snippet:

$KCODE = "U"
puts "日本語".to_clr_string.length

When I run it by launching ir.exe without any options, I get 9 as
output (each character in that string takes 3 bytes in UTF-8). When I
pass the -KU option to ir.exe, I get 3. I think 3 is the correct
result here, but in any case, shouldn't setting $KCODE = "U" on its
own have the same effect as starting ir.exe with the -KU option?
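
For comparison, the 9-versus-3 difference can be reproduced outside IronRuby. This is a minimal sketch that assumes Ruby 1.9 or later (where strings carry an encoding) rather than IronRuby's 1.8-compatible mode. It does not use $KCODE or to_clr_string, and the variable names are only illustrative.

```ruby
# Assumes Ruby 1.9+; \u escapes always produce a UTF-8 string,
# so the result does not depend on the source file's encoding.
s = "\u65E5\u672C\u8A9E"            # the three characters of 日本語

byte_count = s.bytesize             # number of UTF-8 bytes
char_count = s.length               # number of characters

# Treating the same bytes as binary data makes every byte count
# as one "character", which is the byte-oriented 1.8 view.
binary = s.dup.force_encoding(Encoding::BINARY)
binary_length = binary.length

puts byte_count     # prints 9
puts char_count     # prints 3
puts binary_length  # prints 9
```

The string data is identical in all three cases; only the encoding used to interpret it changes the reported length.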

Thanks,
Daniele

--
Daniele A.
http://www.clorophilla.net/
http://twitter.com/JoL1hAHN

murena · March 15, 2010, 11:39pm

Setting $KCODE = "U" doesn't actually affect the encoding of the literal
in the same compilation unit. It only affects literals that are parsed
after the KCODE is set.

$KCODE = "U"
x = "日本語"
p x.Encoding # => ASCII-8BIT, since the current compilation unit (a
             # file) was parsed using BINARY encoding
p x.size     # => 9 bytes

y = eval('"日本語"')
p y.Encoding # => KCODE: UTF8
p y.size     # => 9, since String#size in MRI 1.8.6 doesn't understand
             # encodings; it counts in bytes

c = x.to_clr_string # this essentially creates a string whose non-ASCII
                    # characters are not correctly encoded in UTF-8
                    # (they are UTF-8 bytes widened to 16 bits)
p c.size     # => 9 characters
p c.Encoding # => UTF-8, since a CLR string doesn't hold on to an
             # encoding. When you ask for its bytes we need to use some
             # encoding. Maybe we could choose UTF-16, but MRI 1.8.6 has
             # at least some support for UTF-8.

d = y.to_clr_string # correctly encoded string
p d.Encoding # => UTF-8
p d.size     # => 3 characters
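
The difference between c and d can be sketched in plain Ruby. This assumes Ruby 1.9 or later and does not call to_clr_string. Array#pack("U*") is used only to imitate widening each byte to its own character, which approximates the behaviour described above and is not IronRuby's actual implementation.

```ruby
# Assumes Ruby 1.9+. The nine UTF-8 bytes of 日本語:
utf8_bytes = "\u65E5\u672C\u8A9E".bytes

# Like c: every byte becomes its own code point (U+00E6, U+0097, ...),
# so the result has nine characters and none of them is correct.
widened = utf8_bytes.pack("U*")

# Like d: the bytes are decoded as UTF-8, giving three characters.
decoded = utf8_bytes.pack("C*").force_encoding(Encoding::UTF_8)

puts widened.length                       # prints 9
puts decoded.length                       # prints 3

# Size of each variant as UTF-16, the representation CLR strings use.
puts widened.encode("UTF-16LE").bytesize  # prints 18
puts decoded.encode("UTF-16LE").bytesize  # prints 6
```
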

Encodings are not very well supported in 1.8.6, which makes it
difficult to implement good interop between CLR and MRI strings. This
should get better in the next version of IronRuby, which will target
compatibility with 1.9.

Tomas

murena · March 16, 2010, 9:28pm

On Mon, Mar 15, 2010 at 23:36, Tomas M. wrote:

> Setting $KCODE = "U" doesn't actually affect the encoding of the literal in the same compilation unit. It only affects literals that are parsed after the KCODE is set.

I was missing that subtle difference; now it makes sense to me. Thanks.
