String encoding issues

I found several issues in string encoding. Here is the problem:

[[email protected] mysql]# irb -E ascii

I start irb with the default external encoding set to ASCII.

irb(main):014:0> String.new.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):015:0> "".encoding
=> #<Encoding:US-ASCII>

I get different encodings when I initialize an empty string. Why?

irb(main):023:0> "\x80".encoding
=> #<Encoding:ASCII-8BIT>
irb(main):024:0> "\x7F".encoding
=> #<Encoding:US-ASCII>

It looks like a string literal containing a byte greater than 0x7F gets the ASCII-8BIT encoding. That is OK.

irb(main):005:0> new_str = "\xF1\xF2"
=> "\xF1\xF2"
irb(main):006:0> new_str.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):007:0> s = "%c%c%c%c%c%s" % [49, 5, 245, 225, 1, new_str]
Encoding::CompatibilityError: incompatible character encodings: US-ASCII and ASCII-8BIT
from (irb):7:in `%'
from (irb):7
from /bin/irb:12:in `<main>'

Now I try to use an ASCII-8BIT string as an argument when formatting another string, and it raises an exception. Why?

irb(main):008:0> s = "%c%c%c%c%c%s" % [49, 5, 45, 25, 1, new_str]
=> "1\x05-\x19\x01\xF1\xF2"

I am very surprised that it works if none of the %c values is greater than 0x7F.
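The two results above look consistent with Ruby's general compatibility rule for strings, which can be checked directly with Encoding.compatible?. This is only a sketch: the variable names are made up, and force_encoding is used to imitate what the format string probably holds at the moment the %s argument is appended. It does not prove what String#% does internally.

```ruby
# A US-ASCII string holding only 7-bit bytes, like "1\x05-" above.
seven_bit = "1\x05-".dup.force_encoding("US-ASCII")

# A US-ASCII string holding bytes > 0x7F (not valid US-ASCII).
high_bytes = "\xF5\xE1".dup.force_encoding("US-ASCII")

# The ASCII-8BIT argument from the example.
binary = "\xF1\xF2".dup.force_encoding("ASCII-8BIT")

# 7-bit content is compatible with any ASCII-compatible encoding,
# so the result simply takes the other string's encoding.
ok = Encoding.compatible?(seven_bit, binary)

# Two non-7-bit strings with different encodings are not compatible.
bad = Encoding.compatible?(high_bytes, binary)

joined = seven_bit + binary

raised = begin
  high_bytes + binary
  false
rescue Encoding::CompatibilityError
  true
end
```

Here ok is the ASCII-8BIT encoding, bad is nil, and only the second concatenation raises, which mirrors the difference between the two format calls.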

irb(main):012:0> s = "%c%c%c%c%c" % [49, 5, 245, 225, 1]
=> "1\x05\xF5\xE1\x01"
irb(main):013:0> s.encoding
=> #<Encoding:US-ASCII>

If I leave the ASCII-8BIT string out of the format arguments, it also works. But I am very surprised that the encoding is still US-ASCII even though the string contains non-ASCII bytes. Why?

I figured out the first question.

[[email protected] mysql]# irb
irb(main):001:0> s = String.new
=> ""
irb(main):002:0> s.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> puts Encoding.default_external.name
UTF-8

Ruby will always use ASCII-8BIT as encoding when you use String.new to
create a new String object.
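A small sketch of this, plus one way to get an empty string with a different tag. The variable names are only for illustration.

```ruby
# String.new with no arguments is tagged ASCII-8BIT (also available
# as Encoding::BINARY), whatever Encoding.default_external is.
empty = String.new
empty_is_binary = (empty.encoding == Encoding::BINARY)

# If a text encoding is wanted, retag the empty string explicitly.
utf8_empty = String.new.force_encoding("UTF-8")
```

An empty literal ("") takes the source encoding instead, which is why it showed up as US-ASCII in the irb -E ascii session.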

Oliver P. wrote:

Ruby will always use ASCII-8BIT as encoding when you use String.new to
create a new String object.

Ugh. That’s another special case to add to
http://github.com/candlerb/string19/blob/master/string19.rb

However in practice it doesn’t matter much, because the empty string is
compatible.

irb(main):001:0> s1 = String.new
=> ""
irb(main):002:0> s2 = "groß"
=> "groß"
irb(main):003:0> s1.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> s2.encoding
=> #<Encoding:UTF-8>
irb(main):005:0> s1 + s2
=> "groß"
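The same thing can be checked without irb. This is just a sketch with illustrative names; "\u00DF" is used for the ß so the snippet does not depend on the source file encoding.

```ruby
empty_binary = String.new      # ASCII-8BIT, zero bytes
text = "gro\u00DF"             # UTF-8, contains a non-ASCII character

# An empty string is compatible with anything; the non-empty side wins.
empty_result = Encoding.compatible?(empty_binary, text)
sum_encoding = (empty_binary + text).encoding

# A non-empty binary string with a high byte is a different story.
nonempty_binary = "\xFF".dup.force_encoding("ASCII-8BIT")
nonempty_result = Encoding.compatible?(nonempty_binary, text)
```

With the empty string both results are UTF-8; with the one-byte binary string Encoding.compatible? returns nil, so the special case only helps while the String.new object stays empty.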

And as for this which you found:

irb(main):003:0> s = "%c%c%c%c%c".force_encoding("US-ASCII")
=> "%c%c%c%c%c"
irb(main):004:0> t = s % [49, 5, 245, 225, 1]
=> "1\x05\xF5\xE1\x01"
irb(main):005:0> t.encoding
=> #<Encoding:US-ASCII>

I think it’s just one of the many bugs in ruby 1.9.x, likely due to a
total lack of specification of the new behaviour for all methods which
accept or return strings (although if there’s no specification, I
suppose you can’t really argue it’s a bug; it can behave however it
likes)
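If the goal is just to build a byte sequence like the one in the original example, one way to sidestep the format-string encoding entirely is Array#pack. This is a suggested alternative, not something proposed earlier in the thread, and the names header, payload and packet are invented for the sketch.

```ruby
# pack("C*") turns each integer into one unsigned byte and returns
# an ASCII-8BIT string, so values above 0x7F are not a problem.
header = [49, 5, 245, 225, 1].pack("C*")

payload = "\xF1\xF2".dup.force_encoding("ASCII-8BIT")

# Both sides are ASCII-8BIT, so concatenation cannot raise
# Encoding::CompatibilityError.
packet = header + payload
packet_bytes = packet.bytes.to_a

# For comparison: the same bytes tagged US-ASCII, like the result of
# the %c format above, are not a valid US-ASCII string.
mislabelled = header.dup.force_encoding("US-ASCII")
mislabelled_valid = mislabelled.valid_encoding?
```

packet_bytes comes out as [49, 5, 245, 225, 1, 241, 242] and mislabelled_valid is false, which is one way to detect a string whose tag does not match its content.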
