How to convert string to binary and back in Ruby 1.9?

I’m using Ruby 1.9.1-p243 on Mac OS X 10.5.8.

I have this UTF-8 string that I want to turn into binary, and then
from binary into ISO-8859-1. The result should be some garbage
string, which I need for debugging purposes. For the sake of an
example, my UTF-8 string is “помоник” (Russian for “helper”). After
looking at the documentation, it seemed like String#force_encoding
would do what I need.

But when I go to irb, I get this:

irb(main):060:0> "помоник".encoding
=> #<Encoding:UTF-8>
irb(main):061:0> "помоник".bytes.to_a
=> [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208, 186]
irb(main):062:0> "помоник".force_encoding("ISO-8859-1")
=> "помоник"
irb(main):063:0> "помоник".force_encoding("ISO-8859-1").encoding
=> #<Encoding:ISO-8859-1>
irb(main):064:0> "помоник".force_encoding("ISO-8859-1").bytes.to_a
=> [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208, 186]

So apparently, it changes the encoding, leaves the bytes unchanged,
but also leaves the decoded characters unchanged? Is this a bug or
what?

Note also:

irb(main):066:0> "помоник".encode("BINARY")
Encoding::UndefinedConversionError: "\xD0\xBF" from UTF-8 to ASCII-8BIT
        from (irb):66:in `encode'
        from (irb):66
        from /usr/local/bin/irb:12:in `<main>'

So apparently in Ruby 1.9, binary isn’t really binary?

I banged my head for a while, and then tried it in python3.
Completely easy:

>>> 'помоник'
'помоник'

>>> 'помоник'.encode('utf_8')
b'\xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd0\xbd\xd0\xb8\xd0\xba'

>>> 'помоник'.encode('utf_8').decode('latin_1')
'Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº'

>>> 'помоник'.encode('utf_8').decode('latin_1').encode('latin_1')
b'\xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd0\xbd\xd0\xb8\xd0\xba'

So can I do the same thing in Ruby 1.9? How do I deal with binary
data? How do I convert a string to a manageable byte sequence? Is
there a way to turn an array of bytes into a string of a specified
encoding?
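To make it concrete, here is a sketch of the round trip I'm after, using methods I dug up from the docs (whether this is the idiomatic way is exactly my question):

```ruby
# encoding: UTF-8

utf8 = "помоник"

# String -> array of raw byte values
bytes = utf8.bytes.to_a
# [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208, 186]

# Byte array -> String tagged with a chosen encoding
latin1 = bytes.pack("C*").force_encoding("ISO-8859-1")

# Transcode the Latin-1 view back to UTF-8 to get the "garbage" string
garbage = latin1.encode("UTF-8")
# => "Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº"
```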

On Wednesday, September 2, 2009, Joe wrote:


AFAIK String#force_encoding doesn’t re-encode the string; it just
changes its properties (the encoding tag).

#encode, on the other hand, does change the encoding, and it fails if
the conversion is not possible.
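A minimal sketch of the difference:

```ruby
# encoding: UTF-8

str = "помоник"

# force_encoding: retags the string; the bytes stay exactly the same
tagged = str.dup.force_encoding("ISO-8859-1")
tagged.bytes.to_a == str.bytes.to_a   # true: bytes untouched
tagged.encoding                       # #<Encoding:ISO-8859-1>

# encode: actually transcodes, so the bytes change
# (or it raises Encoding::UndefinedConversionError if it can't)
utf16 = str.encode("UTF-16BE")
utf16.bytes.to_a == str.bytes.to_a    # false: bytes rewritten
```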

On Sep 1, 4:18 pm, Iñaki Baz C. [email protected] wrote:

#encode, on the other hand, does change the encoding, and it fails if
the conversion is not possible.

OK, so String#force_encoding just changes the encoding tag, but does
not alter the bytes. But then how does it print as the same sequence
of Cyrillic characters when it thinks its encoding is ISO-8859-1? How
does Ruby 1.9 decide what characters to display when printing a
String? Surely it must adhere to the encoding of that String? Is Ruby
storing the ISO-8859-1 encoded string as a sequence of Unicode
characters, or what?

This seems crazy to me.

OK, so maybe String#force_encoding is crazy and broken or just won’t
be able to do what I want. Your suggestion was that String#encode is
the method for changing the string. Of course I tried that one, and
it errors because there is no Cyrillic alphabet in ISO-8859-1.

Is there really no way to go from bytes to string? That’s all I want!

I’m finding it pretty frustrating. :(

It is, especially as Ruby 1.8 behaviour is less annoying IMHO in this
regard.

On Sep 1, 4:53 pm, Joe [email protected] wrote:

Is there really no way to go from bytes to string? That’s all I want!

OK, I found the Array#pack method. At first glance, it seemed to be
exactly what I was looking for. I could do str.bytes.to_a to turn a
String into raw bytes, and Array#pack will turn them right back into a
String.

But go to

http://ruby-doc.org/core-1.9/classes/Array.html

The method is missing from the 1.9 documentation. Has it been
deprecated? The 1.8 documentation doesn’t help much, because it seems
the function is entirely unaware of the String encoding.

I guess Ruby’s m17n is brand spanking new, and it shows, huh? I’m
finding it pretty frustrating. :(

On Sep 1, 6:38 pm, Joe [email protected] wrote:

Has it been deprecated?
I don’t believe so. I don’t know why it’s not in the docs there, but
it’s in my local ri:

Slim2:~ phrogz$ ri -T Array#pack

Array#pack
  arr.pack(aTemplateString) -> aBinaryString

From Ruby 1.9.1

  Packs the contents of arr into a binary sequence according to the
  directives in aTemplateString (see the table below). Directives "A",
  "a", and "Z" may be followed by a count, which gives the width of the
  resulting field. The remaining directives also may take a count,
  indicating the number of array elements to convert. If the count is
  an asterisk ("*"), all remaining array elements will be converted.
  Any of the directives "sSiIlL" may be followed by an underscore ("_")
  to use the underlying platform's native size for the specified type;
  otherwise, they use a platform-independent size. Spaces are ignored
  in the template string. See also String#unpack.

    a = [ "a", "b", "c" ]
    n = [ 65, 66, 67 ]
    a.pack("A3A3A3")   #=> "a  b  c  "
    a.pack("a3a3a3")   #=> "a\000\000b\000\000c\000\000"
    n.pack("ccc")      #=> "ABC"

 Directives for pack:

  Directive  |  Meaning
  ---------------------------------------------------------------
      @      |  Moves to absolute position
      A      |  arbitrary binary string (space padded, count is width)
      a      |  arbitrary binary string (null padded, count is width)
      B      |  Bit string (descending bit order)
      b      |  Bit string (ascending bit order)
      C      |  Unsigned byte (C unsigned char)
      c      |  Byte (C char)
      D, d   |  Double-precision float, native format
      E      |  Double-precision float, little-endian byte order
      e      |  Single-precision float, little-endian byte order
      F, f   |  Single-precision float, native format
      G      |  Double-precision float, network (big-endian) byte order
      g      |  Single-precision float, network (big-endian) byte order
      H      |  Hex string (high nibble first)
      h      |  Hex string (low nibble first)
      I      |  Unsigned integer
      i      |  Integer
      L      |  Unsigned long
      l      |  Long
      M      |  Quoted printable, MIME encoding (see RFC 2045)
      m      |  Base64 encoded string (see RFC 2045, count is width;
             |  if count is 0, no line feeds are added, see RFC 4648)
      N      |  Long, network (big-endian) byte order
      n      |  Short, network (big-endian) byte order
      P      |  Pointer to a structure (fixed-length string)
      p      |  Pointer to a null-terminated string
      Q, q   |  64-bit number
      S      |  Unsigned short
      s      |  Short
      U      |  UTF-8
      u      |  UU-encoded string
      V      |  Long, little-endian byte order
      v      |  Short, little-endian byte order
      w      |  BER-compressed integer
      X      |  Back up a byte
      x      |  Null byte
      Z      |  Same as "a", except that null is added with *
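Worth noting for the encoding question: the U directive packs Unicode codepoints as UTF-8, which gives a codepoints-to-string path that is actually encoding-aware (a quick sketch):

```ruby
# encoding: UTF-8

# "U" packs an array of Unicode codepoints into a UTF-8 string
s = [1087, 1086, 1084].pack("U*")   # codepoints for "пом"
s.encoding                          # #<Encoding:UTF-8>

# ...and unpack("U*") goes the other way
"пом".unpack("U*")                  # [1087, 1086, 1084]
```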

Joe wrote:

I have this UTF-8 string that I want to turn into binary, and then
from binary into ISO-8859-1.

UTF-8 is a binary encoding of Unicode codepoints, so it’s a sequence of
binary bytes by definition. And you get the same as your Python code:

irb(main):001:0> 'помоник'
=> "помоник"
irb(main):002:0> 'помоник'.bytes.each { |x| print "%02x " % x }
d0 bf d0 be d0 bc d0 be d0 bd d0 b8 d0 ba => "помоник"
irb(main):004:0> 'помоник'.force_encoding("BINARY")
=> "\xD0\xBF\xD0\xBE\xD0\xBC\xD0\xBE\xD0\xBD\xD0\xB8\xD0\xBA"

I think what’s confusing you is this:

irb(main):005:0> str = 'помоник'
=> "помоник"
irb(main):006:0> str.force_encoding("ISO-8859-1")
=> "помоник"

Here, Ruby is doing something strange. The string is tagged as a
sequence of ISO-8859-1 characters, but this sequence of bytes is being
squirted as-is to a UTF-8 terminal, and so the UTF-8 terminal is
displaying them as the original characters.

You can get the behaviour you want like this, by transcoding to UTF-8:

irb(main):009:0> str.encode("UTF-8")
=> "Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº"

Given that irb is running in a UTF-8 environment, it is arguable that
STDOUT should have an external encoding of UTF-8, which means text
should be transcoded to UTF-8 automatically.

That is, you can also get the behaviour you want from this standalone
program:

# encoding: UTF-8

STDOUT.set_encoding "UTF-8"   # << THE MAGIC BIT

str = 'помоник'
str.force_encoding("ISO-8859-1")
puts str

It seems inconsistent to me that STDOUT doesn’t get its
external_encoding set automatically.

So apparently in Ruby 1.9, binary isn’t really binary?

Correct. In Ruby 1.9, binary is ASCII. I hate this.

I have documented a lot of the gory details at

Thanks for bringing another anomaly to my attention.

On Sep 2, 2009, at 2:55 AM, Joe wrote:

alter the string. But how does it decide to print as the same
sequence of Cyrillic characters, when it thinks its encoding is
ISO-8859-1? How does ruby1.9 decide what characters to display when
printing a String? Surely it must adhere to the encoding of that
String? Is ruby storing the ISO-8859-1 encoded string as a sequence
of unicode characters, or what?

Brian C. did a pretty thorough documentation of 1.9’s M17N. There are
also multiple sources of documentation on the subject at
http://blog.grayproductions.net/articles/what_ruby_19_gives_us
(Edward G.) and elsewhere.

I’m also more comfortable with how 1.8 behaves but then again I’m a
newbie here.

Patrick

Joe wrote:

I’m using Ruby 1.9.1-p243 on Mac OS X 10.5.8.

I have this UTF-8 string that I want to turn into binary, and then
from binary into ISO-8859-1.

What does it mean to “turn a string from binary into ISO-8859-1”?

'помоник'.encode('utf_8').decode('latin_1')
'Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº'

What Python does here is encode the string (from its internal Unicode
format) to a UTF-8 byte string, and then convert it back into its
internal Unicode format, interpreting it as a Latin-1 string. Finally
it writes it out to the console, which means it converts it yet again
(probably to UTF-8) for the Mac OS Terminal. This is an important
point you should keep in mind.

So this is quite similar to what Ruby does, except that Ruby makes no
conversion to a general internal format and no special conversion for
the terminal.

If you wrote the results out to a file, you would get the same
result.
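For comparison, the Ruby 1.9 counterpart of that Python round trip might be sketched like this, with force_encoding standing in for Python's decode with a different codec:

```ruby
# encoding: UTF-8

str = "помоник"

# Python: 'помоник'.encode('utf_8')  -- get at the raw UTF-8 bytes
raw = str.dup.force_encoding("BINARY")     # same bytes, tagged binary

# Python: ...decode('latin_1')       -- reinterpret those bytes as Latin-1
latin = raw.force_encoding("ISO-8859-1")

# To actually *see* the garbage on a UTF-8 terminal, transcode it:
latin.encode("UTF-8")                      # "Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº"
```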

Regards, R.

I think I understand it now. The following was confusing me initially:

>> str = "über"
=> "über"
>> str.force_encoding("ISO-8859-1")
=> "über"
>> str = "groß"
=> "groß"
>> str.force_encoding("ISO-8859-1")
=> "gro�\x9F"

It appears this is just an artefact of String#inspect. String#inspect
“knows” that \x80 to \x9F are not printable characters in ISO-8859-1, so
converts them to the backslash hex form. This breaks the UTF-8 display
by splitting the character, but of course only for strings which contain
bytes in that range.

You still get the string displayed as UTF-8 using puts without inspect:

>> puts str
groß
=> nil

It works if you set the encoding for STDOUT inside irb, in which case
you’ll get everything transcoded to your terminal’s character set.

STDOUT.set_encoding "locale"
=> #<IO:<STDOUT>>

str = "über"
=> "über"

str.force_encoding("ISO-8859-1")
=> "Ã¼ber"

puts str
Ã¼ber
=> nil