Raw bytes in 1.9

Kless · July 27, 2009, 10:15am

I want to stick raw bytes (0-255) into a variable.

This works in Ruby 1.8, since it always assumes that the characters in
a string are exactly bytes. But I’m not sure about Ruby 1.9, as it has
per-string character encodings.

Kless · July 27, 2009, 1:22pm

Hi –

On Mon, 27 Jul 2009, Kless wrote:

I want to stick raw bytes (0-255) into a variable.

This works in Ruby 1.8, since it always assumes that the characters in
a string are exactly bytes. But I’m not sure about Ruby 1.9, as it has
per-string character encodings.

In 1.9 you can iterate through strings by line, byte, character, or
codepoint. I’m not sure exactly what you want to do but have a look at
String#bytes (aka #each_byte).
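As a rough sketch of that byte-level view: the sample below borrows a shortened version of the string posted later in this thread, and the explicit force_encoding call is my own assumption, added so the result does not depend on the source encoding of the file.

```ruby
# A short sample of raw bytes, explicitly tagged as binary (ASCII-8BIT).
raw = "V-\243\230mJ".dup.force_encoding("ASCII-8BIT")

# String#bytes yields the integer value (0-255) of every byte.
byte_values = raw.bytes.to_a
p byte_values          # [86, 45, 163, 152, 109, 74]

# each_byte walks the same values without building an array first.
high_bit_count = 0
raw.each_byte { |b| high_bit_count += 1 if b > 127 }
p high_bit_count       # 2

# For comparison, a UTF-8 string: characters and bytes are counted separately.
text = "\u00FCber"     # "über", written with an escape so the file stays ASCII
p text.length          # 4 characters
p text.bytesize        # 5 bytes
```

The point of the comparison is that length counts characters in the string's encoding, while bytesize and bytes always describe the underlying bytes.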

David

Kless · July 27, 2009, 1:31pm

I need to store raw strings such as this one:
"V-\243\230mJ\262.\031\023-4\301\324\241Y"
and I would like to know whether there will be any problem with Ruby 1.9.

Kless · July 27, 2009, 2:44pm

Hi –

On Mon, 27 Jul 2009, Kless wrote:

I need to store raw strings such as this one:
"V-\243\230mJ\262.\031\023-4\301\324\241Y"
and I would like to know whether there will be any problem with Ruby 1.9.

$ ruby191 -e 'p "V-\243\230mJ\262.\031\023-4\301\324\241Y"'
"V-\xA3\x98mJ\xB2.\x19\x13-4\xC1\xD4\xA1Y"

Why not install Ruby 1.9.1, or get an account on ruby-versions.net?

David

Kless · July 27, 2009, 3:58pm

Kless wrote:

I need to store raw strings such as this one:
"V-\243\230mJ\262.\031\023-4\301\324\241Y"
and I would like to know whether there will be any problem with Ruby 1.9.

The answer is, “that depends”: Ruby 1.9’s string handling is extremely
complicated.

If the string is a literal within the program source, then adding a
comment

# encoding: ASCII-8BIT

as the very first line of your program (or the second line if you have a
shebang line) will make literals have this encoding by default. Having
said that, strings with backslash-escapes like that will probably get
ASCII-8BIT by default.

If the string comes from reading a file, then you need to open it in
binary mode: File.open("xxx","rb") { |f| … }

If the string comes from reading from a socket, then I believe it will
be ASCII-8BIT by default.

If the string comes from reading STDIN, then you will have to be very
careful; for safety you need something like

STDIN.set_encoding "ASCII-8BIT"
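
A minimal sketch of the file case, assuming a scratch file created with Tempfile (the file name prefix and the three sample bytes are made up for illustration):

```ruby
require "tempfile"

# Three arbitrary bytes to write to a scratch file (illustrative values).
sample = [0xff, 0xee, 0xdd].pack("C*")

file_data = nil
read_data = nil
Tempfile.open("rawbytes") do |tmp|
  tmp.binmode
  tmp.write(sample)
  tmp.flush

  # "rb" opens the file in binary mode, so reads come back as ASCII-8BIT.
  file_data = File.open(tmp.path, "rb") { |f| f.read }

  # File.binread is a shorthand for the same thing.
  read_data = File.binread(tmp.path)
end

p file_data.encoding == Encoding::ASCII_8BIT   # true
p file_data.bytes.to_a                          # [255, 238, 221]
```

The STDIN case is not exercised here, because its behaviour depends on how the script is invoked and on the environment.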

Your program may or may not work without these changes, because Ruby
1.9’s behaviour at runtime depends on settings in your environment. That
is, the same program with the same data might work on one computer but
crash on another computer. Using the above incantations is your first
line of defense against this stupidity.

Then you need to be sure that every single method that you call in other
people’s libraries, which takes string arguments or returns string
values, behaves in the way you want. For example, if you call
Library.foo and it returns a string whose encoding is UTF-8 and contains
characters with the high bit set, and you try to concatenate it with one
of your own binary strings, the program will crash.

Here’s a somewhat contrived example:

-------- main.rb (your program) --------

# encoding: ASCII-8BIT

require 'library'
binary_data = "\xff\xee\xdd"
msg = Library.err_to_str
binary_data << [msg.bytesize].pack("N")
binary_data << msg

-------- library.rb (someone else’s code that you don’t control) --------

# encoding: UTF-8

module Library
  def self.err_to_str
    "über-error"
  end
end

$ ruby19 main.rb
main.rb:7:in `<main>': incompatible character encodings: ASCII-8BIT and
UTF-8 (Encoding::CompatibilityError)

Your only way to protect against this is to force encodings at every
point where two strings of differing provenance might encounter each
other. e.g.

msg = Library.err_to_str
binary_data << [msg.bytesize].pack("N")
msg.force_encoding "ASCII-8BIT"
binary_data << msg
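
One caveat with the snippet above: force_encoding relabels the receiver in place, so it changes the string object the library handed back. A possible variation, sketched here with a hypothetical helper named to_binary and a literal standing in for Library.err_to_str, works on a copy instead:

```ruby
# Hypothetical helper: return a binary-tagged copy without touching the
# caller's string. force_encoding only relabels the bytes; it never converts.
def to_binary(str)
  str.dup.force_encoding("ASCII-8BIT")
end

binary_data = "\xff\xee\xdd".dup.force_encoding("ASCII-8BIT")
msg = "\u00FCber-error"               # stands in for Library.err_to_str

binary_data << [msg.bytesize].pack("N")
binary_data << to_binary(msg)

p binary_data.bytesize   # 18 = 3 + 4 (length prefix) + 11
p msg.encoding           # still UTF-8; the original was not relabelled
```

Copying costs an allocation per call, but it avoids surprising whoever else holds a reference to the library's string.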

Beware also that ruby 1.9’s documentation is often either missing or
misleading when it comes to character encodings. For example, ri19
Array#pack says:

  Directive    Meaning
  ---------------------------------------------------------------
      @     |  Moves to absolute position
      A     |  arbitrary binary string (space padded, count is width)
      a     |  arbitrary binary string (null padded, count is width)

So you might expect that an arbitrary String can be packed using a*:

# encoding: ASCII-8BIT

require 'library'
binary_data = "\xff\xee\xdd"
msg = Library.err_to_str
binary_data << [msg.bytesize,msg].pack("Na*") # CRASH
puts binary_data.inspect

No, you still need a msg.force_encoding "ASCII-8BIT" before the pack.
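
Here is a sketch of the forced variant as a pack/unpack round trip. The literal stands in for the library's return value, and I relabel a copy rather than the original; whether the unforced call raises may well depend on the Ruby version, so this only shows the path that sidesteps the question.

```ruby
msg = "\u00FCber-error"   # stands in for the library's UTF-8 string

# Relabel a copy as binary, then pack a 32-bit big-endian length and the bytes.
payload = msg.dup.force_encoding("ASCII-8BIT")
record  = [payload.bytesize, payload].pack("Na*")

# Unpacking recovers the length and the raw bytes...
length, raw = record.unpack("Na*")
# ...and relabelling (not converting) restores the original text.
restored = raw.force_encoding("UTF-8")

p length             # 11
p restored == msg    # true
```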

If all this scares you - and it does me - then remember that staying
with ruby 1.8 is a reasonable alternative. Ruby 1.8.6 is going to be
maintained for a long time going forward, thanks to the people at
EngineYard and Phusion Passenger.

HTH,

Brian.

Kless · July 27, 2009, 4:04pm

Brian C. wrote:

-------- main.rb (your program) --------

# encoding: ASCII-8BIT

require 'library'
binary_data = "\xff\xee\xdd"
msg = Library.err_to_str
binary_data << [msg.bytesize].pack("N")
binary_data << msg

-------- library.rb (someone else’s code that you don’t control) --------

# encoding: UTF-8

module Library
  def self.err_to_str
    "über-error"
  end
end

$ ruby19 main.rb
main.rb:7:in `<main>': incompatible character encodings: ASCII-8BIT and
UTF-8 (Encoding::CompatibilityError)

I should add: if ruby 1.9 always gave an exception when an ASCII-8BIT
string encountered a UTF-8 String, it wouldn’t be a problem: your unit
tests would pick up the failure quickly.

But maybe in this library you’re using, 99% of the error messages don’t
have any extended characters (i.e. those with the top bit set). Those
will work fine, even if tagged as UTF-8. It’s only on the occasion where
the library decides to return a string which is tagged UTF-8 and
contains extended characters that the runtime crash will occur - and
this means you’re always wondering whether you have sufficient coverage.
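
To make the content-dependence concrete, here is a small sketch with two made-up messages that carry the same UTF-8 tag but differ in whether they contain a high-bit character:

```ruby
binary    = "\xff\xee\xdd".dup.force_encoding("ASCII-8BIT")
ascii_msg = "disk full".encode("UTF-8")   # UTF-8, but 7-bit only
utf8_msg  = "\u00FCber-error"             # UTF-8 with a high-bit character

# ASCII-only text is compatible with the binary string, so this succeeds.
ok = binary + ascii_msg

# Same encodings, different content: this one raises.
error = begin
  binary + utf8_msg
  nil
rescue Encoding::CompatibilityError => e
  e
end

p ok.bytesize                                  # 12
p error.class                                  # Encoding::CompatibilityError
p Encoding.compatible?(binary, utf8_msg)       # nil, i.e. not compatible
```

Encoding.compatible? and String#ascii_only? can be used to probe for the problem without triggering the exception.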

As a workaround, you might have to add extra unit tests which stub out
the library and force it to return a message with high-bit characters in
it, and check that your program behaves as expected. But mocking every
single library API which might return a string is really painful.
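
A rough sketch of such a stub, without any test framework. The Library module here is a stand-in for the one in the example, and frame_error and with_stubbed_error are hypothetical names for the code under test and the stubbing helper:

```ruby
# Stand-in for the third-party library from the example in this thread.
module Library
  def self.err_to_str
    "plain error"
  end
end

# Code under test: a hypothetical framing routine that forces binary.
def frame_error
  msg = Library.err_to_str.dup.force_encoding("ASCII-8BIT")
  [msg.bytesize].pack("N") << msg
end

# Temporarily replace Library.err_to_str so it returns the given text.
def with_stubbed_error(text)
  original = Library.method(:err_to_str)
  Library.define_singleton_method(:err_to_str) { text }
  begin
    yield
  ensure
    # Put the real method back even if the block raised.
    Library.define_singleton_method(:err_to_str, original)
  end
end

framed = with_stubbed_error("\u00FCber-error") { frame_error }
p framed.bytesize   # 15 = 4 (length prefix) + 11
```

In a real suite the same idea would go through whatever mocking facility the test framework provides; the pain point remains that one such stub is needed per string-returning API.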

Kless · July 27, 2009, 11:18pm

What would be cool would be a ruby 1.9 where the whole encoding stuff is
completely optional - so that things would work like in ruby 1.8

Kless · July 27, 2009, 6:04pm

Having
said that, strings with backslash-escapes like that will probably get
ASCII-8BIT by default.

P.S: to check this you must actually write and run a standalone program
file. irb is not a good predictor of behaviour, nor is piping a program
to ruby on stdin.

$ irb19 --simple-prompt
>> "\xff".encoding
=> #<Encoding:UTF-8>
>> ^D

$ ruby19
p "\xff".encoding
^D
#<Encoding:UTF-8>

$ cat >test.rb
p "\xff".encoding
^D
$ ruby19 test.rb
#<Encoding:ASCII-8BIT>
$

This is with:

$ ruby19 -v
ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]

compiled from source under Ubuntu Jaunty.
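
For anyone repeating the check, a slightly fuller version of test.rb might look like the sketch below (the file name check_literal.rb is made up). I have deliberately left out expected output for the first part, since it depends on the Ruby version and the environment; in particular, Ruby 2.0 and later use UTF-8 as the default source encoding, so the result will not necessarily match the 1.9.2dev run above.

```ruby
# check_literal.rb: save this as a file and run it as `ruby check_literal.rb`.
literal = "\xff"

report = {
  :source_encoding  => __ENCODING__,
  :literal_encoding => literal.encoding,
  :default_external => Encoding.default_external
}
report.each { |name, enc| puts "#{name}: #{enc}" }

# If the bytes must be binary no matter what the above prints, say so explicitly.
explicit = "\xff".dup.force_encoding("ASCII-8BIT")
p explicit.encoding == Encoding::ASCII_8BIT   # true
p explicit.bytes.to_a                          # [255]
```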