Converting between ASCII-8BIT and UTF-8

addis_a · November 4, 2014, 7:52pm

I have an issue where I have a string of arbitrary data in Ruby, which
is encoded using ASCII-8BIT (the output from File.binread) that needs to
be sent to another system running Python. On the receiving end it
expects the data to be UTF-8 encoded [1].

On the Ruby side I’ve tried such things as:

asciidata = File.binread('/path/to/file')

data = asciidata.encode('UTF-8')

also:

data = asciidata.dup.encode('UTF-8')

While the resulting data reference claims it’s valid UTF-8 encoded data,
the receiver says it’s not and chokes with:

'utf8' codec can't decode byte 0x9b in position 10: invalid start byte

I’m a bit lost on this topic. How do I get Ruby to verify that the data
is properly encoded as UTF-8 and, if not, transcode it before sending?

[1] Actually, our wire protocol requires such data to be UTF-8 encoded

mcpierce · November 4, 2014, 10:03pm

Darryl Pierce wrote in post #1161719:

I have an issue where I have a string of arbitrary data in Ruby, which
is encoded using ASCII-8BIT (the output from File.binread) that needs to
be sent to another system running Python. On the receiving end it
expects the data to be UTF-8 encoded [1].

UTF-8 is an encoding of Unicode values. There is no bijection between
arbitrary binary data and Unicode: some byte sequences are simply not
valid UTF-8.

data = asciidata.dup.encode('UTF-8')

A successful encode does not mean that the data contains valid values.
Try matching the result against a regexp and you will detect the errors.

'utf8' codec can't decode byte 0x9b in position 10: invalid start byte

It is impossible to interpret arbitrary binary data as UTF-8. You must
change your protocol, either:

change UTF-8 to binary, or
pre-encode your data: Base64 will work (binary => Base64 => UTF-8).
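
The Base64 route suggested above could look roughly like this. It is only a sketch: the helper names wrap_binary and unwrap_binary are made up for illustration, and it assumes Ruby 2.0 or later (for String#b). It uses the 'm0' pack directive, which is what Base64.strict_encode64 in the standard library is built on.

```ruby
# Hypothetical helpers (not part of any library): make arbitrary bytes
# safe to carry in a field that must be valid UTF-8.
def wrap_binary(raw)
  # 'm0' is strict Base64 without line breaks; the output is pure ASCII,
  # so transcoding it to UTF-8 always succeeds.
  [raw].pack('m0').encode('UTF-8')
end

def unwrap_binary(text)
  # Returns the original bytes as a binary (ASCII-8BIT) string.
  text.unpack('m0').first
end

raw  = "\x00\x9B\xFF binary".b   # bytes that are not valid UTF-8
wire = wrap_binary(raw)

puts wire.valid_encoding?        # true
puts unwrap_binary(wire) == raw  # true
```

The cost is size (Base64 output is about a third larger than the input), and the receiver has to know that the field is Base64 and decode it.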

regards,
regis

mcpierce · November 5, 2014, 2:15pm

Hi Darryl,

“Darryl L. Pierce” [email protected] writes:

I have an issue where I have a string of arbitrary data in Ruby, which
is encoded using ASCII-8BIT (the output from File.binread) that needs to
be sent to another system running Python. On the receiving end it
expects the data to be UTF-8 encoded [1].

ASCII-8BIT is not really an “encoding”. It means that you have binary
data, i.e. raw bytes, not something text-based. Hence, transcoding from
ASCII-8BIT, whose alias is BINARY btw., is not a meaningful operation.

If you can guarantee that your ASCII-8BIT string contains only 7-bit
characters, i.e. real 7-bit ASCII, then you can be happy: any 7-bit
ASCII string is automatically a valid UTF-8 string, because the first
128 code points of UTF-8 are encoded exactly like 7-bit ASCII. In that
case no transcoding is necessary.
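
A small sketch of that check, assuming Ruby 2.0 or later (String#b is used only to build sample data). The helper name ascii_as_utf8 is hypothetical.

```ruby
# Hypothetical helper: returns a UTF-8 tagged copy when every byte is
# 7-bit ASCII, and nil otherwise. The bytes themselves are not changed.
def ascii_as_utf8(bytes)
  return nil unless bytes.ascii_only?
  bytes.dup.force_encoding('UTF-8')
end

plain  = 'plain ASCII text'.b   # tagged ASCII-8BIT, like File.binread output
binary = "\x9B\x00\xFF".b       # contains bytes above 0x7F

p ascii_as_utf8(plain).encoding # => #<Encoding:UTF-8>
p ascii_as_utf8(binary)         # => nil
```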

The way you have phrased your question it is to be assumed that you
receive arbitrary binary data, e.g. you are reading in image files like
JPEG. So, you are basically asking, “how do I transcode a JPEG image to
UTF-8?”. It should be obvious that these are two entirely different
concepts.

If you aren’t reading in JPEG files with your Ruby program, but rather
textual data, then it will be encoded in some encoding, maybe
Windows-1252, which you can then tell Ruby by using a construct such as
this one:

data = File.open('yourfile', 'r:Windows-1252') { |f| f.read }

Ruby itself cannot know the encoding of textual input unless you tell it
the encoding as shown above. Once you have obtained a correctly tagged
string, you can then transcode that to UTF-8:

utf8 = data.encode('UTF-8')

As a helpful piece of extra information, you can use
String#valid_encoding? to test if the string you have is entirely valid
in the encoding it has been tagged with.
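
To make the tag-then-transcode sequence concrete, here is a sketch that works on an in-memory string instead of a file. The sample bytes and the guess that they are Windows-1252 are assumptions for the example; 0x9B is the byte the Python receiver complained about, and in Windows-1252 it is the character U+203A.

```ruby
# Sample bytes as they might come off disk (assumed, for illustration).
raw = "caf\xE9 \x9B".b

# Read as UTF-8, these bytes are invalid.
puts raw.dup.force_encoding('UTF-8').valid_encoding?  # false

# Assumption: the text was actually written as Windows-1252.
tagged = raw.dup.force_encoding('Windows-1252')

# Now a real transcode is possible: the characters stay, the bytes change.
utf8 = tagged.encode('UTF-8')
puts utf8.valid_encoding?          # true
puts utf8 == "caf\u00E9 \u203A"    # true
```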

Does this clear up things for you?

Vale,
Quintus


mcpierce · November 5, 2014, 4:25pm

On Tue, Nov 04, 2014 at 09:12:18PM +0100, Quintus wrote:

data, i.e. raw bytes, not something text-based. Hence, transcoding from
JPEG. So, you are basically asking, “how do I transcode a JPEG image to
UTF-8?”. It should be obvious that these are two entirely different
concepts.

Yeah, sorry. Please don’t let my example confuse things. I was simply
using loading a binary file from disk as a way of getting a bunch of
random data. A bad example on my part.

What I have is this: my project has the concept of a message and a
messenger that can send/receive messages. The wire protocol requires
that any string of data be UTF-8 encoded.

In the Message class we have body= and body for setting and getting the
body of the message. The user has the ability to set the format of the
message body and then set the content, and one such format is named
“DATA” for arbitrary data which must be UTF-8 encoded.

utf8 = data.encode('UTF-8')

As a helpful piece of extra information, you can use
String#valid_encoding? to test if the string you have is entirely valid
in the encoding it has been tagged with.

Does this clear up things for you?

It makes my head ache, for certain.

Given the above parameters for our project, I’m thinking I could either
1) check the value passed into body= and, if it’s not UTF-8 encoded,
raise an exception telling the caller to pass in properly encoded data,
or 2) find a way to transcode the data to UTF-8.

My preference would be for (1) and to put the burden on the caller to
ensure their data is fit for our wire protocol. I don’t want to have to
do any transcoding unless it’s baked into the language.
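
Option (1) might be sketched like this. The Message class here is a stand-in with only the body accessor, not the project's real class, and the choice of ArgumentError is an assumption.

```ruby
# Hypothetical stand-in for the project's Message class.
class Message
  attr_reader :body

  # Accept only strings that are tagged UTF-8 and whose bytes really are
  # valid UTF-8; anything else is the caller's problem.
  def body=(value)
    unless value.encoding == Encoding::UTF_8 && value.valid_encoding?
      raise ArgumentError, "body must be valid UTF-8, got #{value.encoding}"
    end
    @body = value
  end
end

msg = Message.new
msg.body = "h\u00E9llo"           # accepted

begin
  msg.body = "\x9B\x00\xFF".b     # binary data: rejected
rescue ArgumentError => e
  puts e.message
end
```

Note that this check is stricter than it needs to be: a string tagged US-ASCII is rejected too, even though its bytes would be valid UTF-8.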

mcpierce · November 5, 2014, 6:04pm

On Tue, Nov 04, 2014 at 09:12:18PM +0100, Quintus wrote:

data, i.e. raw bytes, not something text-based. Hence, transcoding from
JPEG. So, you are basically asking, “how do I transcode a JPEG image to
the encoding as shown above. Once you have obtained a correctly tagged
string, you can then transcode that to UTF-8:

utf8 = data.encode('UTF-8')

As a helpful piece of extra information, you can use
String#valid_encoding? to test if the string you have is entirely valid
in the encoding it has been tagged with.

Does this clear up things for you?

Okay, so this discussion and a little reading and I think I understand
the issue a bit better.

So that said I now have an extension on the issue. To solve our problem
we peek at the String#encoding value for the string:

if it says it’s UTF-8 we treat it as such, otherwise
we try to read it as UTF-8 with value.force_encoding('UTF-8') and
check the valid_encoding? result, otherwise
we treat it as a binary string.

These work for us. But the problem is how to do this on Ruby 1.8 (we
support 1.8.7 up)?
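
For reference, the three steps described above might look like this on Ruby 1.9 or later (it does not address the 1.8 question). The method name classify_body and the [:utf8, ...] / [:binary, ...] return shape are invented for the sketch.

```ruby
# Hypothetical helper: returns [:utf8, string] or [:binary, string].
def classify_body(value)
  # Step 1: trust a string that is already tagged UTF-8
  # (as described; its validity is not rechecked here).
  return [:utf8, value] if value.encoding == Encoding::UTF_8

  # Step 2: retag a copy as UTF-8 and see whether the bytes are valid.
  candidate = value.dup.force_encoding('UTF-8')
  return [:utf8, candidate] if candidate.valid_encoding?

  # Step 3: fall back to treating the bytes as binary.
  [:binary, value.dup.force_encoding('BINARY')]
end

p classify_body("h\u00E9llo").first       # => :utf8
p classify_body("h\xC3\xA9llo".b).first   # => :utf8
p classify_body("\x9B\x00\xFF".b).first   # => :binary
```

The helper works on copies because force_encoding modifies its receiver, which would otherwise retag the caller's string.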

mcpierce · November 6, 2014, 12:04am

Subject: Re: Converting between ASCII-8BIT and UTF-8
Date: Tue 04 Nov 14 05:22:25PM -0500

Quoting Darryl L. Pierce ([email protected]):

What I have is this: my project has the concept of a message and a
messenger that can send/receive messages. The wire protocol requires
that any string of data be UTF-8 encoded.

Strings are just sequences of bits, like other data. But while with
binary, ASCII and older per-country encodings you take the bits one
byte at a time, in UTF-8 you may have to take them one, two, three or
four bytes at a time. And there are illegal sequences.

In Ruby, since 1.9, a string is always associated with an encoding;
since 2.0, if I recall correctly, UTF-8 is the default for string
literals.

s = 'ábc'
p s.encoding # => #<Encoding:UTF-8>

There are two main operations you can do with your string regarding
encoding. First, you can transcode:

s.encode!('ISO-8859-1')

In this case, the first character (lower case ‘a’ with acute accent,
represented as two bytes \xC3\xA1 in UTF-8) is changed to its
equivalent in ISO8859.1, the single byte \xE1.

The second operation allows you to keep the exact sequence of bytes,
but tells the system that it has to interpret the string with another
encoding:

s.force_encoding('ISO-8859-1')

In this case, the new string, interpreted as ISO8859.1, will be

Ã¡bc

because in ISO8859.1 \xC3 is A with tilde, and \xA1 is the inverted
exclamation mark.

Every time, you have the possibility to inspect the exact bytes that
make up a string:

b = s.bytes
p b # => [195, 161, 98, 99]
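
Putting the two operations side by side, as a sketch on Ruby 2.0 or later. The string is written with a \u escape so the example does not depend on the source file's encoding, and non-destructive encode and dup are used so that both operations start from the same UTF-8 string.

```ruby
s = "\u00E1bc"   # a-acute followed by "bc", tagged UTF-8

# Transcoding: the characters stay the same, the bytes change.
transcoded = s.encode('ISO-8859-1')
p transcoded.bytes    # => [225, 98, 99]

# Retagging: the bytes stay the same, the characters change.
retagged = s.dup.force_encoding('ISO-8859-1')
p retagged.bytes      # => [195, 161, 98, 99]
p retagged.length     # => 4

# Seen as ISO-8859-1, the first two bytes are A-tilde and inverted "!".
p retagged.encode('UTF-8') == "\u00C3\u00A1bc"   # => true
```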

In your case, since you receive stuff, it should be the other party’s
responsibility to make sure the strings are proper UTF-8. What you
should not do is mangle it. If I were you, in order not to be mistaken
I’d get the string as array of bytes, and write a method that’s
something like this:

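def massage_input(array)
  # 'C*' packs each element as an unsigned byte (0..255)
  s = array.pack('C*').force_encoding('UTF-8')
  unless s.valid_encoding?
    # complain in some way; raising is one option
    raise ArgumentError, 'input is not valid UTF-8'
  end
  s
end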

and then make sure I do not modify the string I receive anymore.

Carlo

mcpierce · November 5, 2014, 6:34pm

On Nov 4, 2014, at 12:12, Quintus [email protected] wrote:

ASCII-8BIT is not really an encoding. It means that you have binary
data, i.e. raw bytes, not something text-based. Hence, transcoding from
ASCII-8BIT, whose alias is BINARY btw., is not a meaningful operation.

No. It means you CAN have binary data. It doesn’t mean you do. It is a
meaningful operation if the data being re-encoded is meaningful to the
new encoding.