UTF8 hell

Hello,
I’m trying to deal with Ruby’s encoding flaws, which I thought would be
mostly behind us with Ruby 1.9. I managed to find a solution for
Ruby 1.8 and thought I had for Ruby 1.9… but in fact, no!

I fetch rows from a UTF-8 database and try to work with the strings. To
do so, I would like them to be UTF-8 encoded.

str.encoding gives me “ASCII-8BIT”, so I thought one of these
lines would solve the problem:

str.replace(Iconv.iconv('UTF8', 'ascii', self).join())
OR
self.encode!('UTF-8')

But they don’t!
First one: in `iconv': "\xE8te pour luth" (Iconv::IllegalSequence)
Second one: in `encode!': "\xE8" from ASCII-8BIT to UTF-8
(Encoding::UndefinedConversionError)

The base string is “Oeuvre complète pour luth” and it displays well in
phpMyAdmin.

Any idea?
TIA,

On Tuesday 02 February 2010, Xavier Noëlle wrote:

|str.replace(Iconv.iconv('UTF8', 'ascii', self).join())
|
|Any idea?
|TIA,

I’m not sure, but based on my experience, it may be that the strings are
indeed stored as UTF-8, but the library you use to read from the
database doesn’t inform Ruby of that fact, so Ruby assumes it is a
generic array of bytes (which means Ruby thinks the string has encoding
ASCII-8BIT, which is the same as BINARY).

If this is the case, you don’t need to transcode the string (which is
what encode does), but simply tell Ruby which encoding is the correct
one, using the force_encoding method.
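For example (a minimal sketch; the byte values here stand in for
whatever your database driver actually returned):

```ruby
# UTF-8 bytes for "café", but labeled as binary (ASCII-8BIT), the way a
# database driver that doesn't set encodings might hand them to you.
bytes = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack('C*')
bytes.encoding                       # => #<Encoding:ASCII-8BIT>

# force_encoding relabels the same bytes; encode would transcode them.
str = bytes.force_encoding('UTF-8')
str.valid_encoding?                  # => true
str                                  # => "café"
```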

I hope this helps

Stefano

I fetch rows from an UTF8 database and try to work with the string. To
do so, I would like it to be UTF8 encoded.

There are several pieces to this. Even if the DB encoding and collation
is utf8, doublecheck that the client connection is utf8 as well
('encoding: utf8' in database.yml for a Rails app, I think).

self.encode!('UTF-8')

str.force_encoding('UTF-8') is what you want to use, I think.

:-)

2010/2/2 David P. [email protected]:

There are several pieces to this. Even if the DB encoding and collation is utf8, doublecheck that the client connection is utf8 as well (“encoding: utf8” in database.yml for a Rails app I think).

Not a Rails app :-)

str.force_encoding('UTF-8') is what you want to use, I think.

I already tried this method, but it led me to the following error: in
`downcase!': invalid byte sequence in UTF-8 (ArgumentError).

This is due to a call to str.downcase! later in the application.

Any idea how to solve this? :-)

2010/2/2 Xavier Noëlle [email protected]:

This is due to a call to str.downcase! later in the application.

Any idea how to solve this? :-)

You probably first want to find out whether the byte sequence is valid
UTF-8 or not. For that you would need to look at the bytes in the
String. I guess chances are that your String’s byte sequence is NOT
valid UTF-8 OR you have a character in the string that has no
lowercase representation.
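A minimal sketch of that inspection (the byte 0xE9 here is hypothetical,
standing in for whatever the database actually returned):

```ruby
# A string labeled UTF-8 whose second byte (0xE9, decimal 233) is a lone
# high byte with no continuation bytes, so it cannot be valid UTF-8.
str = "m\xE9dicals"
str.bytes            # => [109, 233, 100, 105, 99, 97, 108, 115]
str.valid_encoding?  # => false, which is why downcase! raises
```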

Kind regards

robert

How does Python solve this?

2010/2/2 Robert K. [email protected]:

You probably first want to find out whether the byte sequence is valid
UTF-8 or not. For that you would need to look at the bytes in the
String. I guess chances are that your String’s byte sequence is NOT
valid UTF-8 OR you have a character in the string that has no
lowercase representation.

Kind regards

robert

I dug into the problem and ended up with this line:
self.force_encoding('UTF-8')
Trusting the string’s #encoding turned out to be the wrong choice, so
instead I assumed the database provided valid UTF-8 strings.

BUT (because, there’s a but…), for some reason I don’t understand,
some strings are unwilling to work:

Example:
puts self => médicals
self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115

233 is, AFAIK, a valid UTF-8 character, but calling gsub with anything
(e.g. self.gsub('ruby', 'zorglub')) on this string leads to: `gsub':
invalid byte sequence in UTF-8 (ArgumentError).

Where am I wrong?

TIA,

On Tue, Feb 23, 2010 at 9:41 AM, Yukihiro M. [email protected]
wrote:

233 is not a valid UTF-8 character. The byte sequence for médicals is
<109 195 169 100 105 99 97 108 115>.

233 for e accent acute would be valid for ISO-8859-1 encoding, not
UTF-8.
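You can check this in irb (a quick sketch, assuming a UTF-8 source
encoding):

```ruby
# é is a single byte (233) in ISO-8859-1 but two bytes (195 169) in UTF-8.
e_acute = "é"
e_acute.bytes                       # => [195, 169]
e_acute.encode('ISO-8859-1').bytes  # => [233]
```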


Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale

Hi,

In message “Re: [ENCODING] UTF8 hell”
on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle
[email protected] writes:

|self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
|
|233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
|self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
|byte sequence in UTF-8 (ArgumentError).

233 is not a valid UTF-8 character. The byte sequence for médicals is
<109 195 169 100 105 99 97 108 115>.

          matz.

2010/2/23 Yukihiro M. [email protected]:

233 is not a valid UTF-8 character. The byte sequence for médicals is
<109 195 169 100 105 99 97 108 115>.

Indeed. In the meantime, I changed the code to this:

def isUTF8()
  begin
    self.unpack('U*')
  rescue
    return false
  end
  return true
end

if isUTF8()
  self.force_encoding('UTF-8')
else
  self.force_encoding('ISO-8859-1')
  self.encode!('UTF-8')
end

This (ugly) quickfix works for what I need, but I don’t know if this
problem can be resolved in another way. The problem is that my SQL
database has a VARBINARY column with an unknown encoding. Is there a way
to deal with the various possible encodings, or to ask MySQL to return
UTF-8-converted data, or is it necessary to clean the data before
inserting it?
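For what it’s worth, on Ruby 1.9+ the same fallback can be written with
String#valid_encoding? instead of the unpack/rescue dance (the method
name here is illustrative; it still only guesses between two encodings):

```ruby
# Try UTF-8 first; if the bytes are not valid UTF-8, assume ISO-8859-1
# and transcode. Every byte sequence is valid ISO-8859-1, so this cannot
# raise, but it can mislabel text that is in some other 8-bit encoding.
def to_utf8(raw)
  str = raw.dup.force_encoding('UTF-8')
  return str if str.valid_encoding?
  raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end
```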

Yukihiro M. wrote:

In message “Re: [ENCODING] UTF8 hell”
on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle [email protected] writes:
|self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
|
|233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
|self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
|byte sequence in UTF-8 (ArgumentError).
233 is not a valid UTF-8 character. The byte sequence for médicals is
<109 195 169 100 105 99 97 108 115>.

A general hint for debugging encoding troubles: the UTF-8 encoding
guarantees that every Unicode codepoint is either encoded into a
single octet with its most significant bit cleared to 0 (i.e. a
decimal value between 0 and 127) or into a sequence of 2 to 6
octets, all of which have their MSB set to 1 (i.e. a decimal value
between 128 and 255).

A single octet with its MSB set to 1 can never be a valid UTF-8
character, it can only be part of a multi-octet character, i.e. it
must appear either immediately before or after or between another
octet with its MSB set. However, in your string there is no
multi-octet character sequence, there is only a single character with
its MSB set (the second one with the decimal value 233), so you can
see without having to look at any code tables that this string
cannot possibly be a UTF-8 string.

As Rick already hinted, it is either an ISO/IEC 8859-1, 8859-2, 8859-3,
8859-4, 8859-9, 8859-10, 8859-13, 8859-14, 8859-15, 8859-16 or
Windows-1252 string (it’s impossible to tell, but it makes no difference
in this case). My guess is ISO-8859-15.

[This property is BTW what makes UTF-8 compatible with ASCII, because
it guarantees that every Unicode character which is also in ASCII,
will be encoded the same way as it would be in ASCII and every Unicode
character which is not in ASCII will be encoded as a sequence of
octets each of which is illegal in ASCII. It also provides some
robustness against 8-bit encodings such as the ISO8859 family, because
statistically it is very likely that somewhere in the text, there
will be a single octet with its MSB set (in this case, it’s the é and
in my name it’s the ö), which is surrounded by octets with their MSB
cleared, which cannot ever happen in UTF-8.]

jwm

A general hint for debugging encoding troubles: the UTF-8 encoding
guarantees that every Unicode codepoint is either encoded into a
single octet with its most significant bit cleared to 0 (i.e. a
decimal value between 0 and 127) or into a sequence of 2 to 6
octets, all of which have their MSB set to 1 (i.e. a decimal value
between 128 and 255).

Question: The sequence of 2 to 6 octets: is it always even? I.e. 2, 4,
or 6 but not 3 nor 5 octets?

Perry S. wrote:

A general hint for debugging encoding troubles: the UTF-8 encoding
guarantees that every Unicode codepoint is either encoded into a
single octet with its most significant bit cleared to 0 (i.e. a
decimal value between 0 and 127) or into a sequence of 2 to 6
octets, all of which have their MSB set to 1 (i.e. a decimal value
between 128 and 255).
Question: The sequence of 2 to 6 octets: is it always even? I.e. 2, 4,
or 6 but not 3 nor 5 octets?

Nope.

First off: I was wrong, the longest encoding is actually 4 octets,
not 6. (I was confused by the algorithm: the algorithm actually allows
for up to 8 bytes, but because of the way Unicode characters are
allocated, and UTF-8 is defined, it is guaranteed that there will
never be more than 4.)

The encodings look like this:

0xxxxxxx                            for ASCII
110xxxxx 10xxxxxx                   for U+80 to U+7FF
1110xxxx 10xxxxxx 10xxxxxx          for U+800 to U+FFFF and
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx for U+10000 to U+1FFFFF

This is actually pretty clever:

  • you can always tell whether you are inside a multibyte sequence or
    not because of the high bit,
  • you can always tell whether a byte in the sequence is the first one
    or a later one, because the first one always starts with 11 and the
    other ones always start with 10 and
  • you can always tell how long a sequence is by the number of 1 bits
    in the start byte: two-byte sequences start with two 1s, three-byte
    sequences start with three 1s and four-byte sequences start with
    four 1s.

This means that you can usually re-synchronize pretty easily from the
middle of a corrupted network transmission, for example. You can also
jump over bytes if you are counting the length.
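The lead-byte rules above can be sketched like this (a hypothetical
helper, not from the thread):

```ruby
# Classify a byte: ASCII, a continuation byte (10xxxxxx), or the lead
# byte of a 2-, 3- or 4-octet sequence, where the length can be read off
# from the number of high 1 bits in the lead byte.
def utf8_byte_kind(b)
  return [:ascii, 1]        if b < 0b1000_0000  # 0xxxxxxx
  return [:continuation, 0] if b < 0b1100_0000  # 10xxxxxx
  return [:lead, 2]         if b < 0b1110_0000  # 110xxxxx
  return [:lead, 3]         if b < 0b1111_0000  # 1110xxxx
  [:lead, 4]                                    # 11110xxx
end

utf8_byte_kind(233)  # => [:lead, 3], so a lone 233 with no continuation
                     #    bytes after it cannot be valid UTF-8
```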

jwm

On Wed, Feb 24, 2010 at 12:18 AM, Xavier Noëlle
[email protected] wrote:

 end
 return true
end

if isUTF8()
 self.force_encoding('UTF-8')
else
 self.force_encoding('ISO-8859-1')
 self.encode!('UTF-8')
end

string = "\xE8te pour luth"
=> "\xE8te pour luth"

string.encoding
=> #<Encoding:UTF-8>

string.valid_encoding?
=> false

string.force_encoding('ISO-8859-1')
=> "ète pour luth"

string.valid_encoding?
=> true

string.upcase
=> "èTE POUR LUTH"
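And to finish the job, after relabeling, an actual transcode yields a
string whose bytes really are UTF-8 (a sketch continuing the session
above):

```ruby
# Relabel the raw bytes as ISO-8859-1, then transcode them to UTF-8.
string = "\xE8te pour luth".b.force_encoding('ISO-8859-1')
utf8 = string.encode('UTF-8')
utf8.valid_encoding?  # => true
utf8.bytes.first(2)   # => [195, 168]  ("è" as two UTF-8 octets)
```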

On 23.02.2010 12:10, Xavier Noëlle wrote:

2010/2/2 Robert K. [email protected]:

You probably first want to find out whether the byte sequence is valid
UTF-8 or not. For that you would need to look at the bytes in the
String. I guess chances are that your String’s byte sequence is NOT
valid UTF-8 OR you have a character in the string that has no
lowercase representation.

I dug into the problem and ended up with this line:
self.force_encoding('UTF-8')
Believing that the string #encoding was right was a wrong choice, then
I assumed the database provided valid UTF-8 strings.

The string you show below does not look like UTF-8 encoded, probably
rather ISO-8859-1 or such. If you enforce an encoding you leave the
byte sequence untouched. This leads to the kind of error you describe
below.

Where am I wrong ?

As far as I can see, 233 (0xE9) starts a three-byte sequence:

http://en.wikipedia.org/wiki/UTF-8#Description

I did not dig deeper, but it may be that by forcing UTF-8 on an
ISO-8859-something encoded string you broke it.

Kind regards

robert
