Unicode Question

Daly · April 23, 2009, 3:11am

Hello all,

I download a file from a website and one of the lines looks like this:

row = "\0001\0002\0003\000\t\0002\0004\0008\0004\0005\000\t\000C\000o
\000m\000p\000U\000S\000A\000\t\000W
\0006\0003\0001\0000\0003\0000\0006\0000\0001\0000\0001\000\t
\0002\0000\0000\0009\000-\0000\0003\000-\0002\0007\000\t
\0001\0005\000:\0005\0001\000:\0000\0000\000\t\000U\000L\000T\000-\000D
\0002\000P\000K\000\t\0000\000.\0005\000\t\0001\000\t
\0000\000.\0000\0001\000\t\0002\0000\0000\0009\000-\0000\0003\000-
\0002\0008\000\t\0001\0003\000:\0003\0006\000:\0002\0004\000\r\000\n"

On my Mac, if I do:

Iconv.iconv("UTF8", "UCS-2", row)

I get:

[“123\t24845\tCompUSA\tW63103060101\t2009-03-27\t15:51:00\tULT-D2PK
\t0.5\t1\t0.01\t2009-03-28\t13:36:24\r\n”]

Which is exactly right. On the production Linux box (Ubuntu 8.04),
doing the same thing yields:

[“ã„€ãˆ€ãŒ€à¤€ãˆ€ã€ã €ã€ã”€à¤€äŒ€æ¼€æ´€ç€€å”€åŒ€ä„€à¤€åœ€ã˜€ãŒ€ã„€ã€€ãŒ€ã€€ã˜€ã€€ã„€ã€€ã„€à¤€ãˆ€ã€€ã€€ã¤€â´€ã€€ãŒ€â´€ãˆ€ãœ€à¤€ã„€ã”€ã¨€ã”€ã„€ã¨€ã€€ã€€à¤€å”€ä°€å€â´€ä€ãˆ€å€€ä¬€à¤€ã€€â¸€ã”€à¤€ã„€à¤€
ã€€â¸€ã€€ã„€à¤€ãˆ€ã€€ã€€ã¤€â´€ã€€ãŒ€â´€ãˆ€ã €à¤€ã„€ãŒ€ã¨€ãŒ€ã˜€ã¨€ãˆ€ã€à´€à¨€”]

I figured it out and fixed it by doing:

Iconv.iconv("UTF8", "UCS-2BE", row)

Which works on both environments. I fixed it by reading about encoding
and trial and error, so I’m left with a working solution, but not
knowing why it works on the Mac but not in Linux. Could someone please
explain?

Thanks,
Ahmed

Daly · April 23, 2009, 3:44am

Hi,

At Thu, 23 Apr 2009 10:10:45 +0900,
Daly wrote in [ruby-talk:334732]:

I figured it out and fixed it by doing:

Iconv.iconv("UTF8", "UCS-2BE", row)

Which works on both environments. I fixed it by reading about encoding
and trial and error, so I’m left with a working solution, but not
knowing why it works on the Mac but not in Linux. Could someone please
explain?

It seems to be a bug in Ubuntu's iconv. According to the Unicode
Consortium, UCS-2 defaults to big endian if no BOM is present.

Specify the endianness explicitly anyway. Ruby 1.9 does not provide UCS
encoding names without an endianness suffix.
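
To illustrate why the byte order matters, here is a minimal sketch that decodes the first bytes of a row like the one above both ways. It is not taken from the thread: it assumes Ruby 2.0 or later, where String#encode takes the place of Iconv, and it uses the UTF-16BE and UTF-16LE names, which behave the same as UCS-2 for these characters. The variable names are only for illustration.

```ruby
# Bytes for "12\t" as big-endian UCS-2 with no BOM, like the start of the sample row.
raw = "\x001\x002\x00\t".b

# Reading the bytes as big endian gives the intended text.
as_be = raw.dup.force_encoding("UTF-16BE").encode("UTF-8")

# Reading the same bytes as little endian swaps each byte pair,
# so 00 31 becomes U+3100 instead of U+0031, and so on.
as_le = raw.dup.force_encoding("UTF-16LE").encode("UTF-8")

puts as_be.inspect
puts as_le.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```

The little-endian reading yields U+3100, U+3200 and U+0900. Their UTF-8 bytes, shown as Latin-1 text, match the garbage printed on the Linux box, which suggests that iconv there treated the unmarked input as little endian.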

Daly · April 23, 2009, 4:38am

On Apr 22, 2009, at 8:10 PM, Daly wrote:

I fixed it by reading about encoding
and trial and error, so I’m left with a working solution, but not
knowing why it works on the Mac but not in Linux. Could someone please
explain?

I’ve been trying to cover character encodings with a heavy Ruby
slant on my blog for just this reason:

http://blog.grayproductions.net/articles/understanding_m17n

My coverage is about 98% complete now, in case you want to browse a bit.

Of course, in this case it just seems that you hit a bug as Nobu said.

James Edward G. II

Daly · April 23, 2009, 3:16pm

Hi James,

75% of my research was done on your blog.

Thanks for your answers.

Daly · April 24, 2009, 1:06pm

On Thu, 2009-04-23 at 11:37 +0900, James G. wrote:

http://blog.grayproductions.net/articles/understanding_m17n

My coverage is about 98% complete now, in case you want to browse a bit.

Hi James,

I was wondering if you wanted to add a note on how to deal with
potentially unknown character encodings. This was one of the more
annoying problems that I hit when trying to use Iconv directly in Ruby
1.8. In the past, I’d used the Mozilla character detection library (in
Java) when doing processing of XML in order to ensure that the file
encoding matched the declaration in the <?xml > header.

Fortunately, I recently found the rchardet gem, which is a port of this
library to Ruby, and it’s helped me give Iconv more accurate
encoding information.

Usage goes something like this:

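require 'rchardet'
require 'iconv'

# Guess the encoding of the raw text (a String of unknown encoding).
cd = CharDet.detect(text)
encoding = cd['encoding']
puts "Reading detected encoding '#{encoding}' text with confidence: %.2f%%" % [cd['confidence'] * 100]
# Convert to UTF-8 using the detected encoding.
iconv = Iconv.new("UTF-8", encoding)
utf8_text = iconv.iconv(text)
puts "Conversion to UTF-8 successful."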
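
As a complement to statistical detection, a byte order mark, when present, identifies the encoding directly. The following is a minimal sketch of that different technique, not part of rchardet or of the code above. The helpers `sniff_bom` and `to_utf8` are hypothetical names, and the sketch assumes Ruby 2.0 or later with String#encode rather than Iconv.

```ruby
# Known byte order marks, longest first so that UTF-32LE (FF FE 00 00)
# is not mistaken for UTF-16LE (FF FE).
BOMS = [
  ["\x00\x00\xFE\xFF".b, "UTF-32BE"],
  ["\xFF\xFE\x00\x00".b, "UTF-32LE"],
  ["\xEF\xBB\xBF".b,     "UTF-8"],
  ["\xFE\xFF".b,         "UTF-16BE"],
  ["\xFF\xFE".b,         "UTF-16LE"]
].freeze

# Hypothetical helper: returns [encoding_name, bom_length],
# or nil when the data does not start with a known BOM.
def sniff_bom(bytes)
  data = bytes.b
  BOMS.each do |bom, name|
    return [name, bom.bytesize] if data.start_with?(bom)
  end
  nil
end

# Hypothetical helper: converts to UTF-8 using the BOM if there is one,
# otherwise the caller's fallback (for example, a detector's guess).
def to_utf8(bytes, fallback)
  name, skip = sniff_bom(bytes) || [fallback, 0]
  bytes.b.byteslice(skip..-1).force_encoding(name).encode("UTF-8")
end
```

For example, `to_utf8(data, "UTF-16BE")` treats unmarked input as big endian, in line with the default Nobu describes, while still honoring a little-endian BOM. One known limitation of this sketch: UTF-16LE text whose first character is U+0000 would be read as UTF-32LE.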

This time, I needed this sort of thing when trying to ensure I could
load arbitrary text files from unknown sources into GTK+ widgets.

I’ve actually rarely been in the case where I knew the encoding of the
input I was trying to deal with if it wasn’t the same as the system
default, but maybe that’s just me…

The referenced blog post looks really good. Thanks for your efforts.

Cheers,

ast

Daly · April 24, 2009, 4:41pm

On Apr 24, 2009, at 6:05 AM, Andrew S. Townley wrote:

I was wondering if you wanted to add a note on how to deal with
potentially unknown character encodings.

Thanks for the information. I just added a comment with a link to
your email here:

http://blog.grayproductions.net/articles/general_encoding_strategies

James Edward G. II