Unicode Question


#1

Hello all,

I downloaded a file from a website, and one of the lines looks like this:

row = "\0001\0002\0003\000\t\0002\0004\0008\0004\0005\000\t\000C\000o" \
      "\000m\000p\000U\000S\000A\000\t\000W" \
      "\0006\0003\0001\0000\0003\0000\0006\0000\0001\0000\0001\000\t" \
      "\0002\0000\0000\0009\000-\0000\0003\000-\0002\0007\000\t" \
      "\0001\0005\000:\0005\0001\000:\0000\0000\000\t\000U\000L\000T\000-\000D" \
      "\0002\000P\000K\000\t\0000\000.\0005\000\t\0001\000\t" \
      "\0000\000.\0000\0001\000\t\0002\0000\0000\0009\000-\0000\0003\000-" \
      "\0002\0008\000\t\0001\0003\000:\0003\0006\000:\0002\0004\000\r\000\n"

On my Mac, if I do:

Iconv.iconv("UTF8", "UCS-2", row)

I get:

["123\t24845\tCompUSA\tW63103060101\t2009-03-27\t15:51:00\tULT-D2PK\t0.5\t1\t0.01\t2009-03-28\t13:36:24\r\n"]

Which is exactly right. On the production Linux box (Ubuntu 8.04),
doing the same thing yields:

[“㄀㈀㌀ऀ㈀㐀㠀㐀㔀ऀ䌀漀洀瀀唀匀䄀ऀ圀㘀㌀㄀ ㌀ 㘀 ㄀ ㄀ऀ㈀  㤀ⴀ ㌀ⴀ㈀㜀ऀ㄀㔀㨀㔀㄀㨀  ऀ唀䰀吀ⴀ䐀㈀倀䬀ऀ ⸀㔀ऀ㄀ऀ
 ⸀ ㄀ऀ㈀  㤀ⴀ ㌀ⴀ㈀㠀ऀ㄀㌀㨀㌀㘀㨀㈀㐀ഀ਀”]

I figured it out and fixed it by doing:

Iconv.iconv("UTF8", "UCS-2BE", row)

Which works on both environments. I fixed it by reading about encoding
and trial and error, so I’m left with a working solution, but not
knowing why it works on the Mac but not in Linux. Could someone please
explain?

Thanks,
Ahmed


#2

Hi,

At Thu, 23 Apr 2009 10:10:45 +0900,
Daly wrote in [ruby-talk:334732]:

I figured it out and fixed it by doing:

Iconv.iconv("UTF8", "UCS-2BE", row)

Which works on both environments. I fixed it by reading about encoding
and trial and error, so I’m left with a working solution, but not
knowing why it works on the Mac but not in Linux. Could someone please
explain?

It seems to be a bug in Ubuntu's iconv. According to the Unicode
Consortium, UCS-2 defaults to big endian when no BOM is present.

Specify the endianness explicitly anyway. Ruby 1.9 does not provide
UCS encoding names without an endianness suffix.
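
The effect of byte order can be reproduced without Iconv. The following is a minimal sketch, assuming Ruby 1.9 or later and its String#force_encoding and String#encode methods. It uses the names UTF-16BE and UTF-16LE, because for characters like these (all in the Basic Multilingual Plane) UCS-2 and UTF-16 are byte-for-byte identical. The helper decode_ucs2 is hypothetical and only illustrates the "big endian unless a BOM says otherwise" rule described above.

```ruby
# First eight bytes of the row from the original post: "123\t" as 16-bit units.
bytes = [0x00, 0x31, 0x00, 0x32, 0x00, 0x33, 0x00, 0x09].pack("C*")

# The same bytes read with each byte order.
big    = bytes.dup.force_encoding("UTF-16BE").encode("UTF-8")
little = bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")

p big                                         # "123\t"
p little.codepoints.map { |c| "U+%04X" % c }  # U+3100, U+3200, U+3300, U+0900

# Hypothetical helper: honour a leading BOM if there is one,
# otherwise assume big endian.
def decode_ucs2(data)
  data = data.b                 # work on a binary copy of the input
  head = data.byteslice(0, 2)
  if head == "\xFE\xFF".b
    encoding, body = "UTF-16BE", data.byteslice(2..-1)
  elsif head == "\xFF\xFE".b
    encoding, body = "UTF-16LE", data.byteslice(2..-1)
  else
    encoding, body = "UTF-16BE", data
  end
  body.force_encoding(encoding).encode("UTF-8")
end

p decode_ucs2(bytes)            # "123\t"
```

Reading the bytes as little endian swaps each pair, which yields the same kind of CJK-looking output that the Linux box produced.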


#3

On Apr 22, 2009, at 8:10 PM, Daly wrote:

I fixed it by reading about encoding
and trial and error, so I’m left with a working solution, but not
knowing why it works on the Mac but not in Linux. Could someone please
explain?

I’ve been trying to cover character encodings with a heavy Ruby
slant on my blog for just this reason:

http://blog.grayproductions.net/articles/understanding_m17n

My coverage is about 98% complete now, in case you want to browse a bit.

Of course, in this case it just seems that you hit a bug as Nobu said.

James Edward G. II


#4

Hi James,

75% of my research was done on your blog :)

Thanks for your answers.


#5

On Thu, 2009-04-23 at 11:37 +0900, James G. wrote:

http://blog.grayproductions.net/articles/understanding_m17n

My coverage is about 98% complete now, in case you want to browse a bit.

Hi James,

I was wondering if you wanted to add a note on how to deal with
potentially unknown character encodings. This was one of the more
annoying problems that I hit when trying to use Iconv directly in Ruby
1.8. In the past, I’d used the Mozilla character detection library (in
Java) when doing processing of XML in order to ensure that the file
encoding matched the declaration in the <?xml > header.

Fortunately, I recently found the rchardet gem, which is a port of this
library to Ruby, and it has helped me give Iconv more appropriate
encoding information.

Usage goes something like this:

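require 'rubygems'
require 'rchardet'
require 'iconv'

# text holds the raw bytes read from the file
cd = CharDet.detect(text)
encoding = cd['encoding']
puts "Reading detected encoding '#{encoding}' text with confidence: %.2f%%" % [cd['confidence'] * 100]
iconv = Iconv.new("UTF-8", encoding)
utf8_text = iconv.iconv(text)   # raises if the bytes do not fit the detected encoding
puts "Conversion to UTF-8 successful."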

This time, I needed this sort of thing when trying to ensure I could
load arbitrary text files from unknown sources into GTK+ widgets.
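
Detection is a guess, so it may help to fall back to a default when the confidence is low or nothing is detected. Here is a minimal sketch of that idea. The helper pick_encoding, the 0.5 threshold, and the UTF-8 fallback are all assumptions made for illustration, not part of rchardet. The hashes are hand-built stand-ins shaped like the result used in the snippet above ('encoding' and 'confidence' keys), so the example runs without the gem.

```ruby
# Hypothetical helper: choose a source encoding from a detection result.
# The threshold and fallback values are arbitrary assumptions.
def pick_encoding(detection, fallback = "UTF-8", min_confidence = 0.5)
  name = detection && detection['encoding']
  confidence = detection ? detection['confidence'].to_f : 0.0
  (name && confidence >= min_confidence) ? name : fallback
end

# Hand-built results standing in for CharDet.detect(text).
confident = { 'encoding' => 'ISO-8859-1',   'confidence' => 0.93 }
unsure    = { 'encoding' => 'windows-1252', 'confidence' => 0.21 }
nothing   = { 'encoding' => nil,            'confidence' => 0.0 }

puts pick_encoding(confident)   # ISO-8859-1
puts pick_encoding(unsure)      # UTF-8
puts pick_encoding(nothing)     # UTF-8
```

The chosen name can then be passed to Iconv.new in place of the raw detection result.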

I’ve actually rarely been in the case where I knew the encoding of the
input I was trying to deal with if it wasn’t the same as the system
default, but maybe that’s just me… ;)

The referenced blog post looks really good. Thanks for your efforts.

Cheers,

ast


#6

On Apr 24, 2009, at 6:05 AM, Andrew S. Townley wrote:

I was wondering if you wanted to add a note on how to deal with
potentially unknown character encodings.

Thanks for the information. I just added a comment with a link to
your email here:

http://blog.grayproductions.net/articles/general_encoding_strategies

James Edward G. II