Forum: Ruby Unicode Question

Daly (Guest)
on 2009-04-23 03:11
(Received via mailing list)
Hello all,

I downloaded a file from a website and one of the lines looks like this:

row = "\0001\0002\0003\000\t\0002\0004\0008\0004\0005\000\t\000C\000o
\000m\000p\000U\000S\000A\000\t\000W
\0006\0003\0001\0000\0003\0000\0006\0000\0001\0000\0001\000\t
\0002\0000\0000\0009\000-\0000\0003\000-\0002\0007\000\t
\0001\0005\000:\0005\0001\000:\0000\0000\000\t\000U\000L\000T\000-\000D
\0002\000P\000K\000\t\0000\000.\0005\000\t\0001\000\t
\0000\000.\0000\0001\000\t\0002\0000\0000\0009\000-\0000\0003\000-
\0002\0008\000\t\0001\0003\000:\0003\0006\000:\0002\0004\000\r\000\n"

On my Mac, if I do:

Iconv.iconv("UTF8", "UCS-2", row)

I get:

["123\t24845\tCompUSA\tW63103060101\t2009-03-27\t15:51:00\tULT-D2PK
\t0.5\t1\t0.01\t2009-03-28\t13:36:24\r\n"]

Which is exactly right. On the production Linux box (Ubuntu 8.04),
doing the same thing yields:

 ["㄀㈀㌀ऀ㈀㐀㠀㐀㔀ऀ䌀漀洀瀀唀匀䄀ऀ圀㘀㌀㄀ ㌀ 㘀 ㄀ ㄀ऀ㈀  㤀ⴀ ㌀ⴀ㈀㜀ऀ㄀㔀㨀㔀㄀㨀  ऀ唀䰀吀ⴀ䐀㈀倀䬀ऀ ⸀㔀ऀ㄀ऀ
 ⸀ ㄀ऀ㈀  㤀ⴀ ㌀ⴀ㈀㠀ऀ㄀㌀㨀㌀㘀㨀㈀㐀ഀ਀"]

I figured it out and fixed it by doing:

Iconv.iconv("UTF8", "UCS-2BE", row)

Which works on both environments. I fixed it by reading about encoding
and trial and error, so I'm left with a working solution, but not
knowing why it works on the Mac but not in Linux. Could someone please
explain?

Thanks,
Ahmed
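
The two results differ only in the byte order assumed for each 16-bit unit. A minimal sketch, assuming Ruby 1.9 or later (where `String#force_encoding` and `String#encode` exist; the thread itself uses Iconv on 1.8), decodes the first six bytes of `row` both ways. The variable names are made up for illustration.

```ruby
# First three characters of the row: a zero byte followed by "1", "2", "3".
bytes = [0x00, 0x31, 0x00, 0x32, 0x00, 0x33].pack("C*")

# Read as big-endian, the pair 00 31 is the code point U+0031 ("1").
big = bytes.dup.force_encoding("UTF-16BE").encode("UTF-8")

# Read as little-endian, the same pair becomes U+3100 instead.
little = bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")

puts big                                                     # 123
puts little.codepoints.map { |cp| "U+%04X" % cp }.join(" ")  # U+3100 U+3200 U+3300
```

The little-endian code points match the first characters of the garbled Linux output, which suggests that iconv build read the input in little-endian order.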
Nobuyoshi Nakada (nobu)
on 2009-04-23 03:44
(Received via mailing list)
Hi,

At Thu, 23 Apr 2009 10:10:45 +0900,
Daly wrote in [ruby-talk:334732]:
> I figured it out and fixed it by doing:
>
> Iconv.iconv("UTF8", "UCS-2BE", row)
>
> Which works on both environments. I fixed it by reading about encoding
> and trial and error, so I'm left with a working solution, but not
> knowing why it works on the Mac but not in Linux. Could someone please
> explain?

It seems to be a bug in Ubuntu's iconv.  According to the Unicode
Consortium, UCS-2 defaults to big endian if no BOM is present.

Specify the endianness explicitly anyway.  Ruby 1.9 does not provide
UCS names without an endianness.
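
The rule described here (honour a BOM if there is one, otherwise assume big endian) can be sketched in Ruby 1.9 or later without Iconv. This is only an illustration under those assumptions: `decode_utf16` is a hypothetical helper name, not something Ruby or Iconv provides.

```ruby
# Decode UCS-2/UTF-16 bytes to UTF-8. A leading BOM selects the byte
# order; without one, big-endian is assumed.
# NOTE: decode_utf16 is a hypothetical helper written for this example.
def decode_utf16(bytes)
  bom_be = [0xFE, 0xFF].pack("C*")
  bom_le = [0xFF, 0xFE].pack("C*")

  data = bytes.dup.force_encoding("BINARY")
  case data[0, 2]
  when bom_be then source, data = "UTF-16BE", data[2..-1]
  when bom_le then source, data = "UTF-16LE", data[2..-1]
  else source = "UTF-16BE"
  end

  data.force_encoding(source).encode("UTF-8")
end

no_bom      = [0x00, 0x31, 0x00, 0x09].pack("C*")              # "1\t", big-endian
with_le_bom = [0xFF, 0xFE, 0x31, 0x00, 0x09, 0x00].pack("C*")  # "1\t", little-endian

puts decode_utf16(no_bom).inspect
puts decode_utf16(with_le_bom).inspect
```

Naming the byte order in the encoding string, as in "UCS-2BE" or "UTF-16BE", avoids depending on whatever default a particular iconv build picks.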
James Gray (bbazzarrakk)
on 2009-04-23 04:38
(Received via mailing list)
On Apr 22, 2009, at 8:10 PM, Daly wrote:

> I fixed it by reading about encoding
> and trial and error, so I'm left with a working solution, but not
> knowing why it works on the Mac but not in Linux. Could someone please
> explain?

I've been trying to cover character encodings with a heavy Ruby
slant on my blog for just this reason:

http://blog.grayproductions.net/articles/understanding_m17n

My coverage is about 98% complete now, in case you want to browse a bit.

Of course, in this case it just seems that you hit a bug as Nobu said.

James Edward Gray II
Ahmed El-Daly (Guest)
on 2009-04-23 15:16
(Received via mailing list)
Hi James,

75% of my research was done on your blog :)

Thanks for your answers.
Andrew S. Townley (Guest)
on 2009-04-24 13:06
(Received via mailing list)
On Thu, 2009-04-23 at 11:37 +0900, James Gray wrote:
> http://blog.grayproductions.net/articles/understanding_m17n
>
> My coverage is about 98% complete now, in case you want to browse a bit.

Hi James,

I was wondering if you wanted to add a note on how to deal with
potentially unknown character encodings.  This was one of the more
annoying problems that I hit when trying to use Iconv directly on Ruby
1.8.  In the past, I'd used the Mozilla character detection library (in
Java) when doing processing of XML in order to ensure that the file
encoding matched the declaration in the <?xml > header.

Fortunately, I recently found the rchardet gem which is the port of this
library to Ruby, and it's helped me deal with giving more appropriate
encoding information to Iconv.

Usage goes something like this:

    require 'rubygems'
    require 'rchardet'
    require 'iconv'

    # Guess the encoding of the raw text, then build a converter for it.
    cd = CharDet.detect(text)
    encoding = cd['encoding']
    puts "Reading detected encoding '%s' text with confidence: %.2f%%" %
         [encoding, cd['confidence'] * 100]
    iconv = Iconv.new("UTF-8", encoding)
    puts "Conversion to UTF-8 successful."

This time, I needed this sort of thing when trying to ensure I could
load arbitrary text files from unknown sources into GTK+ widgets.

I've actually rarely been in the case where I knew the encoding of the
input I was trying to deal with if it wasn't the same as the system
default, but maybe that's just me... ;)

The referenced blog post looks really good.  Thanks for your efforts.

Cheers,

ast
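
For readers without rchardet installed, a much cruder fallback can be sketched with the standard library alone on Ruby 1.9 or later. This is not what rchardet does (it uses statistical detection); it only tries candidate encodings in order and keeps the first one in which the bytes are valid. The helper name `guess_encoding` and the candidate list are assumptions made up for this example.

```ruby
# Return the name of the first candidate encoding in which the bytes
# form a valid string, or nil if none match.
# NOTE: guess_encoding is a hypothetical helper, and the candidate list
# is only an example. ISO-8859-1 accepts every byte, so it acts as a
# catch-all and must come last.
def guess_encoding(bytes, candidates = %w[UTF-8 ISO-8859-1])
  candidates.find do |name|
    bytes.dup.force_encoding(name).valid_encoding?
  end
end

utf8_bytes   = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack("C*")  # "café" in UTF-8
latin1_bytes = [0x63, 0x61, 0x66, 0xE9].pack("C*")        # "café" in ISO-8859-1

[utf8_bytes, latin1_bytes].each do |bytes|
  name = guess_encoding(bytes)
  text = bytes.dup.force_encoding(name).encode("UTF-8")
  puts "#{name}: #{text}"
end
```

A validity check like this can only rule encodings out, so it will happily mislabel text from any other single-byte encoding as ISO-8859-1. A real detector such as rchardet is the better choice when the source is truly unknown.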
James Gray (bbazzarrakk)
on 2009-04-24 16:41
(Received via mailing list)
On Apr 24, 2009, at 6:05 AM, Andrew S. Townley wrote:

> I was wondering if you wanted to add a note on how to deal with
> potentially unknown character encodings.

Thanks for the information.  I just added a comment with a link to
your email here:

http://blog.grayproductions.net/articles/general_e...

James Edward Gray II