Inconsistent results using Iconv

Posted by RSL ___ (rsl)
on 03.09.2007 00:41
(Received via mailing list)
Howdy. I'm working with Iconv and discovered that this code

  require "iconv"
  ["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
    puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
  end

generates different results depending on how the code is loaded.
"ægis" and "straße" convert fine no matter how the code is run, but
"kierkegård" and "joão" only convert correctly when loaded, typed, or
pasted within an open instance of irb. Sadly, poor "bjørk" never
manages to get converted correctly. At the bottom of this email I've
pasted the results of the various ways I've tried running this code.

Note: The last attempt I show below apparently _does_ work for a
friend of mine using Fedora 7. Just more weirdness? I dunno. My mind's
reeling. Time for a break. ;)

Thanks in advance,

RSL

Here are the various code runs...

rsl@sneaky ~ > irb
irb(main):001:0> load "omg.rb"
ægis => aegis
straße => strasse
kierkegård => kierkegard
joão => joao
bjørk => bj?rk

rsl@sneaky ~ > ./omg.rb
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk

rsl@sneaky ~ > ruby omg.rb
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk

rsl@sneaky ~ > irb omg.rb
omg.rb(main):001:0> #!/usr/bin/env ruby
omg.rb(main):002:0* require "iconv"
=> true
omg.rb(main):003:0> ["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
omg.rb(main):004:1*   puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
omg.rb(main):005:1> end
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk
=> ["\303\246gis", "stra\303\237e", "kierkeg\303\245rd", "jo\303\243o", "bj\303\270rk"]
omg.rb(main):006:0> exit

rsl@sneaky ~ > cat omg.rb | irb
#!/usr/bin/env ruby
require "iconv"
true
["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
  puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
end
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk
["\303\246gis", "stra\303\237e", "kierkeg\303\245rd", "jo\303\243o", "bj\303\270rk"]
exit
Posted by RSL ___ (rsl)
on 03.09.2007 13:29
(Received via mailing list)
I just realized that I didn't really ask a question there, did I?
Woops. I'd like to know what I'm doing wrong or perhaps just how to do
it right so that I get the same results. Do I need to manually include
another library that somehow isn't getting included in the other ways
I've tried? I'd really like to be able to count on my Iconv.iconv code
doing what I need all the time but it seems I can't at the moment. :(

Here's hoping someone can help me solve this puzzle.

RSL
Posted by Jimmy Kofler (koflerjim)
on 03.09.2007 20:29
> RSL ___ wrote:
> I just realized that I didn't really ask a question there, did I?
> Woops. I'd like to know what I'm doing wrong or perhaps just how to do
> it right so that I get the same results. Do I need to manually include
> another library that somehow isn't getting included in the other ways
> I've tried? I'd really like to be able to count on my Iconv.iconv code
> doing what I need all the time but it seems I can't at the moment. :(
> 
> Here's hoping someone can help me solve this puzzle.
> 
> RSL


Converting from UTF-8 to ISO-8859-1 seems to work, though (the source
file is encoded in UTF-8 and the character set encoding of Terminal.app
is set to ISO Latin 1, i.e. iso-8859-1).

...

p "#{word} => #{Iconv.iconv('iso-8859-1//translit', 'utf-8', word)}"

...


=>
irb(main):020:0> load "omg.rb"
"\303\246gis => \346gis"
"stra\303\237e => stra\337e"
"kierkeg\303\245rd => kierkeg\345rd"
"jo\303\243o => jo\343o"
"bj\303\270rk => bj\370rk"
=> true
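
For anyone reading this on Ruby 1.9 or later (Iconv was removed from the standard library in 2.0), the same UTF-8 to ISO-8859-1 conversion can be sketched with String#encode. This is only an illustration, not the Iconv call used above; the helper name to_latin1 is made up for the example, and it assumes the source file is saved as UTF-8.

```ruby
# A minimal sketch: transcode UTF-8 strings to ISO-8859-1 without Iconv.
# `to_latin1` is a hypothetical helper name, not part of any library.
WORDS = ["ægis", "straße", "kierkegård", "joão", "bjørk"]

def to_latin1(word)
  # String#encode raises Encoding::UndefinedConversionError if a character
  # has no ISO-8859-1 equivalent; all five sample words do have one.
  word.encode("ISO-8859-1")
end

WORDS.each do |word|
  latin1 = to_latin1(word)
  # Print the raw bytes rather than the string itself, so the output does
  # not depend on the terminal's character set.
  puts "#{word.bytes.length} UTF-8 bytes => #{latin1.bytes.inspect}"
end
```

Each accented character shrinks from two bytes to one, which matches the octal escapes in the irb output above (for example \303\270 becoming \370 for "ø").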


Cheers

j.k.
Posted by Carlos (Guest)
on 03.09.2007 21:57
(Received via mailing list)
[Russell Norris <rsl@swimcommunity.org>, 2007-09-03 00.40 CEST]
[...]
> Howdy. I'm working with Iconv and discovered that this code
> 
>   require "iconv"
>   ["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
>     puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
>   end
[...]
> rsl@sneaky ~ > ./omg.rb
> ægis => aegis
> straße => strasse
> kierkegård => kierkeg?rd
> joão => jo?o
> bjørk => bj?rk

Hi, Russell. Apparently the ASCII transliteration rules are defined in
locale data files, and not all locales define all of them (and some that
define them do so differently). The resolution of this bug report
explains the situation a little more:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=376272
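
If the locale really is the deciding factor, one thing worth comparing between the different ways of running the script is which locale the environment asks for. The sketch below only reads environment variables using the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG); it does not tell you whether the process actually called setlocale, and the helper name effective_ctype_locale is made up for this example.

```ruby
# A minimal sketch: report the locale that setlocale(LC_CTYPE, "") would
# select from the environment. `effective_ctype_locale` is a hypothetical
# helper, not a Ruby or Iconv API.
def effective_ctype_locale(env = ENV)
  # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
  # Variables that are unset or empty are skipped.
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return value unless value.nil? || value.empty?
  end
  "C" # POSIX default when none of the variables is set
end

puts "Locale requested by the environment: #{effective_ctype_locale}"
```

Running this from the plain ruby invocation and again from inside irb would at least show whether both see the same environment.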

Good luck.
Posted by Daniel DeLorme (Guest)
on 04.09.2007 00:49
(Received via mailing list)
Carlos wrote:
>> ægis => aegis
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=376272
Also I get different results depending on platform:

Ubuntu/iconv (GNU libc) 2.3.6:
   ruby -riconv -e 'puts Iconv.iconv("US-ASCII//TRANSLIT", "UTF-8", "caff\303\250")'
   => caff?

FreeBSD/iconv (GNU libiconv 1.9):
   ruby -riconv -e 'puts Iconv.iconv("US-ASCII//TRANSLIT", "UTF-8", "caff\303\250")'
   => caff`e

This doesn't explain why you have different results in irb and ruby, but
it does show how unreliable Iconv translit can be. IMHO you'd be better
off using the Unicode gem if you want to decompose UTF-8 strings.
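
To give an idea of what the decomposition approach looks like, here is a rough sketch. Note that it swaps in String#unicode_normalize from the standard library (Ruby 2.2 and later) rather than the Unicode gem, so treat it as an illustration of the technique only. The SPECIAL table and the to_ascii name are assumptions made for the example; characters such as "æ", "ß" and "ø" have no canonical decomposition, so they need an explicit mapping.

```ruby
# A minimal sketch of locale-independent ASCII folding.
# SPECIAL and `to_ascii` are hypothetical names for this example.
# Assumes the source file and the input strings are UTF-8.
SPECIAL = {
  "æ" => "ae", "Æ" => "AE",
  "ß" => "ss",
  "ø" => "o",  "Ø" => "O"
}.freeze

def to_ascii(word)
  # NFD splits a precomposed letter into base letter plus combining mark,
  # e.g. "å" becomes "a" followed by U+030A.
  decomposed = word.unicode_normalize(:nfd)
  # Drop the combining marks (Unicode category Mn).
  stripped = decomposed.gsub(/\p{Mn}/, "")
  # Map whatever non-ASCII characters remain through the table,
  # falling back to "?" for anything the table does not cover.
  stripped.gsub(/[^[:ascii:]]/) { |ch| SPECIAL.fetch(ch, "?") }
end

["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
  puts "#{word} => #{to_ascii(word)}"
end
```

Because nothing here consults the locale, the result should not change between irb and a plain ruby run, though the table would need extending for other alphabets.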

Daniel