Inconsistent results using Iconv

Howdy. I’m working with Iconv and discovered that this code

require "iconv"
["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
  puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
end

generates different results depending on how the code is loaded.
“ægis” and “straße” convert fine no matter how they’re called, but
“kierkegård” and “joão” only convert correctly when loaded, typed, or
pasted within an open instance of irb. Sadly, poor “bjørk” never
gets converted correctly. At the bottom of the email, I have
pasted the results of the various ways I’ve tried running this code.

Note: The last attempt I show below apparently does work for a
friend of mine using Fedora 7. Just more weirdness? I dunno. My mind’s
reeling. Time for a break. ;)

Thanks in advance,

RSL

Here are the various code runs:

rsl@sneaky ~ > irb
irb(main):001:0> load "omg.rb"
ægis => aegis
straße => strasse
kierkegård => kierkegard
joão => joao
bjørk => bj?rk

rsl@sneaky ~ > ./omg.rb
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk

rsl@sneaky ~ > ruby omg.rb
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk

rsl@sneaky ~ > irb omg.rb
omg.rb(main):001:0> #!/usr/bin/env ruby
omg.rb(main):002:0* require "iconv"
=> true
omg.rb(main):003:0> ["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
omg.rb(main):004:1* puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
omg.rb(main):005:1> end
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk
=> ["\303\246gis", "stra\303\237e", "kierkeg\303\245rd", "jo\303\243o", "bj\303\270rk"]
omg.rb(main):006:0> exit

rsl@sneaky ~ > cat omg.rb | irb
#!/usr/bin/env ruby
require "iconv"
true
["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
end
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk
["\303\246gis", "stra\303\237e", "kierkeg\303\245rd", "jo\303\243o", "bj\303\270rk"]
exit

I just realized that I didn’t really ask a question there, did I?
Whoops. I’d like to know what I’m doing wrong, or perhaps just how to do
it right so that I get the same results every time. Do I need to manually
include another library that somehow isn’t getting included in the other
ways I’ve tried? I’d really like to be able to count on my Iconv.iconv code
doing what I need all the time, but it seems I can’t at the moment. :(

Here’s hoping someone can help me solve this puzzle.

RSL

Converting from utf-8 to iso-8859-1 seems to work though (source code
file is encoded in UTF-8 and character set encoding of Terminal.app is
set to ISO Latin 1, i.e. iso-8859-1).

p "#{word} => #{Iconv.iconv('iso-8859-1//translit', 'utf-8', word)}"

=>
irb(main):020:0> load "omg.rb"
"\303\246gis => \346gis"
"stra\303\237e => stra\337e"
"kierkeg\303\245rd => kierkeg\345rd"
"jo\303\243o => jo\343o"
"bj\303\270rk => bj\370rk"
=> true
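
To make the octal escapes in that output easier to read, here is a minimal sketch of the same UTF-8 to ISO-8859-1 conversion at the byte level. It assumes a newer Ruby (1.9 or later) and uses String#encode in place of Iconv, so it is not the code that produced the output above, only an illustration of what the bytes mean.

```ruby
# Assumption: Ruby 1.9+ with String#encode; this stands in for Iconv here.
word   = "\u00E6gis"                  # "ægis" as a UTF-8 string
latin1 = word.encode("ISO-8859-1")    # one byte per character in Latin-1

# Render each byte as three octal digits, matching the \303\246 style above.
utf8_octal   = word.bytes.map { |b| format("%03o", b) }
latin1_octal = latin1.bytes.map { |b| format("%03o", b) }

puts "UTF-8:      #{utf8_octal.join(' ')}"
puts "ISO-8859-1: #{latin1_octal.join(' ')}"
```

The two-byte sequence 303 246 in UTF-8 becomes the single byte 346 in ISO-8859-1, which is why no transliteration is needed for this target: every character in the test words exists in Latin-1.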

Cheers

j.k.

[Russell N., 2007-09-03 00.40 CEST]
[…]

Howdy. I’m working with Iconv and discovered that this code

require "iconv"
["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
  puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
end
[…]
rsl@sneaky ~ > ./omg.rb
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk

Hi, Russell. Apparently the ASCII transliteration rules are defined in
locale data files, and not all locales define all of them (and some that
define them do so differently). The resolution of this bug report
explains the situation a little more:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=376272

Good luck.
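
Since the rules depend on the locale, it may help to check which locale each way of launching the script actually sees. Below is a minimal diagnostic sketch; requested_ctype_locale is a hypothetical helper name, not part of Iconv or Ruby. It only reports what the environment asks for, using the POSIX precedence of LC_ALL, then LC_CTYPE, then LANG. Whether a given process really activates that locale is a separate matter, and this thread does not confirm that this is the cause of the irb difference.

```ruby
# Hypothetical diagnostic helper: reports which locale the environment
# requests for character handling. POSIX precedence is LC_ALL, then
# LC_CTYPE, then LANG; an unset or empty variable is skipped.
def requested_ctype_locale(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return value unless value.nil? || value.empty?
  end
  "C" # typical default when none of the variables is set
end

puts "Requested character locale: #{requested_ctype_locale}"
```

Adding the puts line to omg.rb and running it each way (ruby omg.rb, ./omg.rb, load from irb) would at least show whether the environment differs between the runs.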

Carlos wrote:

ægis => aegis
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=376272
Also I get different results depending on platform:

Ubuntu/iconv (GNU libc) 2.3.6:
ruby -riconv -e 'puts Iconv.iconv("US-ASCII//TRANSLIT", "UTF-8", "caff\303\250")'
=> caff?

FreeBSD/iconv (GNU libiconv 1.9)
ruby -riconv -e 'puts Iconv.iconv("US-ASCII//TRANSLIT", "UTF-8", "caff\303\250")'
=> caff`e

This doesn’t explain why you get different results in irb and ruby, but
it does show how unreliable Iconv translit can be. IMHO you’d be better
off using the Unicode gem if you want to decompose UTF-8 strings.
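
Here is a rough sketch of the decomposition idea that does not depend on locale data. It assumes a newer Ruby (2.2 or later) and uses the built-in String#unicode_normalize rather than the Unicode gem, so treat it as an illustration of the approach and not as the gem's API. The to_ascii helper and its fallback table are hypothetical names made up for this example, and the table covers only the letters from the test words.

```ruby
# Hypothetical helper; not part of Iconv or the Unicode gem.
# Assumption: Ruby 2.2+ so that String#unicode_normalize is available.
def to_ascii(text, unknown = "?")
  # Letters with no canonical decomposition need an explicit mapping.
  fallbacks = {
    "æ" => "ae", "Æ" => "AE",
    "ß" => "ss",
    "ø" => "o",  "Ø" => "O"
  }

  # NFD splits e.g. "å" into "a" followed by U+030A (combining ring above).
  decomposed = text.unicode_normalize(:nfd)
  # Drop the combining marks, keeping the base letters.
  stripped = decomposed.gsub(/\p{Mn}/, "")
  # Map any remaining non-ASCII character through the table.
  stripped.gsub(/[^\x00-\x7F]/) { |ch| fallbacks.fetch(ch, unknown) }
end

%w[ægis straße kierkegård joão bjørk].each do |word|
  puts "#{word} => #{to_ascii(word)}"
end
```

Because nothing here consults the locale, the result should be the same however the script is loaded. The trade-off is that the fallback table has to be maintained by hand for characters such as "ø" that do not decompose.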

Daniel