Howdy. I'm working with Iconv and discovered that this code
require "iconv"
["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
  puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
end
generates different results depending on how the code is loaded.
"ægis" and "straße" convert fine no matter how they're called but
"kierkegård" and "joão" only convert correctly when loaded or typed or
pasted within an open instance of irb. Sadly, poor "bjørk" never
manages to get correctly converted. At the bottom of the email, I have
pasted the results of the various methods I've tried to run this code.
Note: The last attempt I show below apparently _does_ work for a
friend of mine using Fedora 7. Just more weirdness? I dunno. My mind's
reeling. Time for a break. ;)
Thanks in advance,
RSL
Here's the various code runs...
rsl@sneaky ~ > irb
irb(main):001:0> load "omg.rb"
ægis => aegis
straße => strasse
kierkegård => kierkegard
joão => joao
bjørk => bj?rk
rsl@sneaky ~ > ./omg.rb
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk
rsl@sneaky ~ > ruby omg.rb
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk
rsl@sneaky ~ > irb omg.rb
omg.rb(main):001:0> #!/usr/bin/env ruby
omg.rb(main):002:0* require "iconv"
=> true
omg.rb(main):003:0> ["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
omg.rb(main):004:1* puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
omg.rb(main):005:1> end
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk
=> ["\303\246gis", "stra\303\237e", "kierkeg\303\245rd",
"jo\303\243o", "bj\303\270rk"]
omg.rb(main):006:0> exit
rsl@sneaky ~ > cat omg.rb | irb
#!/usr/bin/env ruby
require "iconv"
true
["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
end
ægis => aegis
straße => strasse
kierkegård => kierkeg?rd
joão => jo?o
bjørk => bj?rk
["\303\246gis", "stra\303\237e", "kierkeg\303\245rd", "jo\303\243o",
"bj\303\270rk"]
exit
on 03.09.2007 00:41
on 03.09.2007 13:29
I just realized that I didn't really ask a question there, did I? Woops. I'd like to know what I'm doing wrong or perhaps just how to do it right so that I get the same results. Do I need to manually include another library that somehow isn't getting included in the other ways I've tried? I'd really like to be able to count on my Iconv.iconv code doing what I need all the time but it seems I can't at the moment. :(

Here's hoping someone can help me solve this puzzle.

RSL
on 03.09.2007 20:29
> RSL ___ wrote:
> I just realized that I didn't really ask a question there, did I?
> Woops. I'd like to know what I'm doing wrong or perhaps just how to do
> it right so that I get the same results. Do I need to manually include
> another library that somehow isn't getting included in the other ways
> I've tried? I'd really like to be able to count on my Iconv.iconv code
> doing what I need all the time but it seems I can't at the moment. :(
>
> Here's hoping someone can help me solve this puzzle.
>
> RSL

Converting from utf-8 to iso-8859-1 seems to work though (source code file is encoded in UTF-8 and character set encoding of Terminal.app is set to ISO Latin 1, i.e. iso-8859-1).

...
p "#{word} => #{Iconv.iconv('iso-8859-1//translit', 'utf-8', word)}"
...

=>

irb(main):020:0> load "omg.rb"
"\303\246gis => \346gis"
"stra\303\237e => stra\337e"
"kierkeg\303\245rd => kierkeg\345rd"
"jo\303\243o => jo\343o"
"bj\303\270rk => bj\370rk"
=> true

Cheers
j.k.
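The iso-8859-1 result above works because each of these characters has its own code point in Latin-1, so nothing has to be transliterated. A minimal sketch of the same contrast, assuming a newer Ruby (1.9 or later) where String#encode is built in; it does not use Iconv, and the helper names `latin1_bytes` and `ascii_or_question_mark` are made up for illustration:

```ruby
# Sketch only: uses Ruby's built-in String#encode, not Iconv.
# Both helper names are hypothetical.

# Returns the Latin-1 byte values of a UTF-8 word.
def latin1_bytes(word)
  word.encode("ISO-8859-1").bytes
end

# Converts to ASCII without transliteration; characters that have
# no ASCII code point are replaced with "?".
def ascii_or_question_mark(word)
  word.encode("US-ASCII", undef: :replace)
end

["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
  puts "#{word} => #{latin1_bytes(word).inspect} / #{ascii_or_question_mark(word)}"
end
```

For "bjørk" the Latin-1 bytes include 248, which is the \370 (octal) shown in the irb output above.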
on 03.09.2007 21:57
[Russell Norris <rsl@swimcommunity.org>, 2007-09-03 00.40 CEST]
[...]
> Howdy. I'm working with Iconv and discovered that this code
>
> require "iconv"
> ["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
> puts "#{word} => #{Iconv.iconv('ascii//translit', 'utf-8', word)}"
> end
[...]
> rsl@sneaky ~ > ./omg.rb
> ægis => aegis
> straße => strasse
> kierkegård => kierkeg?rd
> joão => jo?o
> bjørk => bj?rk

Hi, Russell.

Apparently the ASCII transliteration rules are defined in locale data files, and not all locales define all of them (and some that do define them do so differently). The resolution of this bug report explains the situation a little more:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=376272

Good luck.
on 04.09.2007 00:49
Carlos wrote:
>> ægis => aegis
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=376272

Also I get different results depending on platform:

Ubuntu/iconv (GNU libc) 2.3.6:
ruby -riconv -e 'puts Iconv.iconv("US-ASCII//TRANSLIT", "UTF-8", "caff\303\250")'
=> caff?

FreeBSD/iconv (GNU libiconv 1.9)
ruby -riconv -e 'puts Iconv.iconv("US-ASCII//TRANSLIT", "UTF-8", "caff\303\250")'
=> caff`e

This doesn't explain why you have different results in irb and ruby, but it does show how unreliable Iconv translit can be.

IMHO you'd be better off using the Unicode gem if you want to decompose UTF8 strings.

Daniel
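A rough sketch of the decomposition idea, for anyone reading this later. It does not use the Unicode gem; it assumes a newer Ruby (2.2 or later), where String#unicode_normalize is built in, so treat it as a swapped-in technique rather than the gem's API. The names `ascii_fold` and `EXTRA_MAP` are made up for illustration, and the mapping table is only an assumption about what output is wanted, since letters like "æ", "ß" and "ø" have no canonical decomposition.

```ruby
# Sketch only: locale-independent ASCII folding via Unicode decomposition.
# EXTRA_MAP and ascii_fold are hypothetical names, not part of any library.

# Letters with no canonical decomposition need an explicit mapping.
EXTRA_MAP = {
  "æ" => "ae", "Æ" => "AE",
  "ß" => "ss",
  "ø" => "o",  "Ø" => "O"
}.freeze

def ascii_fold(word)
  # NFD splits e.g. "å" into "a" plus a combining ring above.
  decomposed = word.unicode_normalize(:nfd)
  # Drop the combining marks (Unicode category Mn).
  stripped = decomposed.gsub(/\p{Mn}/, "")
  # Map whatever is still non-ASCII through the table, or fall back to "?".
  stripped.gsub(/[^[:ascii:]]/) { |ch| EXTRA_MAP.fetch(ch, "?") }
end

["ægis", "straße", "kierkegård", "joão", "bjørk"].each do |word|
  puts "#{word} => #{ascii_fold(word)}"
end
```

Because nothing here consults the C library's locale data, the result should not depend on whether the script is run through irb or ruby. Any character missing from the table still comes out as "?", so the table would need extending for other alphabets.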