Thanks v much for the advice. Thought I’d start with looking at what’s
already in the database using unpack.
- that application, the MySQL query tool, is not UTF-8 aware. So, it
interprets the 2 bytes of “ł” (197, 130) as 2 characters in some single-byte
encoding (probably latin-1), which gives “Å” and an unprintable character.
Your test line wasn’t UTF-8 encoded at all.
Yeah, for another db on the server it works fine, so I’m guessing it’s
your 2nd option. Your explanation of the 2 bytes solves another
question I had though
- The application is UTF-8 aware, the test line is in UTF-8, but the data
from your web pages was already in UTF-8 and you thought it wasn’t and
encoded it again to UTF-8.
To test if a string is encoded in UTF-8, just examine its bytes
and see if the diacritic letters are encoded with 2 or more bytes (UTF-8),
or only one (iso-8859-, cp, etc.). (If you see four then you encoded
them twice :).
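
A rough sketch of that byte check in Ruby, assuming a Ruby new enough to have String#force_encoding and String#b (not what I had when I started). The helper names classify_bytes and double_encode are made up for illustration. Note that a double-encoded string is still valid UTF-8, so the validity check alone can't catch it; you have to look at the byte counts.

```ruby
# Classify a raw byte string as :ascii, :utf8 or :not_utf8.
def classify_bytes(raw)
  s = raw.dup.force_encoding(Encoding::UTF_8)
  return :not_utf8 unless s.valid_encoding?
  s.ascii_only? ? :ascii : :utf8
end

# Hypothetical helper: simulate "it was already UTF-8 and got encoded again"
# by treating the UTF-8 bytes as Latin-1 and converting them to UTF-8.
def double_encode(utf8)
  utf8.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

word   = "Wy\u015Blij"               # "Wyślij", written with an escape
latin2 = word.encode("ISO-8859-2")   # single-byte version of the same text

p word.bytes                 # => [87, 121, 197, 155, 108, 105, 106]  (2 bytes for ś)
p latin2.bytes               # => [87, 121, 182, 108, 105, 106]       (1 byte for ś)
p double_encode(word).bytes  # => [87, 121, 195, 133, 194, 155, 108, 105, 106]  (4 bytes)

p classify_bytes(word.b)     # => :utf8
p classify_bytes(latin2.b)   # => :not_utf8
```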
Here’s a test case
On web page after being loaded from DB: “Wyślij” [This is correct!]
In MySQL Analyser: “WyÅ›lij” [bad, even though MySQL analyser is
In Interactive Ruby (IRB) printed to console, after loading from DB:
“Wy┼ølij” [expected in a DOS prompt!]
In IRB unpacked, after loading from DB: [87, 121, 197, 155, 108, 105,
So, I can see that the character “ś” must correspond to the 3rd and
4th bytes (197, 155) of “Wyślij”.
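
To convince myself the odd displays are just the same bytes read with the wrong code page, here is a small sketch. It assumes the MySQL tool decodes as Windows-1252 and the DOS prompt as CP850; I haven't confirmed either, they are just the code pages that reproduce what I see. misread is a made-up helper name.

```ruby
# Reinterpret the bytes of a UTF-8 string as if they were written in
# another encoding, then return the result as UTF-8 text for display.
def misread(utf8, assumed_encoding)
  utf8.dup.force_encoding(assumed_encoding).encode(Encoding::UTF_8)
end

word = "Wy\u015Blij"   # "Wyślij": ś is stored as the bytes 197, 155

# Assumed code pages, see the note above.
puts misread(word, "Windows-1252")   # prints WyÅ›lij
puts misread(word, "CP850")          # prints Wy┼ølij
```

The bytes never change; only the table used to turn them into glyphs does.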
Looking at the Ruby help, I see I can do this
p str.unpack("U*") to decode the UTF-8 into Unicode code points, which gives:
[87, 121, 347, 108, 105, 106]
According to this,
347 (U+015B) is in fact a “ś”.
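
Putting the two unpack calls side by side, as a minimal sketch. The string is written with a \u escape here so it doesn't depend on how the script file itself is saved; in my real case it comes from the DB.

```ruby
str = "Wy\u015Blij"   # stands in for the value loaded from the DB

bytes      = str.unpack("C*")   # raw bytes, one integer per byte
codepoints = str.unpack("U*")   # UTF-8 decoded, one integer per character

p bytes        # => [87, 121, 197, 155, 108, 105, 106]
p codepoints   # => [87, 121, 347, 108, 105, 106]

# Round trip: packing the code points gives back the same UTF-8 string.
p codepoints.pack("U*") == str   # => true

# 347 on its own packs to the two bytes 197, 155.
p [347].pack("U").bytes          # => [197, 155]
```

Seven bytes but six code points means exactly one character took two bytes, which is what properly encoded (not doubled) UTF-8 should look like for this word.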
This would suggest that the database has UTF-8 text, and it’s getting
into Ruby without corruption! Is this right?
So, the question now is why doesn’t Iconv convert my UTF-8 to Latin2
correctly… That could just be because the original text contains
characters that don’t exist in the Latin2 (ISO-8859-2) set, so they
can’t be converted.
I could probably give Iconv explicit mapping codes for how to handle
certain characters, that may do the trick… I’ll re-read your post and
see if I can find anything else.
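
For what it's worth, here is how the explicit-mapping idea could look. This is not Iconv: it is a sketch using String#encode with its fallback: option, which newer Rubies provide in place of Iconv, so treat it as a swapped-in technique. to_latin2, unconvertible_chars and the mapping table are made-up names for illustration.

```ruby
# Convert UTF-8 text to Latin2 (ISO-8859-2). Characters Latin2 cannot
# represent are looked up in `mapping`; anything not listed becomes "?".
def to_latin2(utf8, mapping = {})
  utf8.encode("ISO-8859-2", fallback: ->(ch) { mapping.fetch(ch, "?") })
end

# List the characters that have no Latin2 equivalent, so I know which
# explicit mappings are needed at all.
def unconvertible_chars(utf8)
  utf8.each_char.select do |ch|
    begin
      ch.encode("ISO-8859-2")
      false
    rescue Encoding::UndefinedConversionError
      true
    end
  end
end

text = "Wy\u015Blij 5 \u20AC"          # "Wyślij 5 €"; the euro sign is not in Latin2
p unconvertible_chars(text)            # only the euro sign is reported
p to_latin2(text, "\u20AC" => "EUR").bytes
```

Running unconvertible_chars over the real data first would show whether the failure really is down to characters outside Latin2.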
Thanks for the help, feels like I’m a few steps forward now!
If you can spot any errors in the above a hint would be most welcome!