The dreaded Unicode issue


#1

Hi all,

I’m trying to store the following in a mysql DB but am having no luck:

  1. Pär
  2. Björn
  3. gösta
    (hopefully the encodings came out correct in this message)

I’ve looked into Ruby’s Unicode library and have tried things such as:

Unicode.normalize_D(nickname)
Unicode.normalize_KD(nickname)
Unicode.normalize_C(nickname)
Unicode.normalize_KC(nickname)

All those methods either seem to strip out the European characters or
replace them with their English equivalents. All I want to do is store
the
text in its original form.

If I try to store the string as it currently is, I get question marks
showing up in all my views where the European characters are.

Does anyone have any ideas on how to accomplish such a thing?

Thanks!
-Dave H.


#2

I’m trying to store the following in a mysql DB but am having no luck:

  1. Pär
  2. Björn
  3. gösta
    <…>
    If I try to store the string as it currently is, I get question marks
    showing up in all my views where the European characters are.
    Does anyone have any ideas on how to accomplish such a thing?

It may well be, that your characters are ok in DB, but there is
different charset specified in
your HTTP headers. Check what headers server is sending (you can do it
with the help of LiveHTTPHeaders or Firebug extensions for Firefox).
Or the simplest test - check out, what
menu item is checked on View-Character Encoding menu in your browser.
If UTF is not checked, try to check it and see if characters are OK.
If yes, that means you have to fix HTTP headers. If no, then problems
may indeed be in the database.

Regards,
Rimantas

http://rimantas.com/


#3

Thanks Rimantas,

I checked out the headers and they are sending:
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*,q=0.7

My database is using ISO Latin 1.

I failed to mention before that I’m loading a ruby script of users in a
text
file and executing it from the terminal.

If I were to enter Unicode characters through a form in my Rails view
then
everything works fine. “Bjorn” is properly saved as “Bj\303\266rn”. It’s
just loading the text file and looping through the records that seems to
be
the problem.

The important parts of the script:

$KCODE = ‘u’
require ‘jcode’
require ‘unicode’

arr = IO.readlines(member_file) # member_file is actually a text file
member = Member.new(nickname)
member.save

That obviously happens within a loop. I just wanted to uber-simplify it.

Still kind of confused…

-Dave


#4

On 1/2/07, Dave H. removed_email_address@domain.invalid wrote:

Thanks Rimantas,

My database is using ISO Latin 1.

You might want to change that to UTF-8. Your accented unicode characters
(western european) use a single byte under latin-1 and 2 under utf-8.

Every link in the unicode chain (the round-trip) has to be fixed.


#5

I checked out the headers and they are sending:
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*,q=0.7

These are sent by browser, how does server response look like?
Look for the ‘Content-type’ header.

My database is using ISO Latin 1.
<…>

This may or may not be the problem. It is possible (but not advisable)
to keep utf data in
database tables with Latin 1 encoding.
However it would be better to convert your tables (and data) to UTF,
then make sure that you
have line ‘encoding: utf8’ in your config/database.yml

This part is confusing, so I’ll try to describe some scenarios:
a) Database’s encoding is set as Latin 1 (default) and there is no
encoding line in config/database.yml. This will work, at least data
can be saved/retrieved and displayed correctly
in the browser - given the HTTP headers are right.

b) Database’s encoding is Latin 1, but database.yml sets encoding to
UTF. This will not work, because that line makes database client to
issue “SET NAMES UTF8” when connecting to the database. In this case
MySQL will try to convert data which it thinks is in Latin 1 to UTF.
Alas, data is already UTF, but wrong encoding was specified–what you
get is useless conversion and character junk.

c) Like b), but in this case database is set to UTF, and data is in
UTF, but there is no encoding line in database.yml. In this case
connection’s encoding will be Latin 1 and MySQL will try to convert
data from UTF to Latin 1. This will mess things, assuming HTTP
headers/meta tag says that encoding is UTF.

So what would I do in your situation:

  1. Try to change view-encoding to UTF-8 in the browser. If that fixes
    how characters are displayed, then, most likely, HTTP headers are
    messed up.
  2. Check ‘Content-type’ header of the server response. If it says
    anything else than UTF - that must be fixed.
  3. When headers are ok and page is displayed the way it should -
    convert tables and data in DB to UTF and then add encoding: utf8 to
    config/database.yml
  4. Enjoy.

Of course other scenarios are likely too, which would require
additional exploration.

Regards,
Rimantas

http://rimantas.com/


#6

On Jan 2, 2007, at 3:34 PM, Dave H. wrote:

Look for the ‘Content-type’ header.
have line ‘encoding: utf8’ in your config/database.yml
MySQL will try to convert data which it thinks is in Latin 1 to UTF.

  1. Try to change view-encoding to UTF-8 in the browser. If that fixes
    additional exploration.

Regards,
Rimantas

http://rimantas.com/

This response is someone late, but if you’re using migrations, here’s
one to put your MySql tables and database to UTF8:

class UseUtf8AsDefaultCharset < ActiveRecord::Migration
def self.my_number_is
say File.basename(FILE).to_i
end

def self.database_charset(charset=‘utf8’)
execute “ALTER DATABASE #{current_database} DEFAULT CHARACTER
SET #{charset}”
end

def self.table_charset(table, charset=‘utf8’)
execute “ALTER IGNORE TABLE #{table} CONVERT TO CHARACTER SET #
{charset}, DEFAULT CHARACTER SET #{charset}”
end

def self.up
my_number_is
database_charset(‘utf8’)
select_values(“SHOW TABLES”).each do |table|
table_charset(table, ‘utf8’)
end
end

def self.down
my_number_is
database_charset(‘latin1’)
select_values(“SHOW TABLES”).each do |table|
table_charset(table, ‘latin1’)
end
end
end

Note that the down migration puts the tables back to their “original”
character set which, in my case, was latin1.

-Rob

Rob B. http://agileconsultingllc.com
removed_email_address@domain.invalid


#7

Thanks Rob! That will come in handy.

-Dave


#8

Thank you very much Rimantas for that explanation. That was just what I
needed. I will try your suggestions.

-Dave