Multibyte Character References

Hi,

I have a load of records in my database which were imported by
processing a YAML file. The original YAML files were created by calling
‘to_yaml’ on an array of Hash objects.

The YAML file contains multibyte character references such as:

…and between them and today\xE2\x80\x99s College. The scope, r…

When I imported this data into my DB these character references have
changed but are still there in the DB:

…and between them and today\342\200\231s College. The scope, r…
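
For comparison, the two escaped forms denote the same three bytes, written first in hex and then in octal. Those bytes are the UTF-8 encoding of U+2019, the right single quotation mark. A minimal Ruby sketch, assuming the samples above are escaped displays of raw bytes rather than literal backslash text (the variable names are only for illustration):

```ruby
# Hex and octal escapes for the same three bytes.
hex   = "\xE2\x80\x99"
octal = "\342\200\231"

# Compare the raw bytes; both are [226, 128, 153].
same_bytes = hex.bytes == octal.bytes

# Decode the bytes as UTF-8 and take the single code point.
codepoint = hex.dup.force_encoding('UTF-8').unpack('U*').first

puts same_bytes                   # true
puts format('U+%04X', codepoint)  # U+2019
```

If that assumption holds, the bytes stored in the DB are the same bytes that were in the YAML file, only displayed with a different escape notation.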

So I have two questions:

  1. Are the original characters retrievable from the copy in the DB, or
    have they been mangled?

  2. If they are retrievable, then how?

Really appreciate any help on this one. Many thanks in advance.

~ Mark

What’s the encoding in the YAML file (presumably UTF-8), what database
are you using and what encoding is your database/table set to?


Michael W.

Hi Michael,

The DB is ‘ISO Latin 1 (latin1)’ encoding.

I’m not sure about the original YAML file (do you know the default
encoding for .to_yaml?) - but when I open it directly with, say
TextMate, it shows the character reference not the actual character.

Thanks,

~ Mark

MySQL, if that’s what you are using, lets you set the character
encoding at various different levels (server, database, table, column).
If you are using MySQL you could try something like an ALTER TABLE to
change the encoding to UTF-8 (which I’m guessing is what the original
YAML data is in). You might have to export the data and import it into a
table that’s already set to UTF-8, though, in which case if you still
have all the YAML data around it might be easier just to reload that
with the table set to the proper encoding.

http://dev.mysql.com/doc/refman/5.0/en/charset.html


Michael W.

Thanks for the help Michael.

Eventually, I managed to sort this without having to reimport. Thought
I’d post how in case somebody else gets stuck in a similar way.

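require 'iconv'

# Transliterate each record's UTF-8 content to ISO-8859-1 in place.
Model.find_all.each do |m|
  # Iconv.iconv returns an array of strings, so take the first element.
  content = Iconv.iconv('ISO-8859-1//TRANSLIT', 'UTF-8', m.content).first
  m.update_attribute(:content, content)
end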

This translated the UTF-8 encoded chars into ISO-8859-1 encoded
equivalents. They are not the exact characters obviously, but close
approximations.
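
Iconv was removed from the Ruby standard library in 2.0, so on a newer Ruby the same kind of approximation could be sketched with String#encode and its :fallback option. This is only an illustration, not what was run above: the APPROXIMATIONS table and the to_latin1 helper are hypothetical names, and the table covers just a few punctuation characters.

```ruby
# Hypothetical mapping from common UTF-8 punctuation to Latin-1 stand-ins.
APPROXIMATIONS = {
  "\u2018" => "'", "\u2019" => "'",
  "\u201C" => '"', "\u201D" => '"',
  "\u2013" => '-', "\u2014" => '-'
}.freeze

# Convert UTF-8 text to ISO-8859-1. Characters that Latin-1 cannot
# represent are looked up in the table, or replaced with '?'.
def to_latin1(text)
  text.encode('ISO-8859-1', 'UTF-8',
              fallback: ->(char) { APPROXIMATIONS.fetch(char, '?') })
end

converted = to_latin1("today\u2019s College")
puts converted.encoding  # ISO-8859-1
```

Unlike //TRANSLIT, whose substitutions depend on the iconv implementation, an explicit table makes the approximations predictable, at the cost of having to list them.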

~ Mark