Active Record serialized attr losing encoding (YAML issue)

I’m using Rails 2.3.8 with Ruby 1.9.1 and I’m having a problem with
serialized attributes in Active Record not preserving string encodings.
The underlying problem is probably YAML, but I’m wondering if anyone has
any good ideas on how to handle this. The app I’m working on has
numerous serialized fields, some of which contain deep structures of
arrays and hashes. Getting back an ASCII-8BIT string (that’s actually
UTF-8) deep within those structures wreaks havoc later…

This is perhaps best illustrated by example: if I save l to a serialized
attr in an Active Record model, I get back l2 when reading from the database.

l
=> ["English", "Türkçe", "Русский"]

l.map(&:encoding)
=> [#<Encoding:UTF-8>, #<Encoding:UTF-8>, #<Encoding:UTF-8>]

l.map(&:valid_encoding?)
=> [true, true, true]

l.to_yaml
=> "--- \n- English\n- !binary |\n VMO8cmvDp2U=\n\n- \"\\xD0\\xA0\\xD1\\x83\\xD1\\x81\\xD1\\x81\\xD0\\xBA\\xD0\\xB8\\xD0\\xB9\"\n"

l2 = YAML.load(l.to_yaml)
=> ["English", "T\xC3\xBCrk\xC3\xA7e", "Русский"]

l2.map(&:encoding)
=> [#<Encoding:UTF-8>, #<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>]
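
To make the consequence concrete, here is a minimal sketch of why that one
ASCII-8BIT element causes trouble later. It uses only core String and
Encoding methods and does not involve YAML or Active Record; the helper name
combinable? is made up for this illustration.

```ruby
# encoding: utf-8

# Hypothetical helper: true when Ruby is able to combine the two strings.
def combinable?(a, b)
  !Encoding.compatible?(a, b).nil?
end

utf8   = "Türkçe"                                        # tagged UTF-8
binary = utf8.dup.force_encoding(Encoding::ASCII_8BIT)   # same bytes, other tag

puts utf8.bytes.to_a == binary.bytes.to_a  # => true, the bytes are identical
puts utf8 == binary                         # => false, the encodings differ
puts combinable?("Русский", binary)         # => false, concatenating would raise
puts combinable?("English", binary)         # => true, ASCII-only text still mixes
```

So the string looks fine until it is compared or combined with real UTF-8
text that contains non-ASCII characters.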

Does anyone know how YAML decides whether to store a string as binary
or as an escaped string? Both of the last two strings above contain
non-ASCII characters, but only the first of them is stored as binary…
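
This does not answer the question of how the choice is made, but a small
sketch shows why the !binary entry comes back as ASCII-8BIT: the payload is
plain base64, and decoding base64 yields raw bytes with no text encoding
attached. The helper name decode_binary_scalar is an assumption for this
illustration, not something the YAML library exposes.

```ruby
# encoding: utf-8

# Hypothetical helper: base64-decode the payload of a !binary scalar.
def decode_binary_scalar(payload)
  payload.unpack("m").first
end

decoded = decode_binary_scalar("VMO8cmvDp2U=")   # payload from the dump above

puts decoded.encoding                             # the binary (ASCII-8BIT) encoding
puts decoded.bytes.to_a == "Türkçe".bytes.to_a    # => true, same bytes as the original

# Re-tagging the bytes (no transcoding) gives back the original string.
restored = decoded.dup.force_encoding(Encoding::UTF_8)
puts restored == "Türkçe"                         # => true
```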

From a quick scan of your question, perhaps ya2yaml
(http://rubyforge.org/projects/ya2yaml/) would help? ‘Ya2YAML is “yet
another to_yaml”. It emits YAML document with complete UTF8 support
(string/binary detection, “\u” escape sequences and Unicode specific
line breaks).’

Thanks, ya2yaml is a good suggestion. I took a look at it and it does the
right thing (and would work, except I had trouble getting it to play nicely
with Active Record etc.). I did come up with a different solution that
I’m posting here in case other people run into the same issue.

Monkey patching String can force YAML to use \ escaping rather than
binary, and therefore return strings in the default encoding (UTF-8)
rather than ASCII-8BIT:

class String
  # Only treat a string as binary if it is explicitly tagged ASCII-8BIT.
  def is_binary_data?
    encoding == Encoding::ASCII_8BIT unless empty?
  end
end

Originally this routine used some heuristics around which would be
shorter, \ escaping or binary encoding of the string, which is why only
some of the international strings I had were having problems.
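
For data that was already written with !binary entries, a different and
untested-in-Rails idea is to walk the loaded structure and re-tag strings
after the fact. The sketch below is only an illustration of that idea, not
what I actually deployed; retag_utf8 is a made-up name, and it assumes every
ASCII-8BIT string whose bytes are valid UTF-8 was really meant to be text.

```ruby
# encoding: utf-8

# Hypothetical helper: recursively re-tag ASCII-8BIT strings as UTF-8
# when their bytes form valid UTF-8. Other strings are returned unchanged.
def retag_utf8(obj)
  case obj
  when String
    return obj unless obj.encoding == Encoding::ASCII_8BIT
    candidate = obj.dup.force_encoding(Encoding::UTF_8)
    candidate.valid_encoding? ? candidate : obj   # keep genuinely binary data
  when Array
    obj.map { |item| retag_utf8(item) }
  when Hash
    obj.each_with_object({}) { |(k, v), out| out[retag_utf8(k)] = retag_utf8(v) }
  else
    obj
  end
end

# Simulate what comes back from the database in the example above.
loaded = ["English", "Türkçe".dup.force_encoding(Encoding::ASCII_8BIT), "Русский"]
puts retag_utf8(loaded).map(&:encoding).inspect
```

The obvious weakness is that real binary data which happens to be valid
UTF-8 would be re-tagged too, so this only fits fields known to hold text.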