I'm using Rails 2.3.8 with Ruby 1.9.1 and I'm having a problem with serialized attributes in ActiveRecord not preserving string encodings. The underlying problem is probably YAML, but I'm wondering if anyone has any good ideas on how to handle this. The app I'm working on has numerous serialized fields, some of which contain deep structures of arrays and hashes. Getting back an ASCII-8BIT string (that's actually UTF-8) deep within those structures wreaks havoc later...

Perhaps this is best illustrated by example: if I save l to a serialized attribute in an ActiveRecord model, I get back l2 on reading from the database.

    >> l
    => ["English", "Türkçe", "Русский"]
    >> l.map(&:encoding)
    => [#<Encoding:UTF-8>, #<Encoding:UTF-8>, #<Encoding:UTF-8>]
    >> l.map(&:valid_encoding?)
    => [true, true, true]
    >> l.to_yaml
    => "--- \n- English\n- !binary |\n VMO8cmvDp2U=\n\n- \"\\xD0\\xA0\\xD1\\x83\\xD1\\x81\\xD1\\x81\\xD0\\xBA\\xD0\\xB8\\xD0\\xB9\"\n"
    >> l2 = YAML.load(l.to_yaml)
    => ["English", "T\xC3\xBCrk\xC3\xA7e", "Русский"]
    >> l2.map(&:encoding)
    => [#<Encoding:UTF-8>, #<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>]

Does anyone know how YAML decides whether to store a string as binary or as an escaped string? Both of the last two strings above contain non-7-bit-ASCII characters, but only the first of them is stored as binary...
on 2010-09-02 21:48
on 2010-09-02 22:00
From a quick scan of your question, perhaps ya2yaml (http://rubyforge.org/projects/ya2yaml/) would help? 'Ya2YAML is "yet another to_yaml". It emits YAML document with complete UTF8 support (string/binary detection, "\u" escape sequences and Unicode specific line breaks).'
on 2010-09-02 22:40
Thanks, ya2yaml is a good suggestion. I took a look at it and it does the right thing (and would work, except that I had trouble getting it to play nicely with ActiveRecord etc.). I did come up with a different solution, which I'm posting here in case other people run into the same issue. Monkey patching String can force YAML to use \ escaping rather than binary, and therefore to return strings in the default encoding (UTF-8) rather than ASCII-8BIT:

    class String
      # Treat a string as binary only if it is explicitly tagged ASCII-8BIT.
      def is_binary_data?
        encoding == Encoding::ASCII_8BIT unless empty?
      end
    end

Originally this routine uses some heuristics around whether \ escaping or binary encoding of the string would be shorter, which is why only some of the international strings I had were having problems.
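For data that was already written with the !binary tag, a different workaround may be to re-tag the strings after loading instead of changing how they are dumped. The following is only a minimal sketch of that idea, not something from the posts above: retag_utf8 is a hypothetical helper name, and it assumes that any ASCII-8BIT string whose bytes happen to be valid UTF-8 was meant to be UTF-8, which would be wrong for genuinely binary fields.

```ruby
# Hypothetical helper (not part of Rails or YAML): walks a deserialized
# structure and re-tags ASCII-8BIT strings as UTF-8 when their bytes are
# valid UTF-8. Strings that are not valid UTF-8 are returned unchanged.
def retag_utf8(obj)
  case obj
  when String
    if obj.encoding == Encoding::ASCII_8BIT
      # force_encoding only changes the encoding tag; the bytes stay the same.
      candidate = obj.dup.force_encoding(Encoding::UTF_8)
      candidate.valid_encoding? ? candidate : obj
    else
      obj
    end
  when Array
    obj.map { |item| retag_utf8(item) }
  when Hash
    # Rebuild the hash so that both keys and values are re-tagged.
    obj.each_with_object({}) do |(key, value), out|
      out[retag_utf8(key)] = retag_utf8(value)
    end
  else
    obj
  end
end
```

It could be applied to a value such as l2 from the question, e.g. retag_utf8(l2).map(&:encoding), right after the serialized attribute is read. It returns new arrays, hashes, and strings rather than modifying the originals.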