YAML + ASCII Encoded Unicode



Hopefully someone can help me fight through this. I've been spinning my
wheels for the past eight hours and am making little traction (BTW, I'm
on 1.8, but could possibly go to 1.9 if some of its file encoding
facilities would solve my woes).


When reading a file generated by YAML.dump, I see strings that read as
such: Donn\xC3\xA9es de votre location de voiture

What I need to do is get those hex characters translated into their
proper Unicode representation (easy enough) and then all my bytes packed
up into some nice array/string that I can then persist back to disk and
see the appropriate Unicode characters (assuming the editor I’m using
can support it).

I’m admittedly not the best when it comes to character encoding - and
YAML has thrown me for a loop by introducing ‘ASCII encoded unicode’
into my life.

I busted open the implementation of unescape in Ruby and saw the
following (http://pastebin.com/m1fc723cf), implemented it myself, but I
still wind back up at the same place, a dump of characters that are
decidedly not what I was hoping to see.

Anyone know how I can get my unicode characters represented?


Something like this should do:

data = File.read("source.txt")
# Replace each literal \xHH escape with the single byte it names
data.gsub!(/\\x([0-9a-f]{2})/i) { $1.hex.chr }
File.open("output.txt", "w") { |f| f.write(data) }

(You can check with ‘hexdump -C output.txt’ that you get a single byte
for each \xHH code)

Then just open the file with any text editor which supports UTF-8.
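
If you do move to 1.9 or later, strings carry an encoding, so it may be safer to do the substitution on raw bytes and only tag the result as UTF-8 at the end. A minimal sketch, assuming Ruby 2.0+ (for String#b) and assuming the escaped bytes really are UTF-8; decode_hex_escapes is a made-up helper name, not part of any library:

```ruby
# Hypothetical helper: turns literal "\xHH" sequences into the raw bytes
# they name, then labels the result as UTF-8.
def decode_hex_escapes(text)
  # Work on a binary copy so high bytes can be spliced in without
  # encoding-compatibility errors.
  bytes = text.b.gsub(/\\x([0-9a-f]{2})/i) { $1.hex.chr }
  # Assumption: the escaped bytes form valid UTF-8.
  bytes.force_encoding("UTF-8")
end

# Single quotes keep the backslashes literal, as they appear in the dump.
decoded = decode_hex_escapes('Donn\xC3\xA9es de votre location de voiture')
puts decoded
puts decoded.valid_encoding?
```

If valid_encoding? comes back false, the bytes were probably not UTF-8 to begin with, and the force_encoding step would need a different encoding name.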

However, this file may be invalid YAML - so it may only be useful for
reading and copy/pasting from.

In that case, you might be better off just reading the YAML into Ruby
and then outputting the parts you’re interested in.

require 'yaml'
src = "---\nfoo: \"Donn\\xC3\\xA9es de votre location\""
data = YAML.load(src)
puts data["foo"]
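
One caveat if you go to 1.9.3 or later: as far as I know, the newer YAML parser (Psych) reads each \xHH as a code point rather than a byte, so the loaded string can come out as "DonnÃ©es". A rough sketch of undoing that, assuming every affected character fits in ISO-8859-1; repair_latin1_mojibake is a made-up name, and it leaves the string alone if the round trip doesn't produce valid UTF-8:

```ruby
require 'yaml'

# Hypothetical helper: maps each character back to a single byte via
# ISO-8859-1, then reinterprets those bytes as UTF-8.
def repair_latin1_mojibake(str)
  repaired = str.encode("ISO-8859-1").force_encoding("UTF-8")
  # Keep the original if the result is not valid UTF-8.
  repaired.valid_encoding? ? repaired : str
rescue EncodingError
  # A character had no ISO-8859-1 equivalent, so this was not that kind of damage.
  str
end

src = "---\nfoo: \"Donn\\xC3\\xA9es de votre location\"\n"
data = YAML.load(src)
puts repair_latin1_mojibake(data["foo"])
```

This is a heuristic: a string that legitimately contains something like "Ã©" would be rewritten too, so it's probably best applied only to values you know came from such a dump.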