Unicode escaping fun & games



Hi folks,

After my last question, I finally sat down and figured out how to easily
do the kinds of conversions I wanted (at least the Unicode UTF-8 part).
Here’s what I came up with, in case it is useful to others who have to
exchange 7-bit-encoded data across environments.

excalibur$ cat utf8.rb
# Created: Thu Apr 23 17:03:23 IST 2009
#
# This is some quick code to deal with UTF-8 manipulation and
# serialization of 7-bit ASCII representations.
#
$KCODE = 'u'
require 'jcode'

# Replace every non-ASCII character in str with a \uXXXX escape.
def utf8_escape(str)
  s = ""
  str.each_char do |c|
    x = c.unpack("C")[0]   # first byte of the character
    if x < 128
      s << c               # plain 7-bit ASCII, copy as-is
    else
      s << "\\u%04x" % c.unpack("U")[0]
    end
  end
  s
end

# Turn \uXXXX escapes back into UTF-8 characters.
def utf8_unpack(str)
  str.gsub(/\\u([0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F])/) do
    [ $1.hex ].pack("U*")
  end
end

Running it:

excalibur$ irb
irb(main):001:0> require 'utf8'
=> true
irb(main):002:0> s = "Hello €!"
=> "Hello €!"
irb(main):003:0> t = utf8_escape(s)
=> "Hello \\u20ac!"
irb(main):004:0> u = utf8_unpack(t)
=> "Hello €!"
irb(main):005:0> s == u
=> true

excalibur$ irb
irb(main):001:0> s = "àcA绋féà"
=> "\303\240cA\347\273\213f\303\251\303\240"
irb(main):002:0> require 'utf8'
=> true
irb(main):003:0> s = "àcA绋féà"
=> "àcA绋féà"
irb(main):004:0> t = utf8_escape(s)
=> "\\u00e0cA\\u7ecbf\\u00e9\\u00e0"
irb(main):005:0> u = utf8_unpack(t)
=> "àcA绋féà"
irb(main):006:0> s == u
=> true
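
For anyone reading this on a newer interpreter: the listing above targets Ruby 1.8. The following is only a rough sketch of the same two routines, assuming Ruby 1.9 or later, where strings carry their own encoding and $KCODE and jcode are not needed. The names utf8_escape_19 and utf8_unpack_19 are made up for this illustration.

```ruby
# Sketch only, assuming Ruby 1.9 or later and UTF-8 encoded input.
# The function names are hypothetical, not part of the original utf8.rb.

# Replace every non-ASCII character with a \uXXXX escape.
def utf8_escape_19(str)
  str.each_char.map { |c|
    c.ord < 128 ? c : format("\\u%04x", c.ord)
  }.join
end

# Turn \uXXXX escapes back into UTF-8 characters.
def utf8_unpack_19(str)
  str.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
end

s = "Hello €!"
t = utf8_escape_19(s)
puts t                        # prints: Hello \u20ac!
puts utf8_unpack_19(t) == s   # prints: true
```

Like the original, this sketch only round-trips characters up to U+FFFF: a code point above that formats to more than four hex digits, which the four-digit pattern in the unpack step does not match as a whole.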

It is not 100% bullet-proof, but it works for the simple examples I
tried, so this may be as far as I need to go with that part. The next
step is to roll this into a one-pass string escaping routine, so that
you don’t need a bunch of separate gsub calls.
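
One possible shape for such a one-pass routine, as a sketch only: a single gsub whose block decides per character what to emit. It assumes Ruby 1.9 or later. The names escape_one_pass and unescape_one_pass, the choice of which control characters to escape, and the use of UTF-16 surrogate pairs (as JSON does) for characters above U+FFFF are all assumptions made for illustration, not something the original code does.

```ruby
# Sketch only, assuming Ruby 1.9 or later and UTF-8 encoded input.
# All names and the set of escapes below are hypothetical choices.

SIMPLE_ESCAPES = {
  "\\" => "\\\\", "\n" => "\\n", "\t" => "\\t", "\r" => "\\r"
}.freeze
SIMPLE_UNESCAPES = SIMPLE_ESCAPES.invert.freeze

# One gsub pass over backslashes, a few control characters and all
# non-ASCII characters. The result is pure 7-bit ASCII.
def escape_one_pass(str)
  str.gsub(/[\\\n\t\r]|[^\x00-\x7F]/) do |c|
    SIMPLE_ESCAPES.fetch(c) do
      cp = c.ord
      if cp <= 0xFFFF
        format("\\u%04x", cp)
      else
        # Above U+FFFF: emit a UTF-16 surrogate pair (assumed convention).
        cp -= 0x10000
        format("\\u%04x\\u%04x", 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
      end
    end
  end
end

# One gsub pass in the other direction. Alternatives are tried in order:
# surrogate pair, single \uXXXX, then any other backslash escape.
def unescape_one_pass(str)
  pattern = /\\u([dD][89abAB]\h\h)\\u([dD][c-fC-F]\h\h)|\\u(\h{4})|\\(.)/m
  str.gsub(pattern) do |m|
    if $1
      [0x10000 + (($1.hex - 0xD800) << 10) + ($2.hex - 0xDC00)].pack("U")
    elsif $3
      [$3.hex].pack("U")
    else
      # Unknown escapes fall back to the character after the backslash.
      SIMPLE_UNESCAPES.fetch(m, $4)
    end
  end
end

puts escape_one_pass("Hello €!")   # prints: Hello \u20ac!
```

Escaping the backslash itself is what lets input that already contains a literal \u0041 survive the round trip unchanged instead of being turned into "A" on the way back.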

Any suggestions, comments and improvements are welcome.

Cheers,

ast