Unicode escaping fun & games

Hi folks,

After my last question, I finally sat down and figured out how to easily
do the kinds of conversions I wanted (at least the Unicode UTF-8 part).
Here’s what I came up with, in case it is useful to others who have to
exchange data encoded as 7-bit ASCII across environments.

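excalibur$ cat utf8.rb
# Created: Thu Apr 23 17:03:23 IST 2009
#
# This is some quick code to deal with UTF-8 manipulation and
# serialization of 7-bit ASCII representations.

# Ruby 1.8: treat strings as UTF-8 and load jcode so that
# String#each_char walks characters rather than bytes.
$KCODE = 'u'
require 'jcode'

# Replace every non-ASCII character with a \uXXXX escape.
def utf8_escape(str)
  s = ""
  str.each_char do |c|
    x = c.unpack("C")[0]   # first byte of the character
    if x < 128
      s << c               # plain 7-bit ASCII, copy as-is
    else
      s << "\\u%04x" % c.unpack("U")[0]   # code point as 4 hex digits
    end
  end
  s
end

# Turn \uXXXX escapes back into UTF-8 characters.
def utf8_unpack(str)
  str.gsub(/\\u([0-9a-fA-F]{4})/) do
    [$1.hex].pack("U*")
  end
end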
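For anyone reading this on Ruby 1.9 or later: $KCODE and jcode are no longer available or effective there, because strings carry their own encoding. Below is a rough sketch of the same two routines for that case. The `_modern` names are made up for this example so they don't clash with the originals, and the sketch assumes the input string is UTF-8 encoded.

```ruby
# Sketch for Ruby 1.9+ (assumption: input strings are UTF-8 encoded).
# utf8_escape_modern / utf8_unpack_modern are hypothetical names.

# Replace every non-ASCII character with a \uXXXX escape.
def utf8_escape_modern(str)
  str.each_char.map do |c|
    cp = c.ord                              # Unicode code point
    cp < 128 ? c : format("\\u%04x", cp)    # keep ASCII, escape the rest
  end.join
end

# Turn \uXXXX escapes (exactly four hex digits) back into characters.
def utf8_unpack_modern(str)
  str.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
end

puts utf8_escape_modern("Hello \u20ac!")   # prints: Hello \u20ac!
```

Like the original, this only round-trips code points up to U+FFFF, since the decoder reads exactly four hex digits.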

Running it:

excalibur$ irb
irb(main):001:0> require 'utf8'
=> true
irb(main):002:0> s = "Hello €!"
=> "Hello €!"
irb(main):003:0> t = utf8_escape(s)
=> "Hello \\u20ac!"
irb(main):004:0> u = utf8_unpack(t)
=> "Hello €!"
irb(main):005:0> s == u
=> true

excalibur$ irb
irb(main):001:0> s = "àcA绋féà"
=> "\303\240cA\347\273\213f\303\251\303\240"
irb(main):002:0> require 'utf8'
=> true
irb(main):003:0> s = "àcA绋féà"
=> "àcA绋féà"
irb(main):004:0> t = utf8_escape(s)
=> "\\u00e0cA\\u7ecbf\\u00e9\\u00e0"
irb(main):005:0> u = utf8_unpack(t)
=> "àcA绋féà"
irb(main):006:0> s == u
=> true

It may not be 100% bullet-proof, but it works for the simple examples
I tried, so this may be as far as I need to go with that part.
The next step is to roll this into a one-pass string escaping routine so
you don’t need a chain of gsub calls.
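One possible reading of "one-pass" is a single gsub whose block decides what to do with each matched character. The sketch below takes that approach and also covers two cases the quick version doesn't round-trip: a literal backslash in the input, and code points above U+FFFF, which it writes as a UTF-16 surrogate pair (the convention JSON uses). The names `escape_7bit` and `unescape_7bit` are hypothetical, and the code assumes Ruby 1.9+ with UTF-8 input and no lone surrogate escapes in the data being decoded.

```ruby
# Hypothetical helpers, not from the original post. Assumes Ruby 1.9+
# and UTF-8 encoded input.

# One pass: a single gsub visits every backslash and non-ASCII character.
def escape_7bit(str)
  str.gsub(/\\|[^\x00-\x7f]/) do |c|
    cp = c.ord
    if c == "\\"
      "\\\\"                       # double literal backslashes
    elsif cp > 0xFFFF
      v = cp - 0x10000             # split into a UTF-16 surrogate pair
      format("\\u%04x\\u%04x", 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))
    else
      format("\\u%04x", cp)
    end
  end
end

# Inverse: doubled backslash, surrogate pair, or single \uXXXX escape.
def unescape_7bit(str)
  pattern = /\\\\|\\u([dD][89abAB]\h{2})\\u([dD][c-fC-F]\h{2})|\\u(\h{4})/
  str.gsub(pattern) do
    if $1
      # recombine high and low surrogate into one code point
      [0x10000 + (($1.hex - 0xD800) << 10) + ($2.hex - 0xDC00)].pack("U")
    elsif $3
      [$3.hex].pack("U")
    else
      "\\"                         # doubled backslash back to one
    end
  end
end

puts escape_7bit("Hello \u20ac!")   # prints: Hello \u20ac!
```

Doubling the backslash on the way out is what keeps text that already contains something like \u20ac from being decoded into a euro sign on the way back.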

Any suggestions, comments and improvements are welcome.

Cheers,

ast
