On Sep 29, 2008, at 7:04 PM, Simon Watson wrote:
page.
Regards,
Do you mean to say that x holds a Unicode code point? If that's the
case (since ASCII is a subset of Unicode, x.to_s => "1046" is
trivial), then you can use something like this code I wrote a while
back:
irb> ("U+"+('0'*4+x.to_s(16))[-4,4]).to_utf8
=> "\320\226"
Of course, you could hide most of that in an Integer#to_utf8 method.
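One way such an Integer#to_utf8 might look, as a minimal sketch. It assumes Ruby 1.9 or later and swaps in Array#pack with the 'U' directive instead of calling the String#to_utf8 helper from this post, so it is an alternative technique rather than a wrapper around the expression above. Nothing named Integer#to_utf8 exists in core Ruby; the method is hypothetical and takes its name from the sentence above.

```ruby
# Hypothetical Integer#to_utf8; not part of Ruby's core library.
class Integer
  # Treat the integer as a Unicode code point and return that
  # character as a UTF-8 encoded String.
  def to_utf8
    unless (0..0x10FFFF).cover?(self)
      raise ArgumentError, "code point out of range: #{self}"
    end
    # The 'U' directive packs each integer as one UTF-8 character.
    [self].pack('U')
  end
end

x = 1046                 # decimal for U+0416
p x.to_utf8.bytes        # => [208, 150], i.e. "\320\226" in octal
```

With this in place the irb line above shrinks to x.to_utf8, at the cost of relying on pack instead of the hand-written bit shifting.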
-Rob
Rob B. http://agileconsultingllc.com
[email protected]
# -*- ruby -*-
class String
  # For a string that matches /(?i:U\+?|\\u)?[[:xdigit:]]{4}/, return a
  # suitable UTF-8 string for that character.
  def to_utf8
    case point = self.match(/[[:xdigit:]]{4}/)[0].to_i(16)
    when 0..0x7f
      # One byte: plain ASCII.
      point.chr
    when 0x80..0x07ff
      # Two bytes: 110yyyyy 10xxxxxx
      x = point & 0b111111
      point >>= 6
      y = point
      "#{(0xC0 | y).chr}#{(0x80 | x).chr}"
    when 0x0800..0xFFFF
      # Three bytes: 1110zzzz 10yyyyyy 10xxxxxx
      x = point & 0b111111
      point >>= 6
      y = point & 0b111111
      point >>= 6
      z = point
      "#{(0xE0 | z).chr}#{(0x80 | y).chr}#{(0x80 | x).chr}"
    when 0x10000..0x10FFFF
      raise NotImplementedError, "UTF-8 four byte sequences not yet supported"
    else
      raise ArgumentError, "Values above U+10FFFF are not supported"
    end
  end
end
if __FILE__ == $0
  require 'test/unit'
  class UnicodeHelperTest < Test::Unit::TestCase
    def test_ascii
      assert_equal '!', "U+0021".to_utf8, 'EXCLAMATION MARK'
      assert_equal 'A', "U+0041".to_utf8, 'UPPERCASE LETTER A'
      assert_equal '-', "U+002D".to_utf8, 'HYPHEN-MINUS'
      assert_equal '~', "U+007E".to_utf8, 'TILDE'
      assert_equal '!', "0021".to_utf8, 'EXCLAMATION MARK'
      assert_equal 'A', "0041".to_utf8, 'UPPERCASE LETTER A'
      assert_equal '-', "002D".to_utf8, 'HYPHEN-MINUS'
      assert_equal '~', "007E".to_utf8, 'TILDE'
      assert_equal '!', "\\u0021".to_utf8, 'EXCLAMATION MARK'
      assert_equal 'A', "\\u0041".to_utf8, 'UPPERCASE LETTER A'
      assert_equal '-', "\\u002D".to_utf8, 'HYPHEN-MINUS'
      assert_equal '~', "\\u007E".to_utf8, 'TILDE'
    end
    def test_hi_bit_ascii
      # U+0080 is a C1 control character and U+00A4 is the currency sign;
      # C-cedilla and n-tilde sit at 0x80 and 0xA4 only in code page 437.
      assert_equal "\xC2\x80", "U+0080".to_utf8, "C1 control U+0080"
      assert_equal "\xC2\xA4", "U+00A4".to_utf8, "CURRENCY SIGN"
    end
    def test_general_punctuation
      assert_equal "\342\200\220", "U+2010".to_utf8, "HYPHEN"
      assert_equal "\342\200\221", "U+2011".to_utf8, "NON-BREAKING HYPHEN"
      assert_equal "\342\200\222", "U+2012".to_utf8, "FIGURE DASH"
      assert_equal "\342\200\223", "U+2013".to_utf8, "EN DASH"
      assert_equal "\342\200\224", "U+2014".to_utf8, "EM DASH"
      assert_equal "\342\200\225", "U+2015".to_utf8, "QUOTATION DASH"
    end
  end
end
__END__
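
The four-byte branch above is left unimplemented. A rough sketch of how it could be filled in follows, using the same shift-and-mask pattern as the two- and three-byte branches. The helper name utf8_four_bytes is made up for illustration and is not part of the original code. The sketch also assumes Ruby 1.9 or later, since it uses force_encoding to tag the result as UTF-8.

```ruby
# Hypothetical helper, not part of the original String#to_utf8.
# Encodes a supplementary-plane code point (U+10000..U+10FFFF) as the
# four-byte sequence 11110www 10zzzzzz 10yyyyyy 10xxxxxx.
def utf8_four_bytes(point)
  unless (0x10000..0x10FFFF).cover?(point)
    raise ArgumentError, "expected a code point in U+10000..U+10FFFF"
  end
  x = point & 0b111111          # lowest 6 bits
  y = (point >> 6) & 0b111111   # next 6 bits
  z = (point >> 12) & 0b111111  # next 6 bits
  w = point >> 18               # remaining top 3 bits
  bytes = [0xF0 | w, 0x80 | z, 0x80 | y, 0x80 | x]
  # Assemble the raw bytes, then label the string as UTF-8.
  bytes.pack('C*').force_encoding('UTF-8')
end

p utf8_four_bytes(0x1F600).bytes.map { |b| '%02X' % b }
# => ["F0", "9F", "98", "80"]
```

Wiring this into String#to_utf8 would also mean widening the /[[:xdigit:]]{4}/ match, since four hex digits can never reach U+10000.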