Since there’s been a lot of talk about Unicode lately, I thought I’d
throw out a Ruby library I’ve been working on to support Unicode
characters and strings based on the 4.1.0 standard and key
specifications from the Unicode Consortium.
ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2
The library adds an encoding property to native String objects, and
allows conversion to and from Unicode::String and Unicode::Character.
A default encoding is chosen based on $KCODE, or the default can be
set/changed explicitly via String.default_encoding.
Unicode strings can be obtained by applying the + unary operator to
native strings, e.g. +"Hello" (where the native string is encoded in
the default encoding).
% irb -I. -runicode -Ku
irb(main):001:0> ustr = +"π is pi"
=> +"π is pi"
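For readers wondering how a unary + can produce a different kind of object, here is a minimal sketch. It is not the library's code: SketchUString and NativeStr are hypothetical names, and the operator is defined on a String subclass only so the sketch does not patch the core String class (current Ruby already defines String#+@ for another purpose).

```ruby
# Hypothetical stand-in for a Unicode string type: it stores code points.
class SketchUString
  attr_reader :codepoints

  def initialize(codepoints)
    @codepoints = codepoints
  end

  # Convert back to a native string, optionally in another encoding.
  def to_s(encoding = Encoding::UTF_8)
    @codepoints.pack("U*").encode(encoding)
  end

  # Mimic the +"..." display used in the transcript.
  def inspect
    "+" + to_s.inspect
  end
end

# Hypothetical String subclass carrying the unary + operator.
class NativeStr < String
  # Ruby calls +@ when it sees a unary plus applied to the receiver.
  def +@
    SketchUString.new(encode(Encoding::UTF_8).codepoints)
  end
end

native = NativeStr.new("\u03C0 is pi")
ustr = +native
p ustr
```

The point is only that +@ is an ordinary method, so a library can make it return whatever wrapper type it likes.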
Native strings are obtained from Unicode strings by calling to_s,
which accepts an optional argument to indicate the desired encoding.
irb(main):002:0> str = ustr.to_s
=> "π is pi"
irb(main):003:0> str.encoding
=> Unicode::Encoding::UTF8
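As a point of comparison, here is roughly the same round trip using only the methods built into current Ruby (String#encode and String#encoding), not this library. It assumes a Ruby recent enough to tag every string with an encoding.

```ruby
# A literal containing a \u escape is always UTF-8.
utf8 = "\u03C0 is pi"

# Transcode to UTF-16 (big endian); the characters stay the same,
# only the byte representation changes.
utf16 = utf8.encode(Encoding::UTF_16BE)

puts utf8.encoding    # UTF-8
puts utf8.bytesize    # 8  (pi takes two bytes, the rest one each)
puts utf16.encoding   # UTF-16BE
puts utf16.bytesize   # 14 (seven characters, two bytes each)

# Converting back yields the original string.
puts utf16.encode(Encoding::UTF_8) == utf8   # true
```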
Individual characters can be indexed from Unicode strings, returning
a Unicode::Character object.
irb(main):004:0> ustr[0]
=> U+03C0 GREEK SMALL LETTER PI
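The difference between indexing by character and indexing by byte can be illustrated with plain current Ruby, where String#[] is already character-based. This is only a sketch of the idea; it prints the U+XXXX label but not the character name, since the name would have to come from the Character Database.

```ruby
str = "\u03C0 is pi"

# Character-based indexing: the first element is the whole letter pi.
first = str[0]
label = format("U+%04X", first.ord)
puts label            # U+03C0

# Byte-based view of the same position: only the first UTF-8 byte.
puts str.getbyte(0)   # 207 (0xCF, the lead byte of the two-byte sequence)
```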
Case conversion is handled as with native strings.
irb(main):005:0> ustr.upcase
=> +"Π IS PI"
Normalization is accomplished with the ~ unary operator.
irb(main):006:0> ustr = +"mí"
=> +"mí"
irb(main):007:0> ustr.to_a
=> [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
irb(main):008:0> (~ustr).each_char { |ch| p ch }
U+006D LATIN SMALL LETTER M
U+0069 LATIN SMALL LETTER I
U+0301 COMBINING ACUTE ACCENT
=> +"mí"
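The output above is consistent with a canonical decomposition (NFD), though the post does not say which form ~ applies. Assuming NFD, the same result can be reproduced with String#unicode_normalize from current Ruby's standard library, which is not part of this library:

```ruby
# Precomposed form: m followed by U+00ED (i with acute).
composed = "m\u00ED"

# Canonical decomposition splits U+00ED into a base letter and a combining mark.
decomposed = composed.unicode_normalize(:nfd)

labels = decomposed.codepoints.map { |cp| format("U+%04X", cp) }
p labels   # ["U+006D", "U+0069", "U+0301"]

# The two forms differ code point by code point,
# but recomposing (NFC) gives back the original.
puts decomposed == composed                           # false
puts decomposed.unicode_normalize(:nfc) == composed   # true
```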
There is much more – character properties, text boundaries (grapheme
clusters and words), Hangul decompositions, modular encodings (ASCII,
Latin1, EUC, SJIS, UTF32, UTF16, UTF8) – yet the project is
unfinished. If anyone is interested in helping develop it further,
let me know.
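Of the features listed, Hangul decomposition is the one that is pure arithmetic, so here is a sketch of the algorithm as the Unicode Standard describes it. The module and method names are mine, not the library's, and the library's own implementation may well differ.

```ruby
# Arithmetic decomposition of precomposed Hangul syllables into jamo.
module HangulSketch
  S_BASE  = 0xAC00            # first precomposed syllable
  L_BASE  = 0x1100            # first leading consonant
  V_BASE  = 0x1161            # first vowel
  T_BASE  = 0x11A7            # one before the first trailing consonant
  V_COUNT = 21
  T_COUNT = 28                # includes "no trailing consonant"
  N_COUNT = V_COUNT * T_COUNT # 588 syllables per leading consonant
  S_COUNT = 19 * N_COUNT      # 11172 syllables in total

  # Returns an array of jamo code points, or [cp] for non-syllables.
  def self.decompose(cp)
    s_index = cp - S_BASE
    return [cp] unless (0...S_COUNT).cover?(s_index)

    l = L_BASE + s_index / N_COUNT
    v = V_BASE + (s_index % N_COUNT) / T_COUNT
    t = T_BASE + s_index % T_COUNT
    t == T_BASE ? [l, v] : [l, v, t]
  end
end

jamo = HangulSketch.decompose(0xD55C)    # the syllable HAN
p jamo.map { |cp| format("U+%04X", cp) } # ["U+1112", "U+1161", "U+11AB"]
```

Because the mapping is computed, no table entries are needed for the 11172 syllables, which may be relevant given the size of the bundled database.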
The library incorporates the entire Unicode 4.1.0 Character Database
(demand-loaded!), which is why the archive is rather large.
Cheers,