Ruby/Unicode library

rob · June 18, 2006, 8:19pm

Since there’s been a lot of talk about Unicode lately, I thought I’d
throw out a Ruby library I’ve been working on to support Unicode
characters and strings based on the 4.1.0 standard and key
specifications from the Unicode Consortium.

ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2

The library adds an encoding property to native String objects, and
allows conversion to and from Unicode::String and Unicode::Character.
A default encoding is chosen based on $KCODE, or the default can be
set/changed explicitly via String.default_encoding.

Unicode strings can be obtained by applying the unary + operator to
native strings, e.g. +"Hello" (where the native string is encoded in
the default encoding).

% irb -I. -runicode -Ku
irb(main):001:0> ustr = +"π is pi"
=> +"π is pi"

Native strings are obtained from Unicode strings by calling to_s,
which accepts an optional argument to indicate the desired encoding.

irb(main):002:0> str = ustr.to_s
=> "π is pi"
irb(main):003:0> str.encoding
=> Unicode::Encoding::UTF8
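
For comparison, current Ruby versions (1.9 and later) tag every String with an encoding out of the box. The following is a minimal sketch using only the built-in String#encode and String#bytesize, not this library's to_s(encoding); the helper name byte_sizes is made up for illustration.

```ruby
# Sketch with core Ruby only; byte_sizes is a hypothetical helper name.
# Transcodes the text into each encoding and reports how many bytes it takes.
def byte_sizes(text, encodings)
  encodings.map { |enc| [enc.name, text.encode(enc).bytesize] }.to_h
end

# The \u escape yields a UTF-8 string: GREEK SMALL LETTER PI followed by " is pi".
sample = "\u03C0 is pi"
puts sample.encoding
p byte_sizes(sample, [Encoding::UTF_8, Encoding::UTF_16BE, Encoding::UTF_32BE])
```

The seven-character sample takes 8 bytes in UTF-8, 14 in UTF-16 and 28 in UTF-32, which is the kind of difference the optional encoding argument controls.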

Individual characters can be indexed from Unicode strings, returning
a Unicode::Character object.

irb(main):004:0> ustr[0]
=> U+03C0 GREEK SMALL LETTER PI
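
A rough equivalent of the indexing step in plain Ruby, as a sketch only: core Ruby can report the code point number but does not ship character names, so the label below stops at the U+XXXX part. The helper name code_point_label is an assumption, not part of this library.

```ruby
# Sketch with core Ruby only; code_point_label is a hypothetical helper name.
# Returns the code point at the given character index as a "U+XXXX" label.
def code_point_label(text, index)
  format("U+%04X", text.codepoints.fetch(index))
end

puts code_point_label("\u03C0 is pi", 0)   # U+03C0
```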

Case conversion is handled as with native strings.

irb(main):005:0> ustr.upcase
=> +"Π IS PI"
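
For readers on Ruby 2.4 or later, the built-in String#upcase applies Unicode case mappings as well, so the behaviour can be tried without any library. This is a sketch with core Ruby, not this library's implementation; the helper name shout is hypothetical.

```ruby
# Sketch with core Ruby (2.4 or later); shout is a hypothetical helper name.
def shout(text)
  text.upcase
end

# GREEK SMALL LETTER PI (U+03C0) maps to GREEK CAPITAL LETTER PI (U+03A0).
puts shout("\u03C0 is pi") == "\u03A0 IS PI"   # true
# Some mappings change the length: LATIN SMALL LETTER SHARP S becomes "SS".
puts shout("stra\u00DFe")                       # STRASSE
```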

Normalization is accomplished with the ~ unary operator.

irb(main):006:0> ustr = +"mí"
=> +"mí"
irb(main):007:0> ustr.to_a
=> [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
irb(main):008:0> (~ustr).each_char { |ch| p ch }
U+006D LATIN SMALL LETTER M
U+0069 LATIN SMALL LETTER I
U+0301 COMBINING ACUTE ACCENT
=> +"mí"
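
The same decomposition can be reproduced with Ruby's built-in String#unicode_normalize (2.2 or later). The sketch below assumes canonical decomposition (NFD), since the output above shows a decomposed sequence, and wraps it in a made-up class, NormalizableText, to show how a unary ~ can be defined. It is not the library's Unicode::String.

```ruby
# Hypothetical stand-in for a Unicode string wrapper; not the library's Unicode::String.
class NormalizableText
  attr_reader :text

  def initialize(text)
    @text = text
  end

  # Unary ~ returns a new wrapper holding the canonical decomposition (NFD).
  def ~
    NormalizableText.new(@text.unicode_normalize(:nfd))
  end

  # Code points formatted as U+XXXX labels.
  def labels
    @text.codepoints.map { |cp| format("U+%04X", cp) }
  end
end

composed = NormalizableText.new("m\u00ED")
p(composed.labels)      # ["U+006D", "U+00ED"]
p((~composed).labels)   # ["U+006D", "U+0069", "U+0301"]
```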

There is much more – character properties, text boundaries (grapheme
clusters and words), Hangul decompositions, modular encodings (ASCII,
Latin1, EUC, SJIS, UTF32, UTF16, UTF8) – yet the project is
unfinished. If anyone is interested in helping develop it further,
let me know.
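
As an illustration of one item on that list, Hangul syllables can be decomposed arithmetically rather than from a table. The sketch below uses the constants given in the Unicode Standard; the module name HangulSketch is hypothetical and this is not taken from the library's source.

```ruby
# Arithmetic Hangul syllable decomposition with the constants from the Unicode Standard.
module HangulSketch   # hypothetical name, not part of the library
  S_BASE = 0xAC00
  L_BASE = 0x1100
  V_BASE = 0x1161
  T_BASE = 0x11A7
  V_COUNT = 21
  T_COUNT = 28
  N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
  S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables

  module_function

  # Returns the jamo code points for a precomposed syllable,
  # or the code point unchanged when it is not a Hangul syllable.
  def decompose(code_point)
    s_index = code_point - S_BASE
    return [code_point] unless (0...S_COUNT).cover?(s_index)

    lead = L_BASE + s_index / N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) / T_COUNT
    trail = T_BASE + s_index % T_COUNT
    trail == T_BASE ? [lead, vowel] : [lead, vowel, trail]
  end
end

syllable = 0xD55C   # HANGUL SYLLABLE HAN
p HangulSketch.decompose(syllable).map { |cp| format("U+%04X", cp) }
# Cross-check against Ruby's built-in canonical decomposition.
p HangulSketch.decompose(syllable) == [syllable].pack("U").unicode_normalize(:nfd).codepoints
```

The first line prints U+1112, U+1161 and U+11AB, and the cross-check prints true.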

The library incorporates the entire Unicode 4.1.0 Character Database
(demand-loaded!) which is why the archive is rather large.
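
To make "demand-loaded" concrete, here is a minimal sketch of the general idea: nothing is parsed until the first lookup, and the parsed table is then reused. The class name LazyCharacterTable and its methods are assumptions for illustration, and the library's actual loader may work quite differently. The three sample records follow the semicolon-separated UnicodeData.txt layout.

```ruby
# Hypothetical demand-loading wrapper; the library's real loader may differ.
class LazyCharacterTable
  # Three records in the UnicodeData.txt layout (semicolon-separated fields).
  SAMPLE = <<~DATA
    0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
    00ED;LATIN SMALL LETTER I WITH ACUTE;Ll;0;L;0069 0301;;;;N;LATIN SMALL LETTER I ACUTE;;00CD;;00CD
    03C0;GREEK SMALL LETTER PI;Ll;0;L;;;;;N;;;03A0;;03A0
  DATA

  attr_reader :load_count

  def initialize(source)
    @source = source   # any object responding to each_line (String, File, ...)
    @table = nil
    @load_count = 0
  end

  # Character name for a code point, or nil when the table has no record.
  def name(code_point)
    record = table[code_point]
    record && record[:name]
  end

  private

  # Parses the source on first use only and memoizes the result.
  def table
    @table ||= parse
  end

  def parse
    @load_count += 1
    @source.each_line.each_with_object({}) do |line, rows|
      fields = line.chomp.split(";", -1)
      next if fields.length < 3
      rows[fields[0].hex] = { name: fields[1], category: fields[2] }
    end
  end
end

table = LazyCharacterTable.new(LazyCharacterTable::SAMPLE)
p table.load_count        # 0, nothing parsed yet
puts table.name(0x03C0)   # GREEK SMALL LETTER PI
puts table.name(0x00ED)   # LATIN SMALL LETTER I WITH ACUTE
p table.load_count        # 1, parsed once on the first lookup
```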

Cheers,

rob · June 18, 2006, 8:59pm

On 18-jun-2006, at 20:11, Rob L. wrote:

Since there’s been a lot of talk about Unicode lately, I thought
I’d throw out a Ruby library I’ve been working on to support
Unicode characters and strings based on the 4.1.0 standard and key
specifications from the Unicode Consortium.

Holy wow. But the tables are just huge.

rob · June 18, 2006, 9:14pm

On Jun 18, 2006, at 11:51 AM, Julian ‘Julik’ Tarkhanov wrote:

Since there’s been a lot of talk about Unicode lately, I thought
I’d throw out a Ruby library I’ve been working on to support
Unicode characters and strings based on the 4.1.0 standard and key
specifications from the Unicode Consortium.

Holy wow. But the tables are just huge.

I should point out that I’m not presently using most of these tables;
Unihan.txt alone is 27M. They’re included purely for completeness as
I’ve been developing the library.

No doubt the actual data storage requirements can be reduced
considerably.
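
One common way to shrink such tables, sketched here as an assumption rather than anything this library currently does, is to collapse runs of consecutive code points that share a property value into ranges and look them up by binary search. The module name RangeTable is hypothetical.

```ruby
# Hypothetical range-compression sketch; not the library's storage format.
module RangeTable
  module_function

  # pairs: array of [code_point, value], sorted by code point.
  # Returns an array of [first, last, value] runs.
  def compress(pairs)
    pairs.each_with_object([]) do |(cp, value), runs|
      last = runs.last
      if last && last[2] == value && last[1] + 1 == cp
        last[1] = cp               # extend the current run
      else
        runs << [cp, cp, value]    # start a new run
      end
    end
  end

  # Binary search for the run containing cp; nil when cp is in no run.
  def lookup(runs, cp)
    run = runs.bsearch { |_first, last, _value| last >= cp }
    run && run[0] <= cp ? run[2] : nil
  end
end

# General category for ASCII letters: 52 entries collapse into 2 runs.
pairs = (0x41..0x5A).map { |cp| [cp, "Lu"] } + (0x61..0x7A).map { |cp| [cp, "Ll"] }
runs = RangeTable.compress(pairs)
p runs.length                     # 2
p RangeTable.lookup(runs, 0x4D)   # "Lu"
p RangeTable.lookup(runs, 0x60)   # nil
```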

rob · June 18, 2006, 11:37pm

On 18/06/06, Rob L. wrote:

I should point out that I’m not presently using most of these tables;
Unihan.txt alone is 27M. They’re included purely for completeness as
I’ve been developing the library.

No doubt the actual data storage requirements can be reduced
considerably.

That’s an impressive achievement. It looks like a textbook
implementation. Thanks for sharing!

Coincidentally, I just dug up my own dormant UnicodeData.txt-based
effort - nowhere near as developed as yours - and hacked a bit on it
today, trying out some storage-reduction ideas. I’m looking forward to
trying things with your library.

Paul.