To add a little fuel to the discussion (and to help dispel some rumors,
myths, and legends about Unicode), I present you with Tim B.'s 4-part
trilogy of articles on Unicode, why it’s important, and why you should
use it. The first article provides a nice overview, even mentioning some
of the political and technical difficulties of CJK languages and Unicode
(as well as the previously-mentioned gaiji). The second article discusses
character strings in general. The third, perhaps the most relevant to the
Ruby Unicode discussion, is an exploration of characters versus bytes and
how the various encodings work. The fourth article discusses Java’s use
of UTF-16 internally, and why that may be a good or bad thing.
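
To make the characters-versus-bytes point concrete, here is a minimal
sketch. It is not taken from the articles, and it assumes a modern,
encoding-aware Ruby (1.9 or later); the helper name encoding_sizes and
the sample strings are made up for illustration.

```ruby
# Hypothetical helper: reports how many characters a string has and how
# many bytes it occupies in two common encodings.
def encoding_sizes(str)
  {
    chars: str.length,                              # code points
    utf8_bytes: str.encode("UTF-8").bytesize,       # 1 to 4 bytes per code point
    utf16_bytes: str.encode("UTF-16LE").bytesize    # 2 or 4 bytes per code point
  }
end

# "cafe" with an accented e, written as an escape so the source stays ASCII.
p encoding_sizes("caf\u00E9")

# U+1D11E (musical G clef) lies outside the Basic Multilingual Plane,
# so UTF-16 needs a surrogate pair for it.
p encoding_sizes("\u{1D11E}")
```

The accented string has 4 characters but 5 bytes in UTF-8 and 8 in
UTF-16, and the single clef character takes 4 bytes in either encoding,
so neither encoding gives a fixed number of bytes per character.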
At any rate, they’re entertaining to read and cleared up a number of my
own questions about Unicode. Perhaps they will help the rest of us in the
Ruby community to understand Unicode as well.
Part 1: On the Goodness of Unicode -
http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
Part 2: On Character Strings -
http://www.tbray.org/ongoing/When/200x/2003/04/13/Strings
Part 3: Characters vs. Bytes -
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
Part 4: Programming Languages and Text -
http://www.tbray.org/ongoing/When/200x/2003/04/30/JavaStrings
And while not directly related, Tim also fiddled with a UTF-8 string
class in Java that fully supports Unicode and offers many of the typical
C string operations (strcpy, strstr, …). Some of the logic he uses for
his byte-vector-as-Unicode-string might be applicable to Ruby as well:
Yooster (Ustr):
http://www.tbray.org/ongoing/When/200x/2003/05/17/Yooster
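
For a rough idea of what treating a byte vector as a Unicode string can
look like in Ruby, here is a small sketch. It is my own illustration, not
code or logic taken from Ustr; the method names utf8_length and
byte_index are hypothetical, and the input is assumed to be valid UTF-8.

```ruby
# Count code points in an array of UTF-8 bytes. Continuation bytes have
# the bit pattern 10xxxxxx, so every other byte starts a new code point.
def utf8_length(bytes)
  bytes.count { |b| (b & 0xC0) != 0x80 }
end

# Byte-level substring search in the spirit of C's strstr. Returns the
# byte offset of the first match, or nil. Plain byte comparison is enough
# here because a valid UTF-8 sequence can only match at a code point
# boundary.
def byte_index(haystack, needle)
  return 0 if needle.empty?
  (0..haystack.length - needle.length).each do |i|
    return i if haystack[i, needle.length] == needle
  end
  nil
end

text = "na\u00EFve caf\u00E9".bytes   # sample text as an array of bytes
word = "caf\u00E9".bytes

puts utf8_length(text)       # number of code points
puts byte_index(text, word)  # byte offset, not character offset
```

Note that byte_index reports an offset in bytes; for the sample above it
is 7, while the character offset of the same match would be 6.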