A few good articles on Unicode

To add a little fuel to the discussion (and to help dispel some rumors,
myths, and legends about Unicode), I present you with Tim B.'s 4-part
trilogy of articles on Unicode, why it’s important, and why you should
use it. The first article provides a nice overview, even mentioning some
of the political and technical difficulties of CJK languages and Unicode
(as well as the previously mentioned gaiji). The second article
discusses character strings in general. The third, perhaps most relevant
to the Ruby Unicode discussion, is an exploration of characters versus
bytes and how the various encodings work. The fourth article discusses
Java’s use of UTF-16 internally, and why that may be a good or bad
thing.

At any rate, they’re entertaining to read and cleared up a number of my
own questions about Unicode. Perhaps they will help the rest of us in
the Ruby community to understand Unicode as well.

Part 1: On the Goodness of Unicode -
http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
Part 2: On Character Strings -
http://www.tbray.org/ongoing/When/200x/2003/04/13/Strings
Part 3: Characters vs. Bytes -
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
Part 4: Programming Languages and Text -
http://www.tbray.org/ongoing/When/200x/2003/04/30/JavaStrings
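
For readers who want to see the characters-versus-bytes distinction from
Part 3 in concrete terms, here is a minimal sketch. It assumes a Ruby
with encoding-aware strings (1.9 or later, which did not exist when this
thread was written); the helper name encoding_sizes and the sample text
are made up for illustration and do not come from the articles.

```ruby
# Hypothetical helper: report how large the same text is in each encoding.
def encoding_sizes(text)
  {
    code_points: text.codepoints.length,               # Unicode code points
    utf8_bytes:  text.encode("UTF-8").bytesize,        # 1 to 4 bytes per code point
    utf16_bytes: text.encode("UTF-16LE").bytesize,     # 2 or 4 bytes per code point
    utf32_bytes: text.encode("UTF-32LE").bytesize      # always 4 bytes per code point
  }
end

# "héllo " followed by U+1D11E (musical G clef), which lies outside the
# Basic Multilingual Plane and so needs a surrogate pair in UTF-16.
sample = "h\u00E9llo \u{1D11E}"
sizes = encoding_sizes(sample)
puts sizes.inspect
```

The sample has 7 code points but occupies 11 bytes in UTF-8, 16 bytes in
UTF-16, and 28 bytes in UTF-32, so "length" depends entirely on what is
being counted.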

And while not directly related, Tim also fiddled with a fully
Unicode-aware UTF-8 string class in Java with many of the typical C
string operations (strcpy, strstr, …). Some of the logic he uses for his
byte-vector-as-Unicode-string might be applicable to Ruby as well:

Yooster (Ustr):
http://www.tbray.org/ongoing/When/200x/2003/05/17/Yooster
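
To give a rough idea of what treating a byte vector as a Unicode string
involves, here is a small sketch in Ruby. It is not Tim's Ustr code; the
method names are hypothetical and it assumes the bytes are already
well-formed UTF-8. It relies only on the fact that UTF-8 continuation
bytes always have the bit pattern 10xxxxxx.

```ruby
# Count code points in an array of UTF-8 bytes by skipping continuation
# bytes (those of the form 10xxxxxx). Assumes well-formed UTF-8.
def utf8_code_point_count(bytes)
  bytes.count { |b| (b & 0xC0) != 0x80 }
end

# Return the byte offset at which the code point with the given index
# starts, or nil if the index is past the end of the data.
def utf8_byte_offset(bytes, char_index)
  seen = -1
  bytes.each_with_index do |b, offset|
    next if (b & 0xC0) == 0x80   # continuation byte, not a new character
    seen += 1
    return offset if seen == char_index
  end
  nil
end

# "héllo " followed by U+1D11E, as raw bytes.
bytes = "h\u00E9llo \u{1D11E}".bytes
puts utf8_code_point_count(bytes)
puts utf8_byte_offset(bytes, 2).inspect
```

Both operations have to walk the bytes, which is the usual trade-off of a
UTF-8 byte vector: compact storage, but character indexing is linear
rather than constant time.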

On Jun 16, 2006, at 1:15 AM, Charles O Nutter wrote:

The fourth article discusses Java’s use of UTF-16
internally, and why that may be a good or bad thing.

Excellent! I’m particularly interested to learn more about the pros and
cons of using UTF-16 internally for all strings (Java) versus being able
to specify a different encoding for each string object (Ruby 2.0).

Thanks for sharing,

Daesan

Dae San H.
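
As an illustration of the per-string encoding approach mentioned above,
here is a short sketch. It assumes the String API that eventually shipped
with Ruby 1.9 and later (String#encoding and String#encode); the helper
name describe is made up for this example.

```ruby
# Hypothetical helper: summarize a string as [encoding name, bytes, characters].
def describe(str)
  [str.encoding.name, str.bytesize, str.length]
end

utf8   = "caf\u00E9"                   # \u escapes produce a UTF-8 string
utf16  = utf8.encode("UTF-16LE")       # same text, transcoded to UTF-16
latin1 = utf8.encode("ISO-8859-1")     # same text, one byte per character

[utf8, utf16, latin1].each { |s| puts describe(s).inspect }

# Transcoding back to a common encoding recovers the same text.
round_trip = utf16.encode("UTF-8") == utf8
puts round_trip
```

Each string object carries its own encoding, so the same four characters
occupy 5, 8, and 4 bytes respectively. In a design like Java's, every
string would use the UTF-16 form and no per-object tag would be needed.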

“Charles O Nutter” writes:

Part 4: Programming Languages and Text -
http://www.tbray.org/ongoing/When/200x/2003/04/30/JavaStrings

And while not directly related, Tim also fiddled with a
fully-unicode-supporting UTF-8 string class in Java with many of the typical
C string operations (strcpy, strstr, …). Some of the logic he uses for his
byte-vector-as-unicode-string might be applicable to Ruby as well:

Yooster (Ustr):
http://www.tbray.org/ongoing/When/200x/2003/05/17/Yooster

While we are at it, also see
“The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)”

On Thursday 15 June 2006 3:50 pm, Christian N. wrote:

While we are at it, also see

And it’s probably worth mentioning that O’Reilly has a 678-page book on
Unicode coming to bookstores by the end of the month:

http://www.oreilly.com/catalog/unicode/index.html

HTH,
Keith