On Thu, Jun 15, 2006 at 06:34:11AM +0900, Randy K. wrote:
understand / recall what UTF-8 encoding is exactly:
Wikipedia has decent articles on Unicode at
Basically, Unicode gives every character worldwide a unique number,
called a code point. Since these numbers can be quite large (currently
up to 21 bits), and Western users in particular usually need only a tiny
subset, different encodings were created to save space or to remain
backward compatible with 7-bit ASCII.
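To make "a unique number per character" concrete, here is a minimal Python sketch. Python is used purely for illustration, and the sample characters and the name MAX_CODE_POINT are my own choices, not anything from the discussion above.

```python
# Sketch: a code point is just an integer assigned to a character.
# The samples are written as escapes: A, e-acute, euro sign, musical G clef.
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

for ch in samples:
    cp = ord(ch)                       # character -> code point
    print(f"U+{cp:04X} needs {cp.bit_length()} bits")
    assert chr(cp) == ch               # code point -> character

# The highest code point Unicode allows is U+10FFFF, which fits in 21 bits.
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT.bit_length())
```

Only the last sample lies above U+FFFF, which is why 16 bits are not enough.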
UTF-8 encodes every Unicode code point as a variable-length sequence
of 1 to 4 bytes. Most Western symbols require only 1 or 2
bytes. This encoding is space efficient, and ASCII compatible as long
as only 7-bit characters are used. Certain string operations
are quite hard or inefficient, though, since the position of a character, or
even the length of a string, cannot be determined from a byte stream without
counting the actual characters (no pointer/index arithmetic!).
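The gap between byte length and character length can be shown in a few lines. This is a rough Python sketch; the helper names utf8_char_count and utf8_width are made up for the example, and the counting trick assumes the input is valid UTF-8.

```python
# Sketch: byte length vs. character length in UTF-8.
text = "na\u00efve \u20ac5"            # 8 code points, two of them non-ASCII
data = text.encode("utf-8")

def utf8_char_count(data):
    """Count code points by skipping continuation bytes (0b10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)

def utf8_width(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

print(len(text), len(data))            # code points, then bytes
print(utf8_char_count(data))           # requires a scan over every byte
print([utf8_width(c) for c in "A\u00e9\u20ac\U0001D11E"])
```

Finding the n-th character needs the same kind of scan, which is the "no pointer arithmetic" problem mentioned above.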
UTF-32 encodes every code point as a single 32 bit word. This enables
simple, efficient substring access, but wastes space.
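For contrast, a small Python sketch of the fixed-width case. The helper char_at is hypothetical, and the little-endian variant is chosen only so that no byte order mark gets prepended.

```python
# Sketch: with UTF-32 the i-th character always lives at byte offset 4 * i.
text = "na\u00efve \u20ac5"
buf = text.encode("utf-32-le")         # explicit byte order, so no BOM

def char_at(buf, i):
    """Constant-time character access via plain index arithmetic."""
    return buf[4 * i : 4 * i + 4].decode("utf-32-le")

print(len(buf))                        # always 4 bytes per code point
print(char_at(buf, 6) == "\u20ac")     # the euro sign is the 7th character

# The price: pure ASCII text takes four times the space it needs in UTF-8.
ascii_text = "hello"
print(len(ascii_text.encode("utf-8")), len(ascii_text.encode("utf-32-le")))
```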
Other encodings have yet different characteristics, but all deal with
encoding the same code points. A Unicode String class should expose
code points, or sequences of code points (characters), not the
internal encoding used to store them and that is the core of my
I’m beginning to think (with a newbie sort of perspective) that Unicode is too
complicated to deal with inside a program. My suggestion would be that
Unicode be an external format…
What I mean is, when you have a program that must handle international text,
convert the Unicode to a fixed width representation for use by the program.
Do the processing based on these fixed width characters. When it’s complete,
convert it back to Unicode for output.
UTF-32 would be such an encoding. It uses quadruple space for simple 7
bit ASCII characters, but with such a dramatically larger total
character set, some tradeoffs are unavoidable.
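The "convert on input, process, convert back on output" idea could look roughly like this in Python. The function reverse_text is a made-up example, and Python's own str type stands in for the fixed-width internal form.

```python
# Sketch: decode at the boundary, work on code points, encode on the way out.
def reverse_text(raw, encoding="utf-8"):
    chars = raw.decode(encoding)       # external bytes -> code points
    reversed_chars = chars[::-1]       # processing sees whole code points
    return reversed_chars.encode(encoding)

raw = "h\u00e9llo".encode("utf-8")
print(reverse_text(raw).decode("utf-8"))

# Doing the same on the raw bytes splits the 2-byte sequence for e-acute
# and produces something that is no longer valid UTF-8.
try:
    raw[::-1].decode("utf-8")
    bytewise_ok = True
except UnicodeDecodeError:
    bytewise_ok = False
print(bytewise_ok)
```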
in the world–iirc, 16 bits (2**16) didn’t cut it for Unicode–would 32 bits?
Currently Unicode requires 21 bits, but this has changed in the past.
Java got bitten by that by defining its character type as 16 bits wide and
hardcoding this in the VM, and now it needs kludges (surrogate pairs)
to represent everything above U+FFFF.
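To illustrate what happens when a code point does not fit into 16 bits, here is a minimal Python sketch of the UTF-16 surrogate pair arithmetic. The function name to_utf16_units is my own, and the sketch assumes a valid code point outside the surrogate range.

```python
# Sketch: splitting a code point into 16-bit UTF-16 code units.
def to_utf16_units(cp):
    if cp < 0x10000:
        return [cp]                    # fits into a single 16-bit unit
    v = cp - 0x10000                   # 20 bits remain
    high = 0xD800 | (v >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)         # bottom 10 bits -> low surrogate
    return [high, low]

clef = 0x1D11E                         # musical G clef, above U+FFFF
print([f"{u:04X}" for u in to_utf16_units(clef)])

# Cross-check against Python's own UTF-16 encoder.
print(chr(clef).encode("utf-16-be").hex())
```

A 16-bit character type therefore sees two "characters" where the user sees one.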
A split into simple and Unicode-aware strings would divide code into two camps,
which would remain slightly incompatible or require dirty hacks. I’d
rather prolong the status quo, where Strings can be seen to contain
bytes in whatever encoding the user sees fit, but might break if used
with foreign code which has other notions of encoding.
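The kind of breakage meant here can be sketched in Python. The names data and wrong are illustrative only, and Latin-1 is an arbitrary pick for the "other notion of encoding".

```python
# Sketch: the same bytes read under two different assumptions.
data = "caf\u00e9".encode("utf-8")     # the producer writes UTF-8

right = data.decode("utf-8")           # consumer agrees: 4 characters
wrong = data.decode("latin-1")         # consumer assumes Latin-1: 5 characters

print(len(right), len(wrong))
print(wrong == "caf\u00c3\u00a9")      # e-acute became two unrelated characters
```

Nothing raises an error in the Latin-1 case, which is what makes this sort of mismatch easy to miss.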