On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Z. wrote:
Which code page? EBCDIC has as many code pages (including a UTF-EBCDIC)
as exist in other 8-bit encodings.
Obviously, EBCDIC -> UNICODE -> same EBCDIC Codepage as before.
Not to mention that Matz has explicitly stated in the past that he
character class, and that was Java’s main folly. (UCS-2 is a
strictly 16 bit per character encoding, but new Unicode standards
specify 21 bit characters, so they had to “extend” it).
Um. Do you mean UTF-32? Because there’s no binary representation of
Unicode Character Code Points that isn’t an encoding of some sort. If
that’s the case, it’s unacceptable as a memory representation.
Yes, I do mean the String interface to be UTF-32, or pure code
points, which is the same but less susceptible to standard changes, if
accessed at character level. If accessed at substring level, a
substring of a String is obviously a String, and you don’t need a
bitwise representation at all.
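The character-level/substring-level distinction can be sketched in modern Ruby (assumed here; this thread predates Ruby 1.9’s character-aware Strings): indexing a String yields a one-character String, not a byte, and no bitwise layout is visible.

```ruby
# Sketch, assuming modern Ruby: the user sees characters and
# substrings only; the internal byte representation never leaks.
s = "héllo"
s[1]            # a one-character String, not a byte value
s[1].codepoints # its pure code point: [0xE9]
s[0, 2]         # a substring of a String is itself a String
```

The same interface would hold whatever storage format sits underneath, which is the encapsulation argument being made above.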
According to my proposal, Strings do not need an encoding from the
String user’s point of view when working just with Strings, and users
won’t care apart from memory/performance consumption, which I believe
can be made good enough with a totally encapsulated, internal storage
format to be decided later. I will avoid a premature-optimization
debate here.
Of course encoding matters when Strings are read or written somewhere,
or converted to bit-/bytewise representation explicitly. The Encoding
Framework, however it’ll look, needs to be able to convert to and from
Unicode code points for these operations only, and not between
arbitrary encodings. (You may code this to recode directly from
the internal storage format for performance reasons, but that’ll be
transparent to the String user.)
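The “convert to and from Unicode code points only” idea can be sketched with modern Ruby’s String#encode (an assumption; the proposed Encoding Framework did not exist yet). Pivoting every conversion through Unicode means N encodings need N converters rather than N×N pairs:

```ruby
# Sketch: every recoding goes through a Unicode pivot (UTF-8 here),
# never directly between two legacy encodings.
def recode(bytes, from, to)
  bytes.dup.force_encoding(from).encode("UTF-8").encode(to)
end

latin1 = "caf\xE9".b                          # "café" as ISO-8859-1 bytes
utf8   = recode(latin1, "ISO-8859-1", "UTF-8")
utf8                                          # => "café"
```

A real implementation could shortcut the pivot internally for speed, as the paragraph above notes, without the String user ever seeing the difference.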
This breaks down for characters not represented in Unicode at all, and
is a nuisance for some characters affected by the Han Unification
issue. But Unicode set out to prevent exactly this, and if we
believe in Unicode at all, we can only hope they’ll fix this in an
upcoming revision. Meanwhile we could map any additional characters
(or sets of them) we need to higher, unused Unicode planes; that’ll be no
worse than having different, possibly incompatible kinds of Strings.
We’ll need an additional class for pure byte vectors, or just use
Array for this kind of work, and I think this is cleaner.
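The byte-vector idea can be sketched with a plain Array of integers (modern Ruby shown, as an assumption; Ruby later took the opposite route of tagging Strings with encodings instead):

```ruby
# Sketch: raw bytes as an Array of integers, keeping String for
# character data only.
bytes = "abc".bytes          # => [97, 98, 99], plain byte values
bytes << 0xFF                # arbitrary binary data, no encoding attached
packed = bytes.pack("C*")    # flatten back to a raw byte sequence for I/O
```

A dedicated byte-vector class could wrap this with I/O helpers, but the Array already gives the encoding-free semantics the paragraph asks for.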
Regarding Java, they switched from UCS-2 to UTF-16 (mostly). UCS-2 is
a pure 16 bit per character encoding and cannot represent codepoints
above 0xffff. UTF-16 works like UTF-8, but with 16 bit chunks. But
their abstraction of a single character, the class Char(acter), is
still only 16 bits wide, which leads to confusion and is similar to the C
type char, which cannot represent all real characters either. It is
even worse than in C, because C explicitly defines char to be a memory
cell of 8 bits or more, whereas Java really meant Char to be a
character class, and that was Java’s main folly.
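The surrogate mechanics behind that folly can be shown in modern Ruby (assumed here; the thread predates it): a code point above 0xFFFF cannot fit one UCS-2 unit, so UTF-16 spends two 16-bit units on the single character.

```ruby
# Sketch: U+1F600 is above 0xFFFF, so UTF-16 must encode it as a
# surrogate pair -- two 16-bit units for one character, which is
# exactly what a 16-bit Char(acter) type cannot represent.
cp   = 0x1F600
s    = [cp].pack("U")                      # one character
pair = s.encode("UTF-16BE").unpack("n*")   # its 16-bit UTF-16 units
# pair == [0xD83D, 0xDE00], a high/low surrogate pair
```

One character, two Java chars: any API counting in 16-bit units miscounts such characters.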
I am unaware of unsolvable problems with Unicode and Eastern
languages; I asked specifically about it. If you think Unicode is
unfixably flawed in this respect, I guess we all should write off
Unicode now rather than later? Can you detail why Unicode is
unacceptable as a single world wide unifying character set?
Especially, are there character sets which cannot be converted to
Unicode and back, which is the main requirement to have Unicode
Strings in a non-Unicode environment?
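The round-trip requirement stated here is mechanically checkable; a sketch using modern Ruby’s String#encode (an assumption, since the thread predates it):

```ruby
# Sketch: an encoding is safe for Unicode Strings if every valid
# byte string in it survives the trip through Unicode unchanged.
def round_trips?(raw, enc)
  src = raw.dup.force_encoding(enc)
  src.valid_encoding? && src.encode("UTF-8").encode(enc) == src
end

round_trips?("caf\xE9".b, "ISO-8859-1")   # ISO-8859-1 round-trips cleanly
```

Encodings that fail this test are precisely the ones that would need the high-plane mapping discussed earlier.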
Legacy data and performance.
Map legacy data, that is, characters still not in Unicode, to a high
plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them we can change that to the official
code points. Note there are no files in String’s internal storage
format, so we don’t have to worry about reencoding them.
I am not worried about performance; I’d code in C if I were.
For one, Moore’s law is at work and my whole proposal was for 2.0. My
proposal only adds a constant factor to String handling, it doesn’t
have higher order complexity.
On the other hand, conversions need to be done at other times with my
proposal than with M17N Strings, and it depends on the application
whether that is more or less often. String-String operations never need to do
recoding, as opposed to M17N Strings. I/O always needs conversion, and
may need conversion with M17N too. I have a hunch that allowing
different kinds of Strings around (as in M17N, presumably) would
require recoding far more often.
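That hunch can be illustrated with the m17n-style Strings Ruby eventually shipped (shown as an assumption; they postdate this thread): operations between Strings of incompatible encodings fail until one side is explicitly recoded.

```ruby
# Sketch: with per-String encodings, mixing incompatible Strings
# raises; a single internal code-point representation would make
# the concatenation just work.
utf8   = "café"
latin1 = "caf\xE9".b.force_encoding("ISO-8859-1")

begin
  utf8 + latin1                 # incompatible encodings
rescue Encoding::CompatibilityError
  # the operation itself forces a recoding decision on the user
end

utf8 + latin1.encode("UTF-8")   # works only after explicit recoding
```

Under the single-representation proposal, the rescue branch disappears: there is only one kind of String, so String-String operations never recode.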