Ruby 1.8 vs 1.9

On Wed, Nov 24, 2010 at 5:09 PM, Brian C. [email protected]
wrote:

Phillip G. wrote in post #963602:

Convert your strings to UTF-8 at all times, and you are done.

But that basically is my point. In order to make your program
comprehensible, you have to add extra incantations so that strings are
tagged as UTF-8 everywhere (e.g. when opening files).

However this in turn adds nothing to your program or its logic, apart
from preventing Ruby from raising exceptions.

s/apart from preventing Ruby from raising exceptions/but ensures
correctness of data across different systems/;
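
For concreteness, the “extra incantation” in question is a one-liner when
opening a file in 1.9 (a minimal sketch; the filename is made up):

File.open("comments.txt", "r:UTF-8") do |f|
  text = f.read
  text.encoding   # => #<Encoding:UTF-8>
end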

Maths and computation are not the same thing. Is there anything in the
above which applies only to Ruby and not to floating point computation
in any other mainstream programming language?

You conveniently left out that Ruby thinks dividing by 0.0 results in
infinity.
That’s not just wrong, but absurd to the extreme. So, we have to
safeguard against this. Just like having to take care of, say,
proper string encoding. If anyone is to blame, it’s ANSI and the
IT industry for having a) an extremely US-centric view of the world,
and b) being too damn shortsighted to create an international, capable
standard 30 years ago.

Further, you can’t do any computations without proper maths. In Ruby,
you can’t do computations since it cannot divide by zero properly, or
at least consistently.
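
The inconsistency in question, as a quick irb sketch:

1.0 / 0.0   # => Infinity, per IEEE 754 float semantics
1 / 0       # raises ZeroDivisionError for integers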

Yes, there are gotchas in floating point computation, as explained at
http://docs.sun.com/source/806-3568/ncg_goldberg.html
These are (or should be) well understood by programmers who feel they
need to use floating point numbers.

If you don’t like IEEE floating point, Ruby also offers BigDecimal and
Rational.

Which work really well with irrational numbers, which are neither large
decimals nor expressible as a fraction x/x_0.

In a nutshell, Ruby cannot deal with floating points at all, and the
IEEE standard is a means to represent floating point numbers in
bits. It does not supersede natural laws, much less rules that have
been in effect for hundreds of years.

And once the accuracy that the IEEE float represents isn’t good enough
anymore (which happens once you have to simulate a particle system),
you move away from scalar CPUs, and move to vector CPUs / APUs (like
the MMX and SSE instruction sets for desktops, or a GPGPU via CUDA).
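
For reference, the BigDecimal and Rational alternatives Brian mentions above
look roughly like this (a minimal 1.9 sketch; on 1.8 you would additionally
require 'rational'):

require 'bigdecimal'

BigDecimal("0.1") + BigDecimal("0.2")   # => 0.3 exactly, no binary rounding
Rational(1, 3) + Rational(1, 6)         # => (1/2)
0.1 + 0.2                               # => 0.30000000000000004 with plain Float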

If Ruby were to implement floating point following some different set of
rules other than IEEE, that would be (IMO) horrendous. The point of a
standard is that you only have to learn the gotchas once.

Um, no. A standard is a means to avoid misunderstandings, and have a
well-defined system dealing with what the standard defines. You know,
like exchange text data in a standard that can cover as many of the
world’s glyphs as possible.

And there is always room for improvement, otherwise I wonder why
engineers need Maple and mathematicians Mathematica.


Phillip G.

Though the folk I have met,
(Ah, how soon!) they forget
When I’ve moved on to some other place,
There may be one or two,
When I’ve played and passed through,
Who’ll remember my song or my face.

On Wednesday, November 24, 2010 05:14:15 am Brian C. wrote:

For example, an expression like

s1 = s2 + s3

where s2 and s3 are both Strings will always work and do the obvious
thing in 1.8, but in 1.9 it may raise an exception. Whether it does
depends not only on the encodings of s2 and s3 at that point, but also
their contents (properties “empty?” and “ascii_only?”)

In 1.8, if those strings aren’t in the same encoding, it will blindly
concatenate them as binary values, which may result in a corrupt and
nonsensical string.

It seems to me that the obvious thing is to raise an error when there’s
an error, instead of silently corrupting your data.
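
A sketch of the 1.9 behaviour being discussed (the string contents here are
made up):

s2 = "caf\xC3\xA9".force_encoding("UTF-8")   # "café" in UTF-8
s3 = "\xE9".force_encoding("ISO-8859-1")     # "é" in Latin-1

s2 + "abc"   # works: "abc" is ascii_only?, so the encodings are compatible
s2 + s3      # raises Encoding::CompatibilityError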

This means the same program with the same data may work on your machine,
but crash on someone else’s.

Better, again, than working on my machine, but corrupting on someone else’s.
At least if it crashes, hopefully there’s a bug report and even a fix before
it corrupts someone’s data, not after.

string19/string19.rb at master · candlerb/string19 · GitHub
string19/soapbox.rb at master · candlerb/string19 · GitHub

From your soapbox.rb:

  • Whether or not you can reason about whether your program works, you will
    want to test it. ‘Unit testing’ is generally done by running the code with
    some representative inputs, and checking if the output is what you expect.

    Again, with 1.8 and the simple line above, this was easy. Give it any two
    strings and you will have sufficient test coverage.

Nope. All that proves is that you can get a string back. It says nothing
about whether the resultant string makes sense.

More relevantly:

  • It solves a non-problem: how to write a program which can juggle multiple
    string segments all in different encodings simultaneously. How many
    programs do you write like that? And if you do, can’t you just have
    a wrapper object which holds the string and its encoding?

Let’s see… Pretty much every program, ever, particularly web apps. The
end-user submits something in the encoding of their choice. I may have to
convert it to store it in a database, at the very least. It may make more
sense to store it as whatever encoding it is, in which case, the simple act
of displaying two comments on a website involves exactly this sort of
concatenation.

Or maybe I pull from multiple web services. Something as simple and common
as a “trackback” would again involve concatenating multiple strings from
potentially different encodings.

  • It’s pretty much obsolete, given that the whole world is moving to UTF-8
    anyway. All a programming language needs is to let you handle UTF-8 and
    binary data, and for non-UTF-8 data you can transcode at the boundary.
    For stateful encodings you have to do this anyway.

Java at least did this sanely – UTF16 is at least a fixed width. If you’re
going to force a single encoding, why wouldn’t you use fixed-width strings?

Oh, that’s right – UTF16 wastes half your RAM when dealing with mostly ASCII
characters. So UTF-8 makes the most sense… in the US.

The whole point of having multiple encodings in the first place is that other
encodings make much more sense when you’re not in the US.

  • It’s ill-conceived. Knowing the encoding is sufficient to pick characters
    out of a string, but other operations (such as collation) depend on the
    locale. And in any case, the encoding and/or locale information is often
    carried out-of-band (think: HTTP; MIME E-mail; ASN1 tags), or within the
    string content (think: <?xml charset?>)

How does any of this help me once I’ve read the string?

  • It’s too stateful. If someone passes you a string, and you need to make
    it compatible with some other string (e.g. to concatenate it), then you
    need to force its encoding.

You only need to do this if the string was in the wrong encoding in the first
place. If I pass you a UTF-16 string, it’s not polite at all (whether you dup
it first or not) to just stick your fingers in your ears, go “la la la”, and
pretend it’s UTF-8 so you can concatenate it. The resultant string will be
neither, and I can’t imagine what it’d be useful for.
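
To make the distinction concrete (a 1.9 sketch): force_encoding merely
relabels the bytes, while encode actually transcodes them.

utf16 = "abc".encode("UTF-16LE")
utf16.dup.force_encoding("UTF-8")   # same bytes, now mislabelled: "a\x00b\x00c\x00"
utf16.encode("UTF-8")               # proper transcoding back to "abc"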

You do seem to have some legitimate complaints, but they are somewhat
undermined by the fact that you seem to want to pretend Unicode doesn’t
exist. As you noted:

“However I am quite possibly alone in my opinion. Whenever this pops up on
ruby-talk, and I speak out against it, there are two or three others who
speak out equally vociferously in favour. They tell me I am doing the
community a disservice by warning people away from 1.9.”

Warning people away from 1.9 entirely, and from character encoding in
particular, because of the problems you’ve pointed out, does seem incredibly
counterproductive. It’d make a lot more sense to try to fix the real problems
you’ve identified – if it really is “buggy as hell”, I imagine the ruby-core
people could use your help.

On Wed, Nov 24, 2010 at 8:02 PM, Josh C. [email protected]
wrote:

Its wrongness is an interpretation (I would also prefer that it just break,
but I can certainly see why some would say it should be infinity). And it
doesn’t apply only to Ruby:

It cannot be infinity. It does, quite literally, not compute. There’s
no room for interpretation, it’s a fact of (mathematical) life that
something divided by nothing has an undefined result. It doesn’t
matter if it’s 0, 0.0, or -0.0. Undefined is undefined.

That other languages have the same issue makes matters worse, not
better (but at least it is consistent, so there’s that).


Phillip G.

Though the folk I have met,
(Ah, how soon!) they forget
When I’ve moved on to some other place,
There may be one or two,
When I’ve played and passed through,
Who’ll remember my song or my face.

On Wed, Nov 24, 2010 at 12:20 PM, Phillip G. <
[email protected]> wrote:

Maths and computation are not the same thing. Is there anything in the
above which applies only to Ruby and not to floating point computation
in any other mainstream programming language?

You conveniently left out that Ruby thinks dividing by 0.0 results in
infinity.
That’s not just wrong, but absurd to the extreme.

Its wrongness is an interpretation (I would also prefer that it just break,
but I can certainly see why some would say it should be infinity). And it
doesn’t apply only to Ruby:

Java:
public class Infinity {
    public static void main(String[] args) {
        System.out.println(1.0/0.0); // prints "Infinity"
    }
}

JavaScript:
document.write(1.0/0.0) // prints "Infinity"

C:
#include <stdio.h>
int main() {
    printf("%f\n", 1.0/0.0); // prints "inf"
    return 0;
}
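
Ruby, for comparison, gives the same IEEE result:

puts 1.0/0.0   # prints "Infinity"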

Phillip G. wrote in post #963658:

It cannot be infinity. It does, quite literally, not compute. There’s
no room for interpretation, it’s a fact of (mathematical) life that
something divided by nothing has an undefined result. It doesn’t
matter if it’s 0, 0.0, or -0.0. Undefined is undefined.

It is perfectly reasonable, mathematically, to assign infinity to 1/0.
To geometers and topologists, infinity is just another point. Look up
the one-point compactification of R^n. If we join infinity to the real
line, we get a circle, topologically. Joining infinity to the real plane
gives a sphere, called the Riemann sphere. These are rigorous
definitions with useful results.

I’m glad that IEEE floating point has infinity included, otherwise I
would run into needless error handling. It’s not an error to reach one
pole of a sphere (the other pole being zero).

Infinity is there for good reason; its presence was well-considered by
the quite knowledgeable IEEE designers.

On Wed, Nov 24, 2010 at 1:16 PM, Phillip G. <
[email protected]> wrote:

matter if it’s 0, 0.0, or -0.0. Undefined is undefined.

From my Calculus book (goo.gl/D7PoI):

“by observing from the table of values and the graph of y = 1/x² in Figure 1,
that the values of 1/x² can be made arbitrarily large by taking x close
enough to 0. Thus the values of f(x) do not approach a number, so
lim_(x->0) 1/x² does not exist. To indicate this kind of behaviour we use the
notation lim_(x->0) 1/x² = ∞”

Since floats define infinity, regardless of its not being a number, it is
not “absurd to the extreme” to result in that value when doing floating
point math.

That other languages have the same issue makes matters worse, not
better (but at least it is consistent, so there’s that).

The question was “Is there anything in the above which applies only to Ruby
and not to floating point computation in any other mainstream programming
language?” The answer isn’t “other languages have the same issue”, it’s “no”.

On Wednesday, November 24, 2010 01:35:12 pm Josh C. wrote:

taking x close enough to 0. Thus the values of f(x) do not approach a
number, so lim_(x->0) 1/x² does not exist. To indicate this kind of
behaviour we use the notation lim_(x->0) 1/x² = ∞"

Specifically, the limit is denoted as infinity, which is not a real
number.

Since floats define infinity, regardless of its not being a number, it is
not “absurd to the extreme” to result in that value when doing floating
point math.

Ah, but it is, for two reasons:

First, floats represent real numbers. Having exceptions to that, like NaN or
Infinity, is pointless and confusing – it would be like making nil an
integer. And having float math produce something which isn’t a float doesn’t
really make sense.

Second, 1/0 is just undefined, not infinity. It’s the limit of 1/x as x goes
to 0 which is infinity. This only has meaning in the context of limits,
because limits are just describing behavior – all the limit says is that as x
gets arbitrarily close to 0, 1/x gets arbitrarily large, but you still can’t
actually divide x by 0.

They didn’t teach me that in Calculus; they’re teaching me that in proofs.

That other languages have the same issue makes matters worse, not
better (but at least it is consistent, so there’s that).

The question was “Is there anything in the above which applies only to Ruby
and not to floating point computation in any other mainstream programming
language?” The answer isn’t “other languages have the same issue”, it’s “no”.

I don’t know that there’s anything in the above that applies only to Ruby.
However, Ruby does a number of things differently, and arguably better, than
other languages – for example, Ruby’s integer types transmute into Bignum
rather than overflowing.
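
That promotion, as a quick irb sketch (class names as in 1.8/1.9):

(2 ** 10).class    # => Fixnum
(2 ** 100).class   # => Bignum, promoted automatically instead of overflowing
2 ** 100           # => 1267650600228229401496703205376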

Phillip G. wrote in post #963658:

It cannot be infinity. It does, quite literally, not compute. There’s
no room for interpretation, it’s a fact of (mathematical) life that
something divided by nothing has an undefined result. It doesn’t
matter if it’s 0, 0.0, or -0.0. Undefined is undefined.

That other languages have the same issue makes matters worse, not
better (but at least it is consistent, so there’s that).


Phillip G.

This is not even wrong.

From the definitive source:
Division by zero - Wikipedia

The IEEE floating-point standard, supported by almost all modern
floating-point units, specifies that every floating point arithmetic
operation, including division by zero, has a well-defined result. The
standard supports signed zero, as well as infinity and NaN (not a number).
There are two zeroes, +0 (positive zero) and −0 (negative zero) and this
removes any ambiguity when dividing. In IEEE 754 arithmetic, a ÷ +0 is
positive infinity when a is positive, negative infinity when a is negative,
and NaN when a = ±0. The infinity signs change when dividing by −0 instead.
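
For what it’s worth, Ruby’s Float exposes exactly these IEEE 754 results:

 1.0 /  0.0   # =>  Infinity
-1.0 /  0.0   # => -Infinity
 1.0 / -0.0   # => -Infinity
 0.0 /  0.0   # => NaN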

On Nov 24, 2010, at 8:40 PM, Jörg W Mittag wrote:

The only two Unicode encodings that are fixed-width are the obsolete
UCS-2 (which can only encode the lower 65536 codepoints) and UTF-32.

And even UTF-32 would have the complications of “combining characters.”

James Edward G. II

David M. wrote:

Java at least did this sanely – UTF16 is at least a fixed width. If you’re
going to force a single encoding, why wouldn’t you use fixed-width strings?

Actually, it’s not. It’s simply mathematically impossible, given that
there are more than 65536 Unicode codepoints. AFAIK, you need (at the
moment) at least 21 Bits to represent all Unicode codepoints. UTF-16
is not fixed-width, it encodes every Unicode codepoint as either one
or two UTF-16 “characters”, just like UTF-8 encodes every Unicode
codepoint as 1, 2, 3 or 4 octets.

The only two Unicode encodings that are fixed-width are the obsolete
UCS-2 (which can only encode the lower 65536 codepoints) and UTF-32.

You can produce corrupt strings and slice into a half-character in
Java just as you can in Ruby 1.8.
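
The Ruby 1.8 side of that hazard, as a sketch (assuming a UTF-8 source file;
1.9 with UTF-8 strings returns whole characters instead):

# encoding: utf-8
s = "é"      # two bytes in UTF-8: "\xC3\xA9"
s.length     # 1.8: 2 (bytes);                 1.9: 1 (character)
s[0, 1]      # 1.8: "\xC3", half a character;  1.9: "é"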

Oh, that’s right – UTF16 wastes half your RAM when dealing with mostly ASCII
characters. So UTF-8 makes the most sense… in the US.

Of course, that problem is even more pronounced with UTF-32.

German text blows up about 5%-10% when encoded in UTF-8 instead of
ISO8859-15. Arabic, Persian, Indian, and Asian text (which is, after all,
much more than European) is much worse. (E.g. Chinese blows up at least
50% when encoded in UTF-8 instead of Big5 or GB2312.) Given that the
current tendency is that devices actually get smaller, bandwidth gets
lower and latency gets higher, that’s simply not a price everybody is
willing to pay.
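
Measured in Ruby 1.9 terms, and assuming the ISO-8859-15 and Big5 transcoders
are available in your build, that difference looks like this:

# encoding: utf-8
"ü".encode("ISO-8859-15").bytesize   # => 1
"ü".encode("UTF-8").bytesize         # => 2
"中".encode("Big5").bytesize         # => 2
"中".encode("UTF-8").bytesize        # => 3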

The whole point of having multiple encodings in the first place is that other
encodings make much more sense when you’re not in the US.

There’s also a lot of legacy data, even within the US. On IBM systems,
the standard encoding, even for greenfield systems that are being
written right now, is still pretty much EBCDIC all the way.

There simply does not exist a single encoding which would be
appropriate for every case, not even the majority of cases. In fact,
I’m not even sure that there is even a single encoding which is
appropriate for a significant minority of cases.

We tried that One Encoding To Rule Them All in Java, and it was a
failure. We tried it again with a different encoding in Java 5, and it
was a failure. We tried it in .NET, and it was a failure. The Python
community is currently in the process of realizing it was a failure. 5
years of work on PHP 6 were completely destroyed because of this. (At
least they realized it before releasing it into the wild.)

And now there’s a push for a One Encoding To Rule Them All in Ruby 2.
That’s literally insane! (One definition of insanity is repeating
behavior and expecting a different outcome.)

jwm

On Wed, Nov 24, 2010 at 5:09 PM, Brian C. [email protected]
wrote:

Phillip G. wrote in post #963602:

Convert your strings to UTF-8 at all times, and you are done.

This may be true for the western world but I believe I remember one of
our Japanese friends stating that Unicode does not cover all Asian
character sets completely; it could have been a remark about Java’s
implementation of Unicode though, I am not 100% sure.

But that basically is my point. In order to make your program
comprehensible, you have to add extra incantations so that strings are
tagged as UTF-8 everywhere (e.g. when opening files).

However this in turn adds nothing to your program or its logic, apart
from preventing Ruby from raising exceptions.

Checking input and ensuring that data reaches the program in proper
ways is generally good practice for robust software. IMHO dealing
explicitly with encodings falls into the same area as checking whether
an integer entered by a user is strictly positive or a string is not
empty.

And I don’t think you have to do it for one-off scripts or when
working in your local environment only. So there is no effort
involved.

Brian, it seems you want to avoid the complex matter of i18n - by
ignoring it. But if you work in a situation where multiple encodings
are mixed you will be forced to deal with it - sooner or later. With
1.9 you get proper feedback while 1.8 may simply stop working at some
point - and you may not even notice it quickly enough to avoid damage.

Kind regards

robert

On Thu, Nov 25, 2010 at 10:45 AM, Robert K.
[email protected] wrote:

This may be true for the western world but I believe I remember one of
our Japanese friends stating that Unicode does not cover all Asian
character sets completely; it could have been a remark about Java’s
implementation of Unicode though, I am not 100% sure.

Since UTF-8 is a subset of UTF-16, which in turn is a subset of
UTF-32, and Unicode is future-proofed (at least, ISO learned from the
mess created in the 1950s to 1960s) so that new glyphs won’t ever
collide with existing glyphs, my point still stands. :wink:


Phillip G.

Though the folk I have met,
(Ah, how soon!) they forget
When I’ve moved on to some other place,
There may be one or two,
When I’ve played and passed through,
Who’ll remember my song or my face.

On Thu, Nov 25, 2010 at 2:05 AM, Adam Ms. [email protected] wrote:

This is not even wrong.

From the definitive source:
Division by zero - Wikipedia

For certain values of “definitive”, anyway.

The IEEE floating-point standard, supported by almost all modern
floating-point units, specifies that every floating point arithmetic
operation, including division by zero, has a well-defined result. The
standard supports signed zero, as well as infinity and NaN (not a number).
There are two zeroes, +0 (positive zero) and −0 (negative zero) and this
removes any ambiguity when dividing. In IEEE 754 arithmetic, a ÷ +0 is
positive infinity when a is positive, negative infinity when a is negative,
and NaN when a = ±0. The infinity signs change when dividing by −0 instead.

Yes, the IEEE 754 standard defines it that way.

The IEEE standard, however, does not define how mathematics work.
Mathematics does that. In math, x_0/0 is undefined. It is not
infinity (David kindly explained the difference between limits and
numbers), it is not negative infinity, it is undefined. Division by
zero cannot happen. If it could, we would be able to build, for
example, perpetual motion machines.

So, from a purely mathematical standpoint, the IEEE 754 standard is
wrong in treating the result of division by 0.0 any differently than
dividing by 0 (since floats are only different in their nature to
computers representing everything in binary [which cannot represent
floating point numbers at all, much less any given irrational
number]).


Phillip G.

Though the folk I have met,
(Ah, how soon!) they forget
When I’ve moved on to some other place,
There may be one or two,
When I’ve played and passed through,
Who’ll remember my song or my face.

On Thu, Nov 25, 2010 at 11:12 AM, Phillip G.
[email protected] wrote:

On Thu, Nov 25, 2010 at 10:45 AM, Robert K.
[email protected] wrote:

This may be true for the western world but I believe I remember one of
our Japanese friends stating that Unicode does not cover all Asian
character sets completely; it could have been a remark about Java’s
implementation of Unicode though, I am not 100% sure.

Since UTF-8 is a subset of UTF-16, which in turn is a subset of
UTF-32,

I tried to find a more precise statement about this but did not really
succeed. I thought all UTF-x were just different encoding forms of
the same universe of code points.

and Unicode is future-proofed

Oh, so then ISO committee actually has a time machine? Wow! :wink:

(at least, ISO learned from the
mess created in the 1950s to 1960s) so that new glyphs won’t ever
collide with existing glyphs, my point still stands. :wink:

Well, I support your point anyway. That was just meant as a caveat so
people are watchful (and test rather than believe). :slight_smile: But as I
think about it, it more likely was a statement about Java’s
implementation (because a char has only 16 bits which is not
sufficient for all Unicode code points).

Kind regards

robert

On Thu, Nov 25, 2010 at 1:37 PM, Phillip G.
[email protected] wrote:

It’s an implicit feature, rather than an explicit one:
Western languages get the first 8 bits for encoding. Glyphs going
beyond the Latin alphabet get the next 8 bits. If that isn’t enough, an
additional 16 bits are used for encoding purposes.

What bits are you talking about here, bits of code points or bits in
the encoding? It seems you are talking about bits of code points.
However, how these are put into any UTF-x encoding is a different
story and also because UTF-8 knows multibyte sequences it’s not
immediately clear whether UTF-8 can only hold a subset of what UTF-16
can hold.

Thus, UTF-8 is a subset of UTF-16 is a subset of UTF-32. Thus, also,
the future-proofing, in case even more glyphs are needed.

Quoting from RFC 3629 - UTF-8, a transformation format of ISO 10646

Char. number range   | UTF-8 octet sequence
(hexadecimal)        | (binary)
---------------------+---------------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So we have for code point encoding

7 bits
6 + 5 = 11 bits
2 * 6 + 4 = 16 bits
3 * 6 + 3 = 21 bits

This makes 2164864 (0x210880) possible code points in UTF-8. And the
pattern can be extended.

Looking at RFC 2781 - UTF-16, an encoding of ISO 10646 we see that
UTF-16 (at least this version) supports code points up to 0x10FFFF.
This is less than what UTF-8 can hold theoretically.

Coincidentally 0x10FFFF has 21 bits which is what fits into UTF-8.

I stay unconvinced that UTF-8 can handle only a subset of the code points
that UTF-16 can handle.

I also stay unconvinced that UTF-8 encodings are a subset of UTF-16
encodings. This cannot be true because in UTF-8 the encoding unit is
one octet, while in UTF-16 it’s two octets. As a practical example
the sequence “a” will have length 1 octet in UTF-8 (because it happens
to be an ASCII character) and length 2 octets in UTF-16.
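
That practical example, checked against 1.9:

"a".encode("UTF-8").bytesize     # => 1
"a".encode("UTF-16BE").bytesize  # => 2
"a".encode("UTF-32BE").bytesize  # => 4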

“All standard UCS encoding forms except UTF-8 have an encoding unit
larger than one octet, […]”

Of course, test your assumptions. But first, you need an assumption to
start from. :wink:

:slight_smile:

Cheers

robert

On Nov 25, 2010, at 6:37 AM, Phillip G. wrote:

Thus, UTF-8 is a subset of UTF-16 is a subset of UTF-32. Thus, also,
the future-proofing, in case even more glyphs are needed.

You are confusing us.

UTF-8, UTF-16, and UTF-32 are encodings of Unicode code points. They
are all capable of representing all code points. Nothing in this
discussion is a subset of anything else.
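
A small illustration of the distinction (Ruby 1.9, UTF-8 source): the same
code point, U+20AC (the euro sign), in the three encoding forms:

# encoding: utf-8
"€".unpack("U*")                   # => [8364]           (the code point)
"€".encode("UTF-8").bytes.to_a     # => [226, 130, 172]
"€".encode("UTF-16BE").bytes.to_a  # => [32, 172]
"€".encode("UTF-32BE").bytes.to_a  # => [0, 0, 32, 172]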

James Edward G. II

On Thu, Nov 25, 2010 at 12:56 PM, Robert K.
[email protected] wrote:

Since UTF-8 is a subset of UTF-16, which in turn is a subset of
UTF-32,

I tried to find a more precise statement about this but did not really
succeed. I thought all UTF-x were just different encoding forms of
the same universe of code points.

It’s an implicit feature, rather than an explicit one:
Western languages get the first 8 bits for encoding. Glyphs going
beyond the Latin alphabet get the next 8 bits. If that isn’t enough, an
additional 16 bits are used for encoding purposes.

Thus, UTF-8 is a subset of UTF-16 is a subset of UTF-32. Thus, also,
the future-proofing, in case even more glyphs are needed.

(at least, ISO learned from the
mess created in the 1950s to 1960s) so that new glyphs won’t ever
collide with existing glyphs, my point still stands. :wink:

Well, I support your point anyway. That was just meant as a caveat so
people are watchful (and test rather than believe). :slight_smile: But as I
think about it, it more likely was a statement about Java’s
implementation (because a char has only 16 bits which is not
sufficient for all Unicode code points).

Of course, test your assumptions. But first, you need an assumption to
start from. :wink:


Phillip G.

Though the folk I have met,
(Ah, how soon!) they forget
When I’ve moved on to some other place,
There may be one or two,
When I’ve played and passed through,
Who’ll remember my song or my face.

On Nov 25, 2010, at 5:56 AM, Robert K. wrote:

But as I think about it, it more likely was a statement about Java’s
implementation (because a char has only 16 bits which is not
sufficient for all Unicode code points).

I believe you are referring to the complaints the Asian cultures
sometimes raise against Unicode. If so, I’ll try to recap the issues,
as I understand them.

First, Unicode is a bit larger than their native encodings. Typically
they get everything they need into two bytes where Unicode requires more
for their languages.

The Unicode team also made some controversial decisions that affected
the Asian languages, like Han Unification
(Han unification - Wikipedia).

Finally, they have a lot of legacy data in their native encodings and
perfect conversion is sometimes tricky due to some context sensitive
issues.

I think the Asian cultures have warmed a bit to Unicode over time (my
opinion only), but it’s important to remember that adopting it involved
more challenges for them.

James Edward G. II

On Thu, Nov 25, 2010 at 3:07 PM, James Edward G. II
[email protected] wrote:

The Unicode team also made some controversial decisions that affected the Asian
languages, like Han Unification (Han unification - Wikipedia).

Finally, they have a lot of legacy data in their native encodings and perfect
conversion is sometimes tricky due to some context sensitive issues.

James, thanks for the summary. It is much appreciated.

I think the Asian cultures have warmed a bit to Unicode over time (my opinion
only), but it’s important to remember that adopting it involved more challenges
for them.

I believe that is in part due to our western ignorance. If we dealt
with encodings properly we would probably feel a similar pain -
at least it would cause more pain for us. I have frequently seen i18n
aspects being ignored (my pet peeve is time zones). Usually this
breaks your neck as soon as people from other cultures start using
your application - or when something as simple as a change of a
database server’s timezone happens, so that it differs from the
application server’s. :slight_smile:

Kind regards

robert

James,

On 2010-11-26 00:55, James Edward G. II wrote:

On Nov 25, 2010, at 6:37 AM, Phillip G. wrote:

Thus, UTF-8 is a subset of UTF-16 is a subset of UTF-32. Thus,
also, the future-proofing, in case even more glyphs are needed.

You are confusing us.

UTF-8, UTF-16, and UTF-32 are encodings of Unicode code points. They
are all capable of representing all code points. Nothing in this
discussion is a subset of anything else.

This is all really interesting but I don’t understand what you mean by
“code points” - is what you have said expressed diagrammatically
somewhere?

Thanks,

Phil.

Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]