On 09/01/06, dseverin [email protected] wrote:
Well, as far as I could tell from searching the web, since about 2001 or
even earlier, the question appears once in a while: why does Ruby not
support Unicode?
Ruby does support Unicode. It just doesn’t treat it specially.
Why can’t Ruby at least use the ICU libs?
It could, if you wrote a wrapper for them.
(The current state of UTF-8 in Ruby, even with regexps, is too far away
from proper Unicode support; don’t try to convince me that it’s OK and
enough, it is not!)
For 99% of cases, in fact, it is sufficient. What do you think is missing?
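For context, the 1.8-era UTF-8 support being argued over here was regexp-based: with $KCODE='u' (or the /u flag on a pattern), regexps walk UTF-8 characters rather than raw bytes. A minimal sketch (the /u flag is also accepted by later Rubies):

```ruby
# UTF-8-aware matching via the /u regexp flag: the pattern matches
# whole characters, not bytes, so multibyte letters stay intact.
s = "Grüße"          # 5 characters, 7 bytes in UTF-8
chars = s.scan(/./u) # split into characters rather than bytes
chars # => ["G", "r", "ü", "ß", "e"]
```

Byte-oriented operations such as String#length and indexing were untouched by $KCODE, which is the gap the rest of the thread argues about.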
And the usual answer is (for years!): m17n will be in Ruby 2.0 (Rite), as
Unicode can’t handle enough chars and Han unification is unacceptable.
That is not correct. m17n strings will be in Ruby 2.0, but it is not
because of “enough chars” (which wouldn’t be true in any case) or Han
unification. It is mostly because of legacy data.
As for me, there are two big problems:
- Ruby String class in its current state is TOO MUCH OVERLOADED: it
mixes byte-array and character-text string behaviour at the same time.
That is a definitely and absolutely wrong design decision. These are
different paradigms, which must never be mixed.
Sorry, but I don’t actually agree. There’s very little evidence that the
Ruby String mixes byte array and character string behaviour in a way
that matters most of the time. The only time it matters is when you
want to do str[0] and get just the first character, and you quickly
learn to do str[0, 1] instead. That is something that will be changing
with m17n strings, but it won’t be a big deal.
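To make the byte/character split concrete (a sketch; under 1.8 both String#length and str[0] were byte-oriented, which is exactly why the str[0, 1] idiom mattered):

```ruby
# One string, two views: "é" is one character but two UTF-8 bytes.
s = "résumé"
s.length      # 6 characters (1.8 counted bytes here and answered 8)
s.bytesize    # 8 bytes
s[0, 1]       # => "r", the first-character idiom mentioned above
s.bytes.first # => 114, the byte value 1.8's s[0] used to return
```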
- My impression about Rite m17n is that for each string it will be
possible to set a different encoding. I don’t get it.
That would suggest that you really haven’t done a lot of looking at
character set issues overall. Those of us who do have to deal with
legacy encodings will appreciate this.
As for a byte array, encoding is senseless - this is a plain bit stream.
And a String without an encoding will be treated just as a byte vector.
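As it turned out, this is roughly the shape the m17n design shipped in (Ruby 1.9, well after this thread): every String carries its own Encoding, and ASCII-8BIT (alias BINARY) plays the byte-vector role. A sketch using that eventual API:

```ruby
# Per-string encodings, with ASCII-8BIT ("BINARY") as the byte vector.
text = "résumé"                          # UTF-8 text, 6 characters
raw  = text.dup.force_encoding("BINARY") # reinterpret the same 8 bytes
text.length       # => 6, character semantics
raw.length        # => 8, byte semantics
raw.encoding.name # => "ASCII-8BIT"
```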
And for text - how will one compare/regexp/search using strings in
different encodings?
Generally, one wouldn’t want to. However, I’m sure that it would be
possible to upconvert or downconvert as appropriate for comparison. If
you have something in EUC-JP and need to compare it against SJIS, you
can convert from one to the other or convert both to UTF-16 for
comparison.
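A sketch of that convert-then-compare approach, using the String#encode API that later shipped in 1.9 (in the 1.8 era this would have gone through Iconv; any common target encoding works, UTF-16 included):

```ruby
# The same text in two legacy encodings: unequal as-is, equal once
# both sides are converted to a common encoding.
sjis = "日本語".encode("Shift_JIS")
euc  = "日本語".encode("EUC-JP")
sjis == euc                                 # false: bytes and encodings differ
sjis.encode("UTF-8") == euc.encode("UTF-8") # true after conversion
```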
(BTW, the Unicode codepoint space holds over a million codepoints - but
do we really have over a million different characters?) What is the
sense in creating text-handling support code for all that multitude of
encodings? (Look at Oniguruma - each encoding plugin sets its own
procedures and character properties to deal with multibyte encodings.)
*shrug* Welcome to the real world of encoding hell where we have to deal
with legacy data.
Well, I think the String class must be REMOVED from Rite. Instead, two
incompatible classes must be introduced: ByteArray and Text, with
well-separated semantics and behaviour. Otherwise it will never end, but
will eventually crash into crap ruins someday…
You’re welcome to submit an RCR on it. I am 99.999% certain it will be
shot down, though.
I would certainly oppose it. There are things in the design of Ruby 2.0
that I disagree with Matz on – and have told him so in discussions. The
m17n String, however, is one where I more than agree with him. It’s a
much better solution than I think you will find in most other languages.
Especially since, for most purposes, you as a Ruby programmer won’t care
one way or another.