Unicode roadmap?

rhaus · June 26, 2006, 7:01pm

On 6/26/06, Jim W. wrote:

descriptions of what will happen in the future, but that makes it
I think it would be a great idea to prototype these ideas in real code
to understand the advantages and disadvantages of each.

I mostly agree with you here (about prototyping), Jim. There are a few
things that I think can be done without working code. I often start from
this point in my own programs, anyway. I’ll try to address each of your
questions as I understand them. Hopefully, Matz or other participants
will step in and correct me where I’m wrong.

Before I get started, there are two orthogonal divisions here. The first
division is about the internal representation of a String. There is a
camp that very strongly believes that some Unicode encoding is the only
right way to internally represent String data. Sort of like Java’s
String without the mistake of char being UCS-2. The other camp strongly
believes that forcing a single universal encoding is a mistake for a
variety of reasons and would rather have an unencoded internal
representation with an interpretive encoding tag available. These two
camps can be referred to as UnicodeString and m17nString. I think that I
can be safely classified as in the m17nString camp – but there are
caveats to that which I will address in a moment.

The second division is about the suitability of a String as a
ByteVector. Some folks believe that the twain should never meet, others
believe that there’s little to meaningfully distinguish them in practice
and that the resulting API would be unnecessarily complex. I can safely
be classified in the latter camp.

There is an open question about how well the resulting String class
will work with various arcane features of Unicode such as combining
characters, RTL/LTR marks, etc. These are good questions.
Ultimately, I believe that the answer is that it should support them as
transparently as possible without (a) hiding too much and (b)
compromising support for multiple encodings.

Your first question:

How do I safely concatenate strings with potentially different
encodings?

This deals with the first division. Under the UnicodeString camp, you
would always be able to safely concatenate strings because they never
have a separate encoding. All incoming data would have to be classified
as binary or character data and the character data would have to be
converted from its incoming code page to the internal representation.

Under the m17nString camp, Matz has promised that compatible encodings
would work transparently. I have gone a little further and suggested
that we have a conversion mechanism similar to #coerce for Number
values. I could then combine text from Win1252 and SJIS to get a
Unicode result. Or, if I knew that my target could only handle SJIS, I
would force that to result in an error.
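
For illustration, here is a minimal sketch of that coercion idea written against the Encoding API that Ruby gained later (1.9 and up), which postdates this thread. The helper `concat_coerced` and its `target:` parameter are hypothetical names made up for the sketch; only `Encoding.compatible?` and `String#encode` are real Ruby calls.

```ruby
# Hypothetical helper, not part of Ruby or of any proposal in this thread.
# Concatenates two strings, transcoding both to a common target encoding
# only when Ruby itself considers them incompatible.
def concat_coerced(a, b, target: Encoding::UTF_8)
  return a + b if Encoding.compatible?(a, b)
  a.encode(target) + b.encode(target)
end

win  = "caf\u00e9".encode("Windows-1252")        # Latin text
sjis = "\u65e5\u672c\u8a9e".encode("Shift_JIS")  # Japanese text

# Default: fall back to a Unicode result.
mixed = concat_coerced(win, sjis)
puts mixed.encoding

# Forcing a target that cannot represent the data raises instead of guessing.
sjis_only_failed =
  begin
    concat_coerced(win, sjis, target: Encoding::Shift_JIS)
    false
  rescue Encoding::UndefinedConversionError => e
    puts "cannot coerce: #{e.message}"
    true
  end
```

The design choice shown is that the caller, not the String class, decides whether the fallback is Unicode or an error.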

Your second question:

How do I do I/O with encoded strings?

This also sort of deals with the first, but it also deals with the
second. Note, by the way, that the UnicodeString camp would require a
completely separate ByteArray class because you could not then read a
JPEG into a String – its values would be converted to Unicode
representations, rendering it unusable as a JPEG.

The two class (String/ByteArray) camp would probably require that you
either (1) change all IO operations using a pragma-style setting to
encoded strings, (2) change individual IO operations, (3) use a
separate API, or (4) read a ByteArray and convert it to a
UnicodeString. Either way, they seem to want an API where they can say
“read this IO and give me a UnicodeString as output” and conversely
“read this IO and give me a ByteArray as output.” (Note: this could
apply whether we have a UnicodeString or an m17nString – but the
requests have come most often from UnicodeString supporters.)

The one class camp keeps file IO as it is. You can “encourage” a
particular encoding with a variant of #2:

d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
d2 = File.open("file.txt", "rb") { |f|
  f.encoding = :utf8
  f.read
}

However, whether you use an encoding or not, you still get a String
back. Consider:

s1 = File.open("file.txt", "rb") { |f| f.read }
s2 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }

s1.class == s2.class # true
s1.encoding == s2.encoding # false

But that doesn’t mean I have to keep treating s1 as a raw data byte
array – or even convert it.

s1.encoding = :utf8
s1.encoding == s2.encoding # true

I think that the fundamental difference here is whether you view encoded
strings as fundamentally different objects, or whether you view the
encodings as lenses on how to interpret the object data. I prefer the
latter view.
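
The "lens" view can be tried out in Ruby as it later shipped. This is only a sketch under that assumption: `force_encoding` and the `"r:UTF-8"` open mode are the released equivalents of the `encoding=` setter and `encoding:` option used in the examples above, not the syntax proposed in this thread.

```ruby
require "tempfile"

# Write a few UTF-8 bytes to a scratch file.
file = Tempfile.new("lens")
file.binmode
file.write("na\u00efve")
file.close

s1 = File.open(file.path, "rb") { |f| f.read }       # no encoding requested
s2 = File.open(file.path, "r:UTF-8") { |f| f.read }  # encoding requested
file.unlink

same_class_before    = s1.class == s2.class          # both are String
same_encoding_before = s1.encoding == s2.encoding    # tags differ
bytes_before         = s1.bytes

# Change the lens only: the tag changes, the bytes do not.
s1.force_encoding(Encoding::UTF_8)

puts s1.encoding == s2.encoding
puts s1.bytes == bytes_before
puts s1 == s2
```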

-austin

rhaus · June 26, 2006, 7:49pm

On 6/26/06, Christian N. wrote:

Partly off-topic, but important nevertheless: Then it’s the right
time to drop that damn “rb” by making it default and let the people
stuck in the \r\n-age use :encoding => “win-ansi” or “dos” or whatever.

Oh, please, yes. I get tired of libraries breaking because people
don’t use “rb” and I’m on Windows.

-austin

rhaus · June 26, 2006, 8:39pm

On 6/26/06, Austin Z. wrote:

On 6/26/06, Jim W. wrote:

caveats to that which I will address in a moment.
Note that a fixed encoding UnicodeString has several caveats:

- you have only one encoding, and while it may be optimal in some
  respects it may be suboptimal in others. This leads to a split among
  UnicodeString supporters about which encoding to choose. m17n solves
  this neatly by allowing you to choose the encoding, at least for
  every application.
  - utf-8 - most likely encountered on io (especially network) = fewer
    conversions. Space efficient for languages using Latin script
  - utf-16 - sometimes encountered on io (file names on certain
    systems). Space efficient for most(?) other languages
  - utf-32 - fast indexing/slicing. Generally easier manipulation (but
    only inside the string class)
- you cannot use a non-unicode encoding, or even have both unicode and
  non-unicode (with characters outside of unicode) strings without
  changing the interpreter incompatibly

Another subdivision exists within the m17n camp about which strings are
compatible. The behavior in some other languages (which some find
unfortunate) is that strings with different encodings are incompatible
(i.e. operations on two strings always have to take strings with the
same encoding). In Matz’s current proposal the only improvement over
this is allowing a 7-bit ASCII string to be added to strings where this
makes sense (i.e. to ISO-8859-[12], cp85[02], utf-8).
The other position is to make strings coerce themselves
automatically if a lossless conversion exists (i.e. cp1251, cp852, and
iso-8859-2 should be the same set of characters ordered differently
iirc, and most character sets can be safely converted to utf-8). I
count myself in the autoconversion camp.
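
The "7-bit ASCII is compatible with everything ASCII-based" rule can be observed in Ruby as it later shipped. The sketch below assumes that released behaviour (Ruby 1.9 and up) and says nothing about how the proposal looked at the time of this thread.

```ruby
ascii  = "abc".encode("US-ASCII")        # 7-bit only
latin2 = "\u0159".encode("ISO-8859-2")   # one non-ASCII character
utf8   = "\u0159"                        # same character, UTF-8 bytes

# A 7-bit string combines with either one and takes on its encoding.
with_latin2 = ascii + latin2
with_utf8   = ascii + utf8
puts with_latin2.encoding
puts with_utf8.encoding

# Two non-ASCII strings in different encodings are not combined silently.
clash =
  begin
    latin2 + utf8
    false
  rescue Encoding::CompatibilityError
    true
  end
puts clash
```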

Yet another subdivision is about the exact meaning of string.encoding
= :utf8. It can either just change the tag or check that the string is
indeed a valid utf-8 character sequence. Matz thinks that without
checking, autoconversion would be too unreliable. I think that checking
would be good for debugging or when one wants to be paranoid. But the
ability to turn it off when I think (or find out) that my application
spends lots of time checking needlessly could be handy.
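
Both positions fit in one small helper. This is a sketch only: `tag_encoding` and its `check:` flag are hypothetical names, built on `force_encoding` and `valid_encoding?` from Ruby as it later shipped.

```ruby
# Hypothetical helper: retag a copy of the string, optionally verifying
# that the bytes really are valid in the new encoding.
def tag_encoding(str, encoding, check: true)
  tagged = str.dup.force_encoding(encoding)
  if check && !tagged.valid_encoding?
    raise ArgumentError, "bytes are not valid #{tagged.encoding}"
  end
  tagged
end

good = "\xE2\x82\xAC".b   # the three UTF-8 bytes of U+20AC
bad  = "\xFF\xFE".b       # never valid as UTF-8

euro = tag_encoding(good, "UTF-8")                  # checked, passes
sloppy = tag_encoding(bad, "UTF-8", check: false)   # unchecked, tag only

rejected =
  begin
    tag_encoding(bad, "UTF-8")                      # checked, fails
    false
  rescue ArgumentError
    true
  end
```

With `check: false` the call costs no scan of the bytes, which is the opt-out described above; the price is that `sloppy` carries a UTF-8 tag over bytes that are not UTF-8.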

Ultimately, I believe that the answer is that it should support them as
have a separate encoding. All incoming data would have to be classified
as binary or character data and the character data would have to be
converted from its incoming code page to the internal representation.

Under the m17nString camp, Matz has promised that compatible encodings
would work transparently. I have gone a little further and suggested
that we have a conversion mechanism similar to #coerce for Number
values. I could then combine text from Win1252 and SJIS to get a
Unicode result. Or, if I knew that my target could only handle SJIS, I
would force that to result in an error.

The answer also depends on what strings are compatible. If most
strings are incompatible, you would convert all strings and other data
structures you get from IO or external libraries to your chosen
encoding, and you will only concatenate strings with the same
encoding.
With autoconversion it will just work most of the time (i.e. when you
work with strings that can be converted to unicode).

Writing to streams that do not support all unicode characters is going
to be a problem most of the time (when you do not work in the output
encoding). Unless write attempts the conversion first, and only fails
when there are non-convertible characters.
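
A sketch of "attempt the conversion first, fail only on non-convertible characters", assuming the `String#encode` API from later Ruby releases. `write_converted` is a hypothetical helper name.

```ruby
require "tempfile"

# Hypothetical helper: transcode to the stream's encoding, then write.
# String#encode raises before any byte reaches the stream.
def write_converted(io, str, target)
  io.write(str.encode(target))
end

file = Tempfile.new("latin1")
file.binmode

# Convertible: every character exists in ISO-8859-1.
write_converted(file, "caf\u00e9", "ISO-8859-1")

# Not convertible: these characters have no ISO-8859-1 representation.
rejected =
  begin
    write_converted(file, "\u65e5\u672c", "ISO-8859-1")
    false
  rescue Encoding::UndefinedConversionError
    true
  end

file.close
written = File.binread(file.path)
file.unlink

puts written.bytes.inspect
puts rejected
```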

Your second question:

How do I do I/O with encoded strings?

…

However, whether you use an encoding or not, you still get a String

s1.encoding = :utf8
s1.encoding == s2.encoding # true

I think that the fundamental difference here is whether you view encoded
strings as fundamentally different objects, or whether you view the
encodings as lenses on how to interpret the object data. I prefer the
latter view.

If you consider
s3 = File.open('legacy.txt', 'rb', :iso885915) { |f| f.read }
without autoconversion you would have to immediately do
s3.recode :utf8
otherwise s1 + s3 would not work.

The same for stuff you get from database queries (unless you are sure
you always get the right encoding), text you get from the web, emails,
third party libraries, etc.

Thanks

Michal

rhaus · June 26, 2006, 9:28pm

On 26.6.2006, at 20:37, Michal S. wrote:

array – or even convert it.

If you consider
s3 = File.open('legacy.txt', 'rb', :iso885915) { |f| f.read }
without autoconversion you would have to immediately do
s3.recode :utf8
otherwise s1 + s3 would not work.

Yes. This shows that if there is no autoconversion, the programmer will
always need to recode to a common app encoding if the application is
to work without problems. And if we always need to recode strings
which we receive from third-party classes/libraries, encoding handling
will either consume half of the program lines or people won’t do it
and programs will be full of errors. As can be seen from experience
of other languages (and Ruby), the second option will prevail and we
will be in a mess not much better than today.

Therefore m17n without autoconversion (as in Matz’s current proposal)
gains us almost nothing. If we have no autoconversion, my vote goes
to Unicode internal encoding (because it implicitly handles
autoconversion problems).

On the topic of ByteArray: my concern is that the distinction between
bytes and characters will not be clear and therefore we need to
introduce ByteArray to separate bytes from characters, to ensure
reliability and predictability of code like
result = File.open( "file" ) { |f| f.read 1000 }
(now tell me what ‘result’ is?).

If there will be clear and simple rules, such as “IO always returns
binary strings if not given encoding parameter” then this distinction
will not need to be additionally enforced by separating classes. One
String class will do.

On the other hand, if there will be all kinds of automatic encoding
tagging for convenience of simple-script-writers, then we need
ByteArray to prevent error-prone code with undefined results.

izidor

rhaus · June 26, 2006, 9:47pm

On 6/26/06, Izidor J. wrote:

of other languages (and Ruby), the second option will prevail and we
will be in a mess not much better than today.

I doubt this is in the least bit true. The real problem is that you’re
trying to suggest a doomsday scenario based on what currently exists and
emotion. I’m saying that your cure is far worse than disease.

Therefore m17n without autoconversion (as in Matz’s current proposal)
gains us almost nothing. If we have no autoconversion, my vote goes to
Unicode internal encoding (because it implicitly handles
autoconversion problems).

So does the coercion proposal that I’ve made, without locking ourselves
into Unicode. If I have a thousand files that are Mojikyo-encoded, it
becomes very inefficient for me to work with them in Unicode and far
easier to work with Mojikyo directly.

I couldn’t make sense of your last paragraph.

-austin

rhaus · June 26, 2006, 10:21pm

On 26.6.2006, at 21:46, Austin Z. wrote:

and programs will be full of errors. As can be seen from experience
of other languages (and Ruby), the second option will prevail and we
will be in a mess not much better than today.

I doubt this is in the least bit true.
I’m saying that your cure is far worse than disease.

Basically, I am just advocating getting autoconversion into the
“official” proposal. I am not proposing unicode. But if there is no
autoconversion, unicode is better. This claim is supposed to get
support for autoconversion.

BTW, you may have no problems at all. We, on the other hand, have
lots of problems (in Ruby and other languages) which can be traced to
exactly this hope of “all programmers will be doing lots of manual
work to make things safe for others”. You are deluded.

In environments which already have this cure (internal unicode),
there are no such enormous problems as we experience in those without
this cure. So the successes and failures I describe are based on real
experience. Unlike your claims, which are just opinions.

I am not saying that unicode encoding is the ideal solution. But it
turned out to be quite a good one, and for sure much better than manual
checking/changing of encoding.

Therefore m17n without autoconversion (as in Matz’s current proposal)
gains us almost nothing. If we have no autoconversion, my vote goes to
Unicode internal encoding (because it implicitly handles
autoconversion problems).

So does the coercion proposal that I’ve made, without locking ourselves
into Unicode.

But that is your proposal (and mine and several others’), not Matz’s.
The current “official” proposal will make a mess.

I couldn’t make sense of your last paragraph.

Well, tell me what exactly do I get when this code executes:

result = File.open( "file ) { |f| f.read( 1000 ) }

What is ‘result’ ? Binary string under all circumstances? Or maybe
sometimes I get a String and sometimes I get a binary String? Which
one under what circumstances?

This is called error-prone code with undefined results.

We have two equally good options:

- If we change the API and IO returns ByteArray, we have no confusion.
- If we have clear and simple rules about IO returning Strings, we
  also have no confusion.

Therefore, if there will be complex auto-magic String tagging with
encoding, I prefer introducing ByteArray, because it will prevent
errors.

izidor

rhaus · June 26, 2006, 10:56pm

On 6/26/06, Austin Z. wrote:

So does the coercion proposal that I’ve made, without locking ourselves
into Unicode. If I have a thousand files that are Mojikyo-encoded, it
becomes very inefficient for me to work with it in Unicode and far
easier to work with Mojikyo directly.

Perhaps this debate should be weighing those encodings that could not
reasonably (or perhaps, easily) be represented in a pure-unicode String
versus those that could. Would it be reasonable to say that if 90% of
Ruby users would never have a pressing need for a non-unicode-encodable
String, then an uber-String that’s entirely encoding-agnostic would be
better written as an extension for those special cases? Do we really
need to encumber all of Ruby for the needs of a relative few?

rhaus · June 26, 2006, 10:49pm

On 6/26/06, Izidor J. wrote:

On 26.6.2006, at 21:46, Austin Z. wrote:

I doubt this is in the least bit true.
I’m saying that your cure is far worse than disease.
BTW, you may have no problems at all. We, on the other hand, have
lots of problems (in Ruby and other languages) which can be traced to
exactly this hope of “all programmers will be doing lots of manual
work to make things safe for others”. You are deluded.

Um. Not what I’m saying. I want as much clean autoconversion as
possible without being forced into it. But much more than that, I
want an API that works reasonably well with all sorts of encodings. I
want String#[] to work equally well with Mojikyo, ASCII, ISO-8859-12,
and UTF-8.

In environments which already have this cure (internal unicode),
there are no such enormous problems as we experience in those without
this cure. So the successes and failures I describe are based on real
experience. Unlike your claims, which are just opinions.

No, they’re not just opinions. They’re experiences that I’ve had with
real situations as well where we have a hard time dealing with
autoconversion. Stupid automatic behaviour is worse than manual
behaviour every time.

I couldn’t make sense of your last paragraph.
Well, tell me what exactly do I get when this code executes:

result = File.open( "file ) { |f| f.read( 1000 ) }

Aside from a syntax error from your missing quote?

This would probably be an unencoded String. If you want an encoded
String, you would specify it on the File object either during
construction or afterwards.
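
For comparison, this is how the question plays out in Ruby as it later shipped (1.9 and up), which is not necessarily what was being proposed here: a read with an explicit length hands back a binary-tagged String, while a whole-file read with a requested encoding hands back a String tagged with that encoding. The file name and contents below are made up for the sketch.

```ruby
require "tempfile"

file = Tempfile.new("sample")
file.binmode
file.write("gr\u00fc\u00df dich")
file.close

# Length given: the result is tagged as binary (ASCII-8BIT).
partial = File.open(file.path) { |f| f.read(1000) }

# Encoding requested on the File object: the result carries that tag.
whole = File.open(file.path, "r:UTF-8") { |f| f.read }
file.unlink

puts partial.encoding
puts whole.encoding
puts partial.bytes == whole.bytes
```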

The need for ByteArray is nonexistent.

-austin

rhaus · June 26, 2006, 11:16pm

On 26.6.2006, at 22:46, Austin Z. wrote:

This would probably be an unencoded String. If you want an encoded
String, you would specify it on the File object either during
construction or afterwards.

This seems too good to be true

How will e.g. Japanese (or we non-English Europeans), which now use
default $KCODE, write their Ruby scripts? Will we need to specify
encoding in every script for every IO? This can get cumbersome very
fast. Not really Ruby style.

But if there will be some default encoding, it will interfere with
said rules about return values. And that may cause errors when I run
script meant for some other default encoding.

This problem makes me think that rules won’t be so simple as
described now (actually, Matz said that this detail is not fixed yet).

We’ll see. I have just voiced my concerns about separation between
bytes and characters. Must wait for the master to present solution
(and hope he considers these problems)…

izidor

rhaus · June 26, 2006, 11:03pm

Austin Z. wrote:

Um. Not what I’m saying. I want as much clean autoconversion as […]

Clarification question: When you say autoconversion, do you mean:

(A) Automatically convert input strings to a given encoding (independent
of the question of a single vs multiple encodings).

(B) When combining strings, autoconvert incompatible encodings into
compatible encodings before combining.

I was thinking you meant (B), but I get the impression that Austin is
replying to (A) (since Austin’s coerce suggestion sounds a lot like
(B)).

Thanks.

– Jim W.

rhaus · June 26, 2006, 11:20pm

Thanks for the response, Austin. It seemed to help clarify the issues
(at least for me).

Austin Z. wrote:

d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }

Question: Does the encoding parameter specify the encoding of the file,
or the encoding of the strings you get back (my guess is both).

Related question: In environments that use a lot of different encodings,
are there ways or conventions for specifying the encoding, or do you
just have to “know”.

s1.encoding = :utf8

Another Question: When you set the encoding, are you:

(A) Just changing the encoding specifier without changing the
underlying string.

(B) Re-encoding the string according to the new encoding specifier.

(B) seems to be implied by the attribute notation, but that seems a bit
dangerous in my mind.

Thanks.

– Jim W.

rhaus · June 26, 2006, 11:22pm

On 26.6.2006, at 23:04, Jim W. wrote:

Clarification question: When you say autoconversion, do you mean:

(A) Automatically convert input strings to a given encoding
(independent of the question of a single vs multiple encodings).

(B) When combining strings, autoconvert incompatible encodings into
compatible encodings before combining.

Autoconversion (as suggested by many people in this thread) is meant
to convert a string in a compatible but different encoding to the
encoding of the other string (or a common compatible superset
encoding), to facilitate the operation using those two strings.

Point A is the great can of worms and source of errors, which I
suggested can be avoided by either:

- Very simple and strict rules on String encoding of return values
- Introduction of ByteArray as return values

izidor

rhaus · June 26, 2006, 11:44pm

On 6/26/06, Izidor J. wrote:

Ahem, no.
100% of Ruby language creators say that they need something better
than Unicode

And if we get both unicode and other stuff, there is no point in
discussing it, no?

Provided we get autoconversion, of course.

All due respect to matz and company and the wondrous thing they have
wrought, but nobody is perfect. Accepting a decision blindly based on
who is making it is a recipe for trouble. My only concern is that while
the proposed m17n implementation may make Ruby more perfect and more
ideal for at least one person, it may (emphasis on ‘may’) make it harder
for many thousands of others. Does that make sense? I’m sure there will
be those who argue that Ruby is matz’s creation and matz’s creation
alone, but there are a lot of people with a vested interest in “the Ruby
way”. A little critical analysis of the “benevolent dictator’s”
decisions is always prudent.

If we get unicode and it’s a lot harder than people like, or if it
causes unpleasant compatibility, portability, or interoperability
issues, then we’re no better off.

Hey, the uber-string m17n impl might be the most amazing, remarkable
thing ever to come along. It just seems, based on a lot of anecdotal
evidence, that this approach is very complex and very dangerous, and
arguably has never been done right yet. matz and company are amazing
hackers, but is it a good risk to take? Is it worth it for 10% of Ruby
users or less?

And again, I mean no disrespect by questioning the Ruby elders. It’s
just my way.

rhaus · June 26, 2006, 11:25pm

On 26.6.2006, at 22:55, Charles O Nutter wrote:

reasonably (or perhaps, easily) be represented in a pure-unicode String
versus those that could. Would it be reasonable to say that if 90% of
Ruby users would never have a pressing need for a non-unicode-encodable
String, then an uber-String that’s entirely encoding-agnostic would be
better written as an extension for those special cases?

Ahem, no.
100% of Ruby language creators say that they need something better
than Unicode

And if we get both unicode and other stuff, there is no point in
discussing it, no?

Provided we get autoconversion, of course.

izidor

rhaus · June 26, 2006, 11:47pm

On 6/26/06, Charles O Nutter wrote:

written as an extension for those special cases? Do we really need to
encumber all of Ruby for the needs of a relative few?

I do not believe that this is a viable argument for “killing”. At
best, this is an argument for making sure that Unicode support rocks
in Ruby. It doesn’t mean we need to make those “special” cases harder
than they need to be.

-austin

rhaus · June 26, 2006, 11:48pm

On Jun 26, 2006, at 2:15 PM, Izidor J. wrote:

How will e.g. Japanese (or we non-English Europeans), which now use
default $KCODE, write their Ruby scripts? Will we need to specify
encoding in every script for every IO? This can get cumbersome very
fast. Not really Ruby style.

I think that anyone, living in any country, working in any language,
who counts on one global variable to specify the encoding of any file
they might want to read, will very soon have lots of nasty
surprises. Ten years ago, you could do this; no longer. -Tim

rhaus · June 26, 2006, 11:54pm

On 6/26/06, Jim W. wrote:

Thanks for the response, Austin. It seemed to help clarify the issues
(at least for me).

Austin Z. wrote:

d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
Question: Does the encoding parameter specify the encoding of the file,
or the encoding of the strings you get back (my guess is both).

I would assume both, based on what I’ve seen from Matz.

Related question: In environments that use a lot of different encodings,
are there ways or conventions for specifying the encoding, or do you
just have to “know”.

In my experience, you just have to “know” unless you can do some
detection of the encoding. I think that only UTF-16 or UTF-32 is
really amenable to this. This is one of the problems that I’ve seen
with the encoding work that I’ve done. If I’m reading a list of files
from a NetWare server, what encoding is the data in? I don’t
necessarily have a Unicode interface – and my code page may not match
the server’s code page. Whenever you’re dealing with legacy data,
you have to “agree” or guess and hope you’re right.
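
The kind of detection that does work for UTF-16 and UTF-32 is sniffing a leading byte-order mark. Below is a minimal sketch; `sniff_bom` and the `BOMS` table are hypothetical names, and the UTF-8 signature is included only for completeness. Data without a BOM (which includes most legacy code pages) yields nil, which is the "you just have to know" case.

```ruby
# Hypothetical lookup table: byte-order marks, longest first so that the
# UTF-32LE mark (FF FE 00 00) is tried before the UTF-16LE mark (FF FE).
BOMS = {
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE,
  "\xEF\xBB\xBF".b     => Encoding::UTF_8,
  "\xFE\xFF".b         => Encoding::UTF_16BE,
  "\xFF\xFE".b         => Encoding::UTF_16LE,
}.freeze

# Hypothetical helper: returns the Encoding implied by a BOM, or nil.
def sniff_bom(bytes)
  data = bytes.b   # compare as raw bytes, ignoring any existing tag
  BOMS.each { |bom, encoding| return encoding if data.start_with?(bom) }
  nil
end

puts sniff_bom("\xFF\xFEh\x00i\x00".b).inspect   # UTF-16LE text with BOM
puts sniff_bom("\x00\x00\xFE\xFF".b).inspect     # UTF-32BE BOM only
puts sniff_bom("plain legacy bytes").inspect     # no BOM: cannot tell
```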

s1.encoding = :utf8
Another Question: When you set the encoding, are you:

(A) Just changing the encoding specifier without changing the
underlying string.
(B) Re-encoding the string according to the new encoding specifier.

(B) seems to be implied by the attribute notation, but that seems a bit
dangerous in my mind.

I personally consider it to be (A) because I believe that encoding is
a lens. If you want (B) it should be s1.recode(:utf8). But #recode
would not work on an encoding of “binary” (or “raw”); #recode would be
similar to the Iconv steps you would use today.
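
The difference between (A) and (B) is easy to see in Ruby as it later shipped, where `force_encoding` plays the role of the tag change and `encode` the role of `#recode`. This sketch assumes that released API, not the names discussed in this thread.

```ruby
latin1 = "caf\u00e9".encode("ISO-8859-1")        # bytes: 63 61 66 E9

# (A) Change the lens: same bytes, new tag. The bytes are not valid UTF-8.
retagged = latin1.dup.force_encoding("UTF-8")

# (B) Recode: same text, new bytes.
recoded = latin1.encode("UTF-8")

puts retagged.bytes == latin1.bytes
puts retagged.valid_encoding?
puts recoded.bytes.inspect
puts recoded == "caf\u00e9"

# Recoding from "binary" has no source interpretation to work from.
binary_failed =
  begin
    latin1.b.encode("UTF-8")
    false
  rescue Encoding::UndefinedConversionError
    true
  end
puts binary_failed
```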

-austin

rhaus · June 27, 2006, 1:55am

On 6/27/06, Austin Z. wrote:

construction or afterwards.

The need for ByteArray is nonexistent.

…or, to put that another way, when you see “unencoded String”, feel
free to say “ByteArray” in your head.

;D

rhaus · June 27, 2006, 3:59am

On 6/26/06, Mike S. wrote:

s[0]      # The first character
s[/./]    # The first character
s[byte:0] # The first byte (of a string with some non-ASCII-compatible encoding)

I kinda like that.

Presumably this is general arm waving, because s[/./] need not return
the first character of a non-empty string, unless you mean s[/./m] or
some uglier alternative.

I’m referring to s[byte: 0]. It’s elegant.
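
The `s[byte: 0]` form did not end up in Ruby as far as released versions go; the closest real calls are `getbyte` and `byteslice`. The sketch below uses those as stand-ins and also shows the `/./` versus `/./m` point raised in the quoted text.

```ruby
s = "\u00e9t\u00e9"            # three characters, five bytes in UTF-8

first_char  = s[0]             # character indexing
first_match = s[/./]           # first character via regexp
first_byte  = s.getbyte(0)     # integer value of the first byte
byte_slice  = s.byteslice(0, 1)  # one-byte String, no longer a whole character

puts first_char == first_match
puts first_byte
puts byte_slice.valid_encoding?

# /./ skips a leading newline; /./m does not.
t = "\nabc"
puts t[/./].inspect
puts t[/./m].inspect
```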

-austin

rhaus · June 27, 2006, 3:59am

On 6/26/06, Daniel B. wrote:

The need for ByteArray is nonexistent.
…or, to put that another way, when you see “unencoded String”, feel free to
say “ByteArray” in your head.

There’s a point where you’re right. But there’s a point where you’re
wrong. My point is simply that we don’t need a separate class for
this, because character encodings are ways of interpreting a vector
of bytes.

-austin