Unicode roadmap?

rhaus · June 17, 2006, 6:17pm

On Saturday 17 June 2006 16:58, [email protected] wrote:

Full ACK. Ruby programs shouldn’t need to care about the
when you say ‘internal encoding’ are you talking about the text
encoding of Ruby source code?

I’m not Juergen, but since you responded to my message…

First of all Unicode is a character set and UTF-8, UTF-16 etc.
are encodings, that is they specify how a Unicode character is
represented as a series of bits.

At least I am not talking about the encoding of Ruby source
code. The main point of the proposal is to use a single
universal character encoding for all Ruby character strings
(instances of the String class). Assuming there is an ideal
character set that is really sufficient to represent any
text in this world, it could be used to construct a String
class that abstracts the underlying representation completely
away.

Consider the “float” data type you will find in most
programming languages: The programmer doesn’t think in terms
of the bits that represent a floating point value. He just
uses the operators provided for floats. He can choose between
different serialization strategies if he needs to serialize
floats. But the operators on floats the programming language
provides don’t care about the different serialization formats,
they all work using the same internal representation.
Conversion is done on IO. Ideally, the same level of
abstraction should be there for character data.

If you have a universal character set (Unicode is an attempt
at this), and an encoding for it, the programming language can
abstract the underlying String representation away. For IO, it
provides methods (i.e. through Encoding objects) that
serialize Strings to a stream of bytes and vice versa.
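A minimal sketch of that boundary, assuming the String and Encoding API that shipped with Ruby 1.9 and later (which postdates this thread). The helper name `decode` is hypothetical; the point is only that the character count stays the same while the serialized byte count depends on the encoding chosen at the I/O edge.

```ruby
# Characters inside the program; bytes only at the I/O boundary.
text = "na\u00efve"   # "naive" with a diaeresis on the i, five characters

# Serializing: choose an encoding when the text leaves the program.
utf8_bytes  = text.encode(Encoding::UTF_8).bytes
utf16_bytes = text.encode(Encoding::UTF_16LE).bytes

# Deserializing: raw bytes plus a declared encoding become text again.
# `decode` is a hypothetical helper name, not a core method.
def decode(bytes, encoding)
  bytes.pack("C*").force_encoding(encoding).encode(Encoding::UTF_8)
end

puts text.length          # => 5
puts utf8_bytes.length    # => 6
puts utf16_bytes.length   # => 10
puts decode(utf16_bytes, Encoding::UTF_16LE) == text  # => true
```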

It seems to me that irrespective of any particular text encoding
scheme you need clean support of a simple byte vector data
structure completely unencumbered with any notion of text encoding
or locale.

I have proposed that further below as Buffer or ByteString.

Right now that is done by the String class, whose name I
think certainly creates much confusion. If the class had been
called Vector and then had methods like:

Vector#size # size in bytes
Vector#str_size # size in characters (encoding and locale considered)

By providing str_size you are already mixing up the purpose of
your simple byte vector and character strings.

rhaus · June 17, 2006, 5:00pm

On Jun 17, 2006, at 9:50 AM, Stefan L. wrote:

internal string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.

I don’t claim to be a Unicode expert but shouldn’t the goal be to
have Ruby work with any text encoding on a per-string basis? Why
would you want to force all strings into Unicode for example in a
context where you aren’t using Unicode? (The internal encoding has
to be…). And of course even in the Unicode world you have several
different encodings (UTF-8, UTF-16, and so on). Juergen, when you
say ‘internal encoding’ are you talking about the text encoding of
Ruby source code?

It seems to me that irrespective of any particular text encoding
scheme you need clean support of a simple byte vector data structure
completely unencumbered with any notion of text encoding or locale.
Right now that is done by the String class, whose name I think
certainly creates much confusion. If the class had been called
Vector and then had methods like:

Vector#size		# size in bytes
Vector#str_size 	# size in characters (encoding and locale considered)

I think this discussion would be clearer because it would be the
behavior of the str* methods that would need to understand text
encodings and/or locale settings while the underlying byte vector
methods remained oblivious. The #[] method is the most confusing
since sometimes you want to extract bytes and sometimes you want to
extract sub-strings (i.e. consider the encoding). One method, two
interpretations, bad headache.
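For illustration, here is how the byte view and the character view can sit side by side on one object. This is a sketch that assumes the String methods of Ruby 1.9 and later (`bytesize`, `byteslice`, `getbyte`), not anything that existed when this was written.

```ruby
s = "r\u00e9sum\u00e9"    # six characters, two of them non-ASCII (UTF-8)

# Byte-vector view
puts s.bytesize           # => 8
puts s.getbyte(1)         # => 195 (0xC3, first byte of the accented e)

# Character view
puts s.size               # => 6
puts s[1] == "\u00e9"     # => true, indexing counts characters

# Slicing by bytes can cut a character in half.
half = s.byteslice(1, 1)
puts half.valid_encoding? # => false
puts s.byteslice(1, 2) == "\u00e9"  # => true, both bytes of the character
```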

It seems that three distinct behaviors are being shoehorned (with
good reason) into a single class framework (String):

byte vector
text encoding (encoded sequence of code points)
locale (cultural interpretations of the encoded sequence of code points)

I’m just suggesting that these distinctions seem to be lost in much
of this discussion, especially for folks (like myself) who have a
practical interest in this but certainly aren’t text-encoding gurus.

Gary W.

rhaus · June 17, 2006, 6:38pm

On Jun 17, 2006, at 12:16 PM, Stefan L. wrote:

Assuming there is an ideal
character set that is really sufficient to represent any
text in this world, it could be used to construct a String
class that abstracts the underlying representation completely
away.

So all we need is an ideal character set? That sounds simple.

By providing str_size you are already mixing up the purpose of
your simple byte vector and character strings.

Yes. I was pointing out that there were multiple concerns that were
being solved by a single class and I said that there were good
reasons for this. My point was that even if you choose to handle all
those concerns in a single class it was important to keep the
concerns distinct during discussion. Something that I thought wasn’t
happening in this discussion.

I think this is another example of the Humane Interface discussion
started by Martin F. (Bliki
HumaneInterface.html)

In Ruby, arrays have an interface that allows them to be used as pure
arrays, as lists, as queues, as stacks and so on instead of having
lots of additional classes.
Similarly I think it makes sense for all M17N issues to be packaged
up in a single class (String) instead of breaking up those concerns
into a class hierarchy.

Gary W.

rhaus · June 17, 2006, 7:37pm

On Saturday 17 June 2006 16:16, Austin Z. wrote:

On 6/17/06, Stefan L. [email protected] wrote:

Full ACK. Ruby programs shouldn’t need to care about the
internal string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.

This is incorrect. Most Ruby programs won’t need to care about
the internal string encoding. Experience suggests, however, that it
is most. Definitely not all.

As long as one treats a character string as a character
string, the internal encoding is irrelevant, and as soon as a
decision for an internal string encoding is made, every
programmer can read in the docs “Ruby internally encodes
strings using the XYZ encoding”.

[…]

Unnecessarily complex and inflexible. Before you go too much
further, I really suggest that you look in the archives and
Google to find more about Matz’s m17n String proposal. It’s a
really good one, as it allows developers (both pure Ruby and
extension) to choose what is appropriate with the ability to
transparently convert as well.

I couldn’t find much (in English, I don’t understand
Japanese), do you have a link at hand?

[…]

already.

That is easy to handle with the proposed scheme: Read as much
as you need with the binary interface until you know the
encoding and then do the conversion of the byte buffer to
string. For file input, you can close the file when you have
determined the encoding and reopen it using the “normal”
(character oriented) interface.

Or do you mean Ruby should determine the encoding
automatically? IMO, that would be bad magic and error-prone.

[…]

If the strings are represented as a sequence of Unicode
codepoints, it is possible for external libraries to implement
more advanced Unicode operations.

This would be true regardless of the encoding.

But a conversion from [insert arbitrary encoding here] to
unicode codepoints would be needed.

This is better than doing a Fixnum representation. It is character
iteration, but each character is, itself, a String.

I wouldn’t mind additionally having:

str.codepoint_at(5)     => a Fixnum
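A rough sketch of such an accessor. `codepoint_at` is the hypothetical name from the line above, written here as a plain function; it assumes `String#ord` from Ruby 1.9 and later, and it returns an Integer (the Fixnum class has since been folded into Integer).

```ruby
# Hypothetical helper: code point of the character at a character index.
def codepoint_at(str, index)
  ch = str[index]     # a one-character String, or nil when out of range
  ch && ch.ord
end

s = "Ruby \u2665"     # index 5 holds U+2665 (a heart symbol)

puts codepoint_at(s, 0)             # => 82 (U+0052, "R")
puts codepoint_at(s, 5)             # => 9829 (U+2665)
puts codepoint_at(s, 99).inspect    # => nil
```

Note that `ord` reports a Unicode code point only when the string is in a Unicode encoding; for other encodings it reports that encoding's own code value.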

[…]

and the people claiming that Unicode is the Only Way … are wrong.

string of characters) representation. String operations
don’t need to be written for different encodings.

This is still (mostly) correct under the m17n String proposal.

How does the regular expression engine work then? And all
String methods that have to combine two or more strings in
some way?

[…]

Separation of concerns. I always found it strange that most
dynamic languages simply mix handling of character and arbitrary
binary data (just think of pack/unpack).

The separation makes things harder most of the time.

Why? In which cases?

[…]

It seems that the main argument against using Unicode strings in
Ruby is because Unicode doesn’t work well for eastern countries.
Perhaps there is another character set that works better that we
could use instead of Unicode. The important point here is that
there is only one representation of character data in Ruby.

This is a mistake.

OK, Unicode was enough for me until now, but I see that
Unicode is not enough for everyone.

If Unicode is chosen as character set, there is the question
which encoding to use internally. UTF-32 would be a good choice
with regards to simplicity in implementation, since each
codepoint takes a fixed number of bytes. Consider indexing of
Strings:

Yes, but this would be very hard on memory requirements. There are
people who are trying to get Ruby to fit into small-memory
environments. This would destroy any chance of that.

I can hardly believe that. There is still the binary IO
interface and ByteString that I proposed. And I still think
that the memory used for pure character data is a small
fraction of the overall memory consumption of typical Ruby
programs.

rhaus · June 17, 2006, 10:37pm

On Sun, Jun 18, 2006 at 01:16:12AM +0900, Stefan L. wrote:

several different encodings (UTF-8, UTF-16, and so on). Juergen,
code. The main point of the proposal is to use a single
universal character encoding for all Ruby character strings
(instances of the String class). Assuming there is an ideal
character set that is really sufficient to represent any
text in this world, it could be used to construct a String
class that abstracts the underlying representation completely
away.

That’s what I meant, yes. And that is the most important point too.

Jürgen

rhaus · June 17, 2006, 10:34pm

On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul B. wrote:

On 17/06/06, Austin Z. [email protected] wrote:

This ties Ruby’s String to Unicode. A safe choice IMHO, or would we
really consider something else? Note that we don’t commit to a
particular encoding of Unicode strongly.

This is a wash. I think that it’s better to leave the options open.
After all, it is a hope of mine to have Ruby running on iSeries
(AS/400) and that still uses EBCDIC.

AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right?

On the other hand, do you really trust all ruby library writers to
accept your strings tagged with EBCDIC encoding? Or do you look
forward to a lot of manual conversions?

Paul.

That’s why I explicitly stated it ties Ruby’s String class to Unicode
Character Code Points, but not to a particular Unicode encoding or
character class, and that was Java’s main folly. (UCS-2 is a
strictly 16 bit per character encoding, but new Unicode standards
specify 21 bit characters, so they had to “extend” it).
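The 16-bit limit can be seen directly. The sketch below assumes the transcoding API of Ruby 1.9 and later and uses U+1D11E (MUSICAL SYMBOL G CLEF) as a sample code point above U+FFFF: it is one code point, but UTF-16 has to spend two 16-bit units (a surrogate pair) on it.

```ruby
clef = "\u{1D11E}"    # one code point outside the 16-bit range

# Split the UTF-16 form into 16-bit big-endian units.
units = clef.encode(Encoding::UTF_16BE).unpack("n*")

puts clef.codepoints.length                           # => 1
puts units.length                                     # => 2
puts units.map { |u| format("%04X", u) }.join(" ")    # => D834 DD1E
puts clef.encode(Encoding::UTF_32BE).bytesize         # => 4
```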

I am unaware of unsolvable problems with Unicode and Eastern
languages, I asked specifically about it. If you think Unicode is
unfixably flawed in this respect, I guess we all should write off
Unicode now rather than later? Can you detail why Unicode is
unacceptable as a single world wide unifying character set?
Especially, are there character sets which cannot be converted to
Unicode and back, which is the main requirement to have Unicode
Strings in a non-Unicode environment?

Jürgen

rhaus · June 17, 2006, 11:02pm

On 17/06/06, Juergen S. [email protected] wrote:

I am unaware of unsolvable problems with Unicode and Eastern
languages, I asked specifically about it. If you think Unicode is
unfixably flawed in this respect, I guess we all should write off
Unicode now rather than later? Can you detail why Unicode is
unacceptable as a single world wide unifying character set?
Especially, are there character sets which cannot be converted to
Unicode and back, which is the main requirement to have Unicode
Strings in a non-Unicode environment?

They aren’t so much unsolvable problems as mutually incompatible
approaches. Unicode is concerned with the semantic meaning of a
character, and ignores glyph variations through the ‘Han unification’
process. TRON encoding doesn’t use Han unification: it encodes the
historically-same Chinese character differently for different
languages/regions where they are written differently today. Mojikyo
encodes each graphically distinct character differently and includes a
very wide range of historical characters, and is therefore
particularly suited to certain linguistic and literary niches.

In spite of this, I think that Unicode is an excellent choice for
everyday usage. Unicode does have a solution to the problem of
character variants, but it’s not a universal back end for all
encodings.

Incidentally, it is said that TRON is the world’s most widely-used
operating system, so supporting that encoding is not necessarily a
minor concern.

Paul.

rhaus · June 17, 2006, 11:51pm

On Sat, Jun 17, 2006 at 10:52:24PM +0900, Austin Z. wrote:

Strings should neither have an internal encoding tag, nor an
external one via $KCODE. The internal encoding should be encapsulated
by the string class completely, except for a few related classes which
may opt to work with the gory details for performance reasons.
The internal encoding has to be decided, probably between UTF-8,
UTF-16, and UTF-32 by the String class implementor.

Completely disagree. Matz has the right choice on this one. You can’t
think in just terms of a pure Ruby implementation – you must think
in terms of the Ruby/C interface for extensions as well.

I admit I don’t know about Ruby’s C extensions. Are they unable to
access String’s methods? That is all that is needed to work with them.

And since this String class does not have a parametric encoding
attribute, it should be easier to crunch in C even.

fails because your #2 is unacceptable.

Note that explicit conversion to characters, arrays, etc, is possible
for any supported character set and encoding. I have even given method
examples. “External” is to be seen in the context of the String class.

case folding, sorting, comparing etc.

Agreed, but this would be expected regardless of the actual encoding of
a String.

I am unaware of Matz’s exact plan. Any good english language links?

I was under the impression users of Matz’ String instances need to
look at the encoding tag to implement eg. #version_sort. If that is
not the case our proposals are not that much different, only Matz’ one
is even more complex to implement than mine.

tradeoff reasons which work transparently together (a bit like Fixnum
and Bignum).

Um. Disagree. Matz’s proposed approach does this; yours does not. Yours,
in fact, makes things much harder.

If Matz’s approach requires looking at the encoding tag from the
outside, it is not as transparent as mine. If it isn’t it just boils
down to a parametric class versus subclass hierarchy design decision,
and I don’t see much difference and would be happy with either one.

Because Strings are tightly integrated into the language with the
source reader and are used pervasively, much of this cannot be
provided by add-on libraries, even with open classes. Therefore the
need to have it in Ruby’s canonical String class. This will break some
old uses of String, but now is the right time for that.

“Now” isn’t; Ruby 2.0 is. Maybe Ruby 1.9.1.

My original title, somewhere snipped out, was “A Plan for Unicode
Strings in Ruby 2.0”. I don’t want to rush things or break 1.8 either.

The String class does not worry over character representation
on-screen, the mapping to glyphs must be done by UI frameworks or the
terminal attached to stdout.

The String class doesn’t worry about that now.

I was just playing safe here.

Be flexible.

And little is more flexible than Matz’s m17n String.

I’ve had flexibility with respect to Unicode Standards in mind, to not
fall into traps similar to Java’s. A simple-to-use String class,
powerful enough to include every character of the world, was my goal,
with the ability to convert to and from other external (from the
String class’s point of view) representations.

The flexibility to have parametric String encodings inside the String
class was not what I was going for, rather I would have that
inaccessible or at least unnecessary to access for the common String
user, and I provided a somewhat weaker but maybe still sufficient
technique via subclassing.

Remember: POLS is not an acceptable reason for anything. Matz’s m17n
Strings would be predictable, too. a + b would be possible if and only
if a and b are the same encoding or one of them is “raw” (which would
mean that the other is treated as the defined encoding) or there is a
built-in conversion for them.
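For comparison, the rule that eventually shipped in Ruby 1.9 and later behaves roughly as sketched here. The details differ from the wording above (there is no implicit conversion between two non-ASCII strings), so treat this as an illustration of the idea rather than of the 2006 proposal.

```ruby
utf8   = "caf\u00e9"                               # UTF-8, non-ASCII content
latin1 = "caf\u00e9".encode(Encoding::ISO_8859_1)  # same text, different bytes
ascii  = "abc".encode(Encoding::US_ASCII)

# ASCII-only content combines with any ASCII-compatible encoding.
mixed_ok = utf8 + ascii
puts mixed_ok.encoding                # => UTF-8

# Two non-ASCII strings in different encodings are refused, not guessed at.
begin
  utf8 + latin1
  raised = false
rescue Encoding::CompatibilityError
  raised = true
end
puts raised                           # => true
puts Encoding.compatible?(utf8, latin1).inspect   # => nil

# An explicit conversion resolves the mismatch.
joined = utf8 + latin1.encode(Encoding::UTF_8)
puts joined.length                    # => 8
```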

Since I probably cannot control which Strings I get from libraries,
and don’t want to worry which ones I’ll have to provide to them, this
is weaker than my approach in this respect, see my next point.

work for Ruby/C interfaced items. Sorry.

Please elaborate this or provide pointers. I cannot believe C cannot
crunch at my Strings, which are less parametric than Matz’s ones are.

whether it’s actually UTF-8 or not until I get HTTP headers – or
worse, a tag. Assuming UTF-8 reading in today’s world
is doomed to failure.

Read it as binary, and decide later. These problems should be locally
containable, and methods are still able to return Strings after
determining the encoding.
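A small sketch of "read as binary, decide later", assuming Ruby 1.9 or later. The sample document, the `charset='...'` declaration format, and the UTF-8 fallback are assumptions made up for the example; a StringIO stands in for the socket or file.

```ruby
require "stringio"

# Stand-in for network input: an HTML fragment encoded in ISO-8859-1.
raw_bytes = "<meta charset='ISO-8859-1'><p>caf\xE9</p>".b
io = StringIO.new(raw_bytes)

# Step 1: read with no text interpretation at all.
data = io.read.b

# Step 2: the declaration is plain ASCII, so it can be found in raw bytes.
declared = data[/charset='([^']+)'/, 1] || "UTF-8"

# Step 3: only now interpret the bytes and hand back an ordinary string.
text = data.force_encoding(declared).encode(Encoding::UTF_8)

puts declared                # => ISO-8859-1
puts text.encoding           # => UTF-8
puts text.valid_encoding?    # => true
```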

tags. Merely that they could. I suspect that there will be pragma-
like behaviours to enforce a particular internal representation at all
times.

Previously you stated users need to look at the encoding to determine
if simple operations like a + b work.

Can you point to more info? I am interested how this pragma stuff
works, and if not doing it “right” can break things.

Disadvantages (with mitigating reasoning of course)

String users need to learn that #byte_length(encoding=:utf8) >=
#size, but that’s not too hard, and applies everywhere. Users do not
need to learn about an encoding tag, which is surely worse to handle
for them.

True, but the encoding tag is not worse. Anyone who assumes that
developers can ignore encoding at any time simply doesn’t know about
the level of problems that can be encountered.

For String concatenates, substring access, search, etc, I expect to be
able to ignore encoding totally. Only when interfacing with
non-String-class objects (I/O and/or explicit conversion) would I need
encoding info.

Strings cannot be used as simple byte buffers any more. Either use
an array of bytes, or an optimized ByteBuffer class. If you need
regular expression support, Regexp can be extended for ByteBuffers or
even more.

I see no reason for this.

In my proposal, Unicode Strings cannot represent arbitrary binary data
in their internal representation, since not everything would be valid
characters. In fact, you cannot set the internal representation
directly.

The interface could accept a code point sequence of values
(0…255), but that would be wasteful compared to an array of bytes.

Some String operations may perform worse than a naive user might
expect, in both the time and space domains. But we do this so the
String user doesn’t need to himself, and we are probably better at it
than the user too.

This is a wash.

Only trying to refute weak arguments in advance.

For very simple uses of String, there might be unnecessary
conversions. If a String is just to be passed through somewhere,
without inspecting or modifying it at all, in- and outwards conversion
will still take place. You could and should use a ByteBuffer to avoid
this.

This is a wash.

Not a big problem either, but someone was bound to bring it up.

users really do get unexpected foreign characters in their Strings. I
concluded case folding. I think it is more than that: we are lazy and
understood this could be handled by future Unicode revisions

The way I see it we have to choose a character set. I proposed
Unicode, because their official goal is to be the one unifying set,
and if they ain’t yet, I hope they’ll be sometime.

If that is not enough we will effectively create our own character
set, let’s call it RubyCode, which will contain characters from the
union of Unicode and a few other sets. Each String will have a
particular encoding, which will determine which characters of RubyCode
are valid in this particular String instance. Hopefully many
characters will be valid in multiple encodings. But it doesn’t sound
like a very clear design to me.

Jürgen

rhaus · June 18, 2006, 12:16am

On 6/17/06, Stefan L. [email protected] wrote:

internal string encoding is made, every programmer can read in the
docs “Ruby internally encodes strings using the XYZ encoding”.

And I’m saying that it’s a mistake to do that (standardize on a single
encoding). Every programmer will instead be able to read:

“Ruby supports encoded strings in a variety of encodings. The
default behaviour for all strings is XYZ, but this can be
changed and individual strings may be recoded for performance
or compatibility reasons.”

Language and character encodings are hard. Hiding that fact is a
mistake. That doesn’t mean we have to make the APIs difficult, but that
we aren’t going to be buzzworded into compliance, either.

[…]

Unnecessarily complex and inflexible. Before you go too much further,
I really suggest that you look in the archives and Google to find
more about Matz’s m17n String proposal. It’s a really good one, as it
allows developers (both pure Ruby and extension) to choose what is
appropriate with the ability to transparently convert as well.

I couldn’t find much (in English, I don’t understand Japanese), do you
have a link at hand?

I do not. I’ve been reading about this, talking about this, and
discussing it with Matz for the last two years or so, and I’ve been
dealing with Unicode and other character encoding issues extensively at
work. However, the gist of it is that every String is still a byte
vector. Each string will also have an encoding flag. Substrings of a
single character width will always return the String required for the
character. The supported encodings will probably start with UTF-8,
UTF-16, various ISO-8859-* encodings, EUC-JP, SJIS, and other Asian
encodings.
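Roughly what "a byte vector plus an encoding flag" looks like in practice. This sketch uses the API that later shipped in Ruby 1.9 and onward, so it illustrates the gist rather than the proposal as it stood at the time.

```ruby
# UTF-8 bytes of two hiragana characters (U+3042, U+3044).
bytes = [0xE3, 0x81, 0x82, 0xE3, 0x81, 0x84]

as_binary = bytes.pack("C*")                                  # no text meaning
as_utf8   = bytes.pack("C*").force_encoding(Encoding::UTF_8)  # same bytes, tagged

puts as_utf8.bytes == as_binary.bytes   # => true, identical storage
puts as_binary.length                   # => 6
puts as_utf8.length                     # => 2

# A one-character substring is itself a String, not a number.
first = as_utf8[0]
puts first.class                        # => String
puts first == "\u3042"                  # => true
```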

Or do you mean Ruby should determine the encoding automatically? IMO,
that would be bad magic and error-prone.

I mean that what you’re suggesting exposes problems with encoding
stuff extensively and unnecessarily. I certainly wouldn’t want to
program in it if the API involved were as stupid as you’re suggesting it
should be.

[…]

If the strings are represented as a sequence of Unicode codepoints,
it is possible for external libraries to implement more advanced
Unicode operations.

This would be true regardless of the encoding.

But a conversion from [insert arbitrary encoding here] to unicode
codepoints would be needed.

Why? What if the library that I’m interfacing with requires EUC-JP?
Sorry, but Unicode is not necessarily the right answer.

This is better than doing a Fixnum representation. It is character
iteration, but each character is, itself, a String.

I wouldn’t mind additionally having:

str.codepoint_at(5)     => a Fixnum

Since Ruby isn’t only using Unicode, this isn’t necessarily going to
be possible or meaningful.

[…]

There is only one internal string (where string means a
string of characters) representation. String operations
don’t need to be written for different encodings.

This is still (mostly) correct under the m17n String proposal.

How does the regular expression engine work then? And all
String methods that have to combine two or more strings in
some way?

Matz will have that figured and detailed before he starts writing it.

[…]

Separation of concerns. I always found it strange that most
dynamic languages simply mix handling of character and arbitrary
binary data (just think of pack/unpack).

The separation makes things harder most of the time.

Why? In which cases?

In reality, the separation is not nearly as clean as people who
advocate such separations would like to pretend. It’s less of a problem
in dynamic languages like Ruby, but it’s also far less necessary in
dynamic languages like Ruby. I have found it far more useful to not have
to care whether I’m reading a binary or string value. I despise dealing
with C++ and Java where I am forced to care because of stupid API
design.

[…]

It seems that the main argument against using Unicode strings in
Ruby is because Unicode doesn’t work well for eastern countries.
Perhaps there is another character set that works better that we
could use instead of Unicode. The important point here is that there
is only one representation of character data in Ruby.

This is a mistake.

OK, Unicode was enough for me until now, but I see that Unicode is not
enough for everyone.

Thank you. Unicode needs to – will – work very well. I know enough
about Unicode handling to make sure that what I deal with will. But I
have come to believe that choosing a single encoding as your String
representation is a mistake, even if it means making your job harder by
defining and implementing rules for mixed-encoding handling.

consumption of typical Ruby programs.

I can believe it; it’s very domain and program specific, but you’ve just
proposed multiplying the memory usage of that amount of space by four.
(Rails would suffer terribly under your proposal to use UTF-32.)

-austin

rhaus · June 17, 2006, 11:57pm

On 6/17/06, Stefan L. [email protected] wrote:

As long as one treats a character string as a character
string, the internal encoding is irrelevant, and as soon as a

No, it is not.

First for reasons of efficiency. If an application is going to perform
lots of slicing and poking on strings it will want some encoding that
is suitable for that, such as UTF-32. If an application runs on a system
with little memory it will want a space-efficient encoding (i.e. UTF-8 or
UTF-16 for Asian languages). And if an application runs on a system that
uses some legacy codepage it can read, write, and process all strings
in that codepage. And in JRuby it will be useful to convert strings to
UTF-16 so that the native Java functions can be used for manipulation.
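The space trade-off can be put in numbers. This is a sketch assuming the transcoding API of Ruby 1.9 and later; the two sample strings and the helper name `sizes` are made up for the example.

```ruby
# Hypothetical helper: byte size of the same text in three encodings.
def sizes(text)
  %w[UTF-8 UTF-16LE UTF-32LE].map { |name| [name, text.encode(name).bytesize] }.to_h
end

latin    = "hello world"                                        # 11 characters
japanese = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"   # 8 characters

latin_sizes    = sizes(latin)
japanese_sizes = sizes(japanese)

puts latin_sizes.inspect      # UTF-8: 11, UTF-16LE: 22, UTF-32LE: 44
puts japanese_sizes.inspect   # UTF-8: 24, UTF-16LE: 16, UTF-32LE: 32
```

For ASCII text UTF-32 costs four times what UTF-8 does; for this Japanese sample UTF-16 is the most compact of the three.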

Second, not all characters are equal. If you lived in world where
everything was Unicode you would be fine. But it is not so. Unicode is
suboptimal for encoding CJK characters. So some people might want to
use another encoding for their texts (iirc TRON mentioned earlier is
one of such encodings). In your model you can modify Ruby to use
strings composed of TRON characters instead of Unicode characters. But
how would Unicode Ruby and TRON Ruby exchange strings?
And how would you write an application that handles both TRON and
Unicode? (I suspect TRON would not be much good ie for Runic script)
Such an application has to be written very carefully because neither
character set would be subset of the other so it is not possible
converting strings forth and back without thinking. But in your model
such application is not possible at all.

decision for an internal string encoding is made, every
programmer can read in the docs “Ruby internally encodes
strings using the XYZ encoding”.

[…]

I indicated to Juergen, sometimes impossible to determine the
encoding to be used for an IO until you have some data from the IO
already.

That is easy to handle with the proposed scheme: Read as much
as you need with the binary interface until you know the
encoding and then do the conversion of the byte buffer to
string. For file input, you can close the file when you have
determined the encoding and reopen it using the “normal”
(character oriented) interface.

Why reopening or converting if you can simply tag a string that you
had to read anyway?

Or do you mean Ruby should determine the encoding
automatically? IMO, that would be bad magic and error-prone.

No. But if you read part of html/xml document before the encoding was
specified there is no reason why that part has to be converted or
reread. You apparently got it right if you were able to determine the
encoding from what you read.
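The difference between tagging and converting, sketched with the methods that later appeared in Ruby 1.9 (`force_encoding` relabels the existing bytes, `encode` produces new ones). The four sample bytes are an assumption standing in for data already read from the stream.

```ruby
# Bytes already read from the stream: "caf" plus 0xE9 (e-acute in ISO-8859-1).
received = [0x63, 0x61, 0x66, 0xE9].pack("C*")

# Tagging: same bytes, now interpreted as ISO-8859-1. Nothing is reread.
tagged = received.dup.force_encoding(Encoding::ISO_8859_1)

# Converting: a new byte sequence in another encoding.
converted = tagged.encode(Encoding::UTF_8)

puts tagged.bytes == received.bytes   # => true
puts tagged.length                    # => 4
puts converted.bytes.inspect          # => [99, 97, 102, 195, 169]
puts converted.length                 # => 4
```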

[…]

If the strings are represented as a sequence of Unicode
codepoints, it is possible for external libraries to implement
more advanced Unicode operations.

This would be true regardless of the encoding.

But a conversion from [insert arbitrary encoding here] to
unicode codepoints would be needed.

That will be needed anyway. You cannot expect all libraries to use the
arbitrary encoding you chose for Ruby strings.

But if you can choose the encoding of your strings there is nothing
stopping you from converting your strings so that they best suit your
library of choice.

There is only one internal string (where string means a
string of characters) representation. String operations
don’t need to be written for different encodings.

This is still (mostly) correct under the m17n String proposal.

How does the regular expression engine work then? And all
String methods that have to combine two or more strings in
some way?

If they are both subset of Unicode I see no problem with converting
both to Unicode. If they are incompatible things may break. But that
is because of real incompatibility, not because of some restriction of
the approach.
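A sketch of that strategy, assuming Ruby 1.9 or later: bring both operands to a common Unicode encoding before combining or matching, and let a genuine incompatibility surface as an error. `concat_via_utf8` is a hypothetical helper name.

```ruby
latin1 = "caf\u00e9".encode(Encoding::ISO_8859_1)
sjis   = "\u65e5\u672c".encode(Encoding::Shift_JIS)

# Hypothetical helper: transcode every operand to UTF-8, then join.
def concat_via_utf8(*strings)
  strings.map { |s| s.encode(Encoding::UTF_8) }.join
end

combined = concat_via_utf8(latin1, sjis)
puts combined.length                  # => 6

# One regular expression now works across text from both sources.
han = combined[/\p{Han}+/]
puts han.length                       # => 2

# Real incompatibility: ISO-8859-1 has no slot for a Han character.
begin
  "\u65e5".encode(Encoding::ISO_8859_1)
  failed = false
rescue Encoding::UndefinedConversionError
  failed = true
end
puts failed                           # => true
```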

[…]

Separation of concerns. I always found it strange that most
dynamic languages simply mix handling of character and arbitrary
binary data (just think of pack/unpack).

The separation makes things harder most of the time.

Why? In which cases?

Such as when you have to read the start of an HTML page as ByteBuffer
and then convert it to String once you determine the encoding.
Especially if string operations do not exist on the ByteBuffer to
allow parsing it.

I can hardly believe that. There is still the binary IO
interface and ByteString that I proposed. And I still think
that the memory used for pure character data is a small
fraction of the overall memory consumption of typical Ruby
programs.

It depends on the program. For programs that do only text processing
the portion of memory taken by text may be large.

Michal

rhaus · June 18, 2006, 12:22am

On 6/17/06, Juergen S. [email protected] wrote:

On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul B. wrote:

On 17/06/06, Austin Z. [email protected] wrote:

This ties Ruby’s String to Unicode. A safe choice IMHO, or would
we really consider something else? Note that we don’t commit to a
particular encoding of Unicode strongly.

This is a wash. I think that it’s better to leave the options open.
After all, it is a hope of mine to have Ruby running on iSeries
(AS/400) and that still uses EBCDIC.

AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right?

Which code page? EBCDIC has as many code pages (including a UTF-EBCDIC)
as exist in other 8-bit encodings.

On the other hand, do you really trust all ruby library writers to
accept your strings tagged with EBCDIC encoding? Or do you look
forward to a lot of manual conversions?

It depends on the purpose of the library. Very few libraries end up
using byte vectors for strings or completely treat them as such. I would
expect that some of the libraries that I’ve written would work without
any problems in EBCDIC.

Character Code Points, but not to a particular Unicode encoding or
character class, and that was Java’s main folly. (UCS-2 is a
strictly 16 bit per character encoding, but new Unicode standards
specify 21 bit characters, so they had to “extend” it).

Um. Do you mean UTF-32? Because there’s no binary representation of
Unicode Character Code Points that isn’t an encoding of some sort. If
that’s the case, that’s unacceptable from a memory representation.

I am unaware of unsolvable problems with Unicode and Eastern
languages, I asked specifically about it. If you think Unicode is
unfixably flawed in this respect, I guess we all should write off
Unicode now rather than later? Can you detail why Unicode is
unacceptable as a single world wide unifying character set?
Especially, are there character sets which cannot be converted to
Unicode and back, which is the main requirement to have Unicode
Strings in a non-Unicode environment?

Legacy data and performance.

-austin

rhaus · June 18, 2006, 12:49am

On 6/17/06, Juergen S. wrote:

On Sat, Jun 17, 2006 at 10:52:24PM +0900, Austin Z. wrote:

On 6/17/06, Juergen S. wrote:

mean that the other is treated as the defined encoding) or there is a
built-in conversion for them.

Since I probably cannot control which Strings I get from libraries,
and don’t want to worry which ones I’ll have to provide to them, this
is weaker than my approach in this respect, see my next point.

It’s apparent from the explanation above.
You do not have to look at string encoding or worry which encoding
they are as long as they are compatible (ie iso-8859-1 and utf-8) -
there is a conversion for them. The string methods have to use
(internally) the encoding tag, and you can look if you are interested.
If the strings are incompatible it is a real problem. Not one created
by the implementation but one originating from the fact that the
strings cannot be automatically converted from one encoding to another.
But you can keep all your strings, even if they are in several
incompatible encodings. You are not limited to using just one
encoding.
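
For reference, the String that later shipped in Ruby 1.9 and up tags each string with an encoding along these lines. A minimal sketch follows; `concat_converting` and `incompatible?` are hypothetical helper names, not part of Ruby. Note one difference from the description above: Ruby does not convert between ISO-8859-1 and UTF-8 on its own, so the conversion here is explicit.

```ruby
# Hypothetical helper: concatenate two strings, converting the second
# to the first one's encoding when the encoding tags differ.
def concat_converting(a, b)
  b = b.encode(a.encoding) unless a.encoding == b.encoding
  a + b
end

# Hypothetical helper: true when Ruby would refuse to combine the two
# strings directly (Encoding.compatible? returns nil in that case).
def incompatible?(a, b)
  Encoding.compatible?(a, b).nil?
end

utf8   = "caf\u00E9"                 # tagged UTF-8, 5 bytes
latin1 = utf8.encode("ISO-8859-1")   # tagged ISO-8859-1, 4 bytes

# Both strings contain a non-ASCII character and carry different tags.
puts incompatible?(utf8, latin1)

# An ASCII-only string combines with any ASCII-compatible encoding.
puts incompatible?(utf8, "abc".encode("ISO-8859-1"))

# After an explicit conversion the concatenation is well defined.
puts concat_converting(utf8, latin1)
```

Without the conversion, `utf8 + latin1` raises `Encoding::CompatibilityError`, which is the "real problem" case: the strings are kept as they are, but they cannot be mixed silently.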

Michal

rhaus · June 18, 2006, 12:25am

On Jun 17, 2006, at 5:48 PM, Juergen S. wrote:

The way I see it we have to choose a character set.

What leads you to this conclusion? I don’t think it can be refuted
that there exists today an almost endless number of character sets
and text encodings in use. I don’t understand why the core facilities
of a language should be intimately tied to any one of those
representations. Once you do that you’ve decided that all other
representations are second class citizens. Why not have the language
be agnostic about these things but still provide a coherent framework
for building libraries and applications that can be locale and
encoding-aware?

Gary W.

rhaus · June 18, 2006, 1:20am

On 18-jun-2006, at 0:21, Austin Z. wrote:

Legacy data and performance.
Yes, you will spend those cycles to count the letters in my language
RIGHT :-)) (evil grin)
It’s actually the most common case when apps damage strings in my
language - their authors wanted to be smart
and conserve. And yes, normalization etc. is complex and you DO
need to have a case-conversion table in memory. Please do have one
(Ruby doesn’t).

No offense, just observation.

rhaus · June 18, 2006, 2:18am

On Saturday 17 June 2006 23:55, Michal S. wrote:

On 6/17/06, Stefan L. wrote:
[…]
And if an application runs on a system that uses some legacy codepage
it can read, write, and process all strings in that codepage. And
in JRuby it will be useful to convert strings to UTF-16 so that the
native Java functions can be used for manipulation.

If you really need this level of efficiency, Ruby is probably
the wrong language anyway. Regarding JRuby: Of course each
implementation would be free to choose an internal Unicode
encoding. If somebody has enough time and motivation he can
even implement support for multiple encodings and let the user
choose at build-time.

[…]

Or do you mean Ruby should determine the encoding
automatically? IMO, that would be bad magic and error-prone.

No. But if you read part of html/xml document before the encoding
was specified there is no reason why that part has to be converted
or reread. You apparently got it right if you were able to
determine the encoding from what you read.

The conversion would be done anyway, iff a single internal
encoding was chosen and iff the encoding of the input doesn’t
match the internal encoding.

That will be needed anyway. You cannot expect all libraries to use
the arbitrary encoding you chose for Ruby strings.

I assume you mean C libraries here.

rhaus · June 18, 2006, 1:17am

On 17-jun-2006, at 23:55, Michal S. wrote:

First for reasons of efficiency. If an application is going to perform
lots of slicing and poking on strings it will want some encoding that
is suitable for that such as UTF-32.
I would much rather prefer UTF-8 in a language such as Ruby which is
often used as glue between
other systems. UTF-8 is used for interchange and it’s indisputable.
If you go for UTF-16 or UTF-32, you are most likely
to convert every single character of text files you read (in text
files present in the wild AFAIK UTF-16 and UTF-32 are a minority,
thanks to the BOM and other setbacks).

If an application runs on a system
with little memory it will want space-efficient encoding (ie UTF-8 or
UTF-16 for Asian languages). And if an application runs on a system that
uses some legacy codepage it can read, write, and process all strings
in that codepage. And in JRuby it will be useful to convert strings to
UTF-16 so that the native Java functions can be used for manipulation.

In your model you can modify Ruby to use
strings composed of TRON characters instead of Unicode characters. But
how would Unicode Ruby and TRON Ruby exchange strings?

I think Alan Little summed it up very well. The problem with Unicode
in Ruby is the striving for perfection
(i.e. satisfying the users of every conceivable or needed encoding).
It’s very noble, and I personally can’t imagine it
(even with the “democratic coerce” approach Austin cited). The only
thing I don’t know is whether a system with this type of handling can
be built at all and how it will interoperate.

Up until now all scripting languages I used somewhat (Perl, Python,
Ruby) allowed all encodings in strings and doing Unicode in them hurts.

Bluntly put, I am selfish and I don’t believe in the “saving grace”
of the M17N (because I just can’t wrap it around my head and I sure
as hell know it’s going to be VERY complex).
It’s also something that bothers me the most about Ruby’s “unicode
discussions” (I’ve read all of them on this list dating back to 2002
because I need it to work NOW) and they
always transcend into this kind of religious discussion in the spirit
of “but your encoding is not good enough”, “but my bad encoding isn’t
that one and I still need it to work” etc.

While for me the greatest thing about Unicode is that it’s Just Good
Enough. And it doesn’t seem Unicode is indeed THAT useless for CJK
languages either
(although I’m sure Paul can correct me - all the 4 languages I am in
control of use only 2 scripting systems with some odd additions here
and there).

And no, I didn’t have a chance to see a TRON system in the wild. If
someone would show me one within 200 km distance I would be glad to
take a look.

rhaus · June 18, 2006, 4:38am

On 6/17/06, Julian ‘Julik’ Tarkhanov wrote:

Yes, you will spend those cycles to count the letters in my language
RIGHT :-)) (evil grin) It’s actually the most common case when apps
damage strings in my language - their authors wanted to be smart and
conserve. And yes, normalization etc. is complex and you DO need to
have a case-conversion table in memory. Please do have one (Ruby
doesn’t).

I think you’re overthinking the problem. Let’s consider the guarantees
that an m17n String would make:

#size and #length would return the number of glyphs
#[] would return glyphs

Presumably, in Regexen with an m17n String, \w would indicate only
“word” glyphs. Other guarantees would be made along that line.

Therefore, if your input data is UTF-8, anything that deals with #size,
#length, and character-based indexing will just work. The same will
apply to SJIS or any other encoding. The number of times that people are
dealing with mixed-encoding data is vanishingly small, and even when
a developer must, they will probably use a Unicode encoding to
deal with that. But if you’re using SJIS, you’re just going to want to use
that.

That’s what the m17n String is about. It’s not about dictating a single
encoding, but enabling people to use Strings intelligently.
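
For what it is worth, this is roughly how String behaves in Ruby 1.9 and later. A small sketch, assuming a Ruby build with the standard Japanese transcoders; `counts` is a hypothetical helper name. One caveat: Ruby counts code points, which is not always the same thing as glyphs.

```ruby
# Hypothetical helper: character count, byte count, and third character
# of the same text stored under a given encoding.
def counts(text, encoding_name)
  s = text.encode(encoding_name)
  { chars: s.length, bytes: s.bytesize, third: s[2].encode("UTF-8") }
end

text = "\u65E5\u672C\u8A9E"   # three kanji, written with escapes

# #length and #[] work per character regardless of the encoding tag;
# only the byte count differs.
p counts(text, "UTF-8")
p counts(text, "Shift_JIS")
p counts(text, "EUC-JP")
```

The character count and the indexed character are the same in all three cases; only `bytes` changes (9 for UTF-8, 6 for the two legacy encodings).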

No offense, just observation.

I agree – we need full Unicode support. But not at the cost of legacy
code pages in favour of Unicode. It’s not always appropriate.

-austin

rhaus · June 18, 2006, 6:19am

On Jun 17, 2006, at 4:08 AM, Juergen S. wrote:

Strings should deal in characters (code points in Unicode) and not
in bytes, and the public interface should reflect this.

Be careful. People who care about this stuff might want to read
Character Model for the World Wide Web 1.0: Fundamentals. It turns out that
characters do not correspond one-to-one with units of sound, or units
of input, or units of display. Except for low-level stuff like
regexps, it’s very difficult to write any code that goes character-at-
a-time that doesn’t contain horrible i18n bugs. For practical
purposes, a String is a more useful basic tool than a character.
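
A small illustration of the gap between code points and units of display, as a sketch only. It assumes Ruby 2.5 or later for `String#grapheme_clusters`; `describe` is a hypothetical helper name.

```ruby
# Hypothetical helper: three different ways of measuring one string.
def describe(s)
  {
    code_points: s.length,
    graphemes:   s.grapheme_clusters.length,
    bytes:       s.bytesize
  }
end

decomposed  = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
precomposed = "\u00E9"    # the single code point for the same letter

p describe(decomposed)    # one unit on screen, two code points
p describe(precomposed)   # one unit on screen, one code point

# Code-point-at-a-time code treats the two spellings as different text ...
p decomposed == precomposed
# ... unless both sides are normalized first.
p decomposed.unicode_normalize(:nfc) == precomposed

# Reversing code point by code point detaches the accent from its letter.
p decomposed.reverse == "\u0301e"
```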

Since the String class is quite smart already, it can implement
generally useful and hard (in the domain of Unicode) operations like
case folding, sorting, comparing etc.

Be careful. Case folding is a horrible can of worms, is rarely
implemented correctly, and when it is (the Java library tries really
hard) is insanely expensive. The reason is that case conversion is
not only language-sensitive but jurisdiction sensitive (in some
respects different in France & Québec). Trying to do case-folding on
text that is not known to be ASCII is likely a symptom of a bug.
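
Two small cases that show why, sketched with the case-mapping options Ruby gained in 2.4 (later than this thread). `lower` is a hypothetical helper name.

```ruby
# Hypothetical helper: lowercase with or without Turkic rules.
# The caller has to know the language; the string itself does not say.
def lower(text, turkic: false)
  turkic ? text.downcase(:turkic) : text.downcase
end

# The same capital "I" lowercases to a different letter in Turkish
# (dotless i, U+0131) than in English.
p lower("I")
p lower("I", turkic: true)

# Case conversion can also change the length of a string:
# German sharp s (U+00DF) uppercases to two letters.
p "Stra\u00DFe".upcase
```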

This ties Ruby’s String to Unicode. A safe choice IMHO, or would we
really consider something else? Note that we don’t commit to a
particular encoding of Unicode strongly.

For information: The XML view is that Shift-JIS, KOI8-R, EBCDIC, and
many others are all encodings of Unicode and a best effort should be
made to accept and emit all sane encodings on demand. Most XML
software sticks to a single encoding, internally.

-Tim

rhaus · June 18, 2006, 6:36am

On Jun 17, 2006, at 6:52 AM, Austin Z. wrote:

The internal encoding has to be decided, probably between UTF-8,
UTF-16, and UTF-32 by the String class implementor.

Completely disagree. Matz has the right choice on this one. You can’t
think in just terms of a pure Ruby implementation – you must think
in terms of the Ruby/C interface for extensions as well.

Point of information: Of all the widely-used methods of encoding
international strings, UTF-8 is by far the easiest to deal with in C.

Trust me on this
one: I have done some low-level encoding work. Additionally, even
though I might have marked a network object as “UTF-8”, I may not know
whether it’s actually UTF-8 or not until

That’s an incredibly important point in a networked world. One of
the reasons XML has had so much success, probably more than it
deserves, is that its encoding is self-descriptive. To quote Larry
Wall: “An XML document knows what encoding it’s in.” Since HTTP
headers are (sigh) known to be wrong on occasion, this is a pretty
big value-add.
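
A rough sketch of what "knows what encoding it's in" buys you. `declared_encoding` is a hypothetical helper, and it is deliberately simplified: it only looks at the XML declaration, ignores byte order marks and UTF-16/UTF-32 input, and falls back to UTF-8 when no encoding is declared.

```ruby
# Hypothetical helper: read the encoding name out of an XML declaration.
# Works on the raw bytes, so it does not need to know the encoding first.
def declared_encoding(raw)
  head = raw.b[0, 200]   # binary copy of the first bytes only
  m = head.match(/\A<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']/)
  m ? m[1] : "UTF-8"
end

doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>Caf\xE9</name>".b

# Tag the bytes with what the document says about itself,
# whatever an HTTP header may have claimed.
text = doc.dup.force_encoding(declared_encoding(doc))
puts text.encoding
puts text.encode("UTF-8")
```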

This ties Ruby’s String to Unicode. A safe choice IMHO, or would we
really consider something else? Note that we don’t commit to a
particular encoding of Unicode strongly.

This is a wash. I think that it’s better to leave the options open.
After all, it is a hope of mine to have Ruby running on iSeries
(AS/400) and that still uses EBCDIC.

EBCDIC is in fact an encoding of Unicode. Just saying that it’s
necessary to be clear both as to what character set is being
supported, and what limitations on encoding are enforced.

-Tim

rhaus · June 18, 2006, 6:29am

On Jun 17, 2006, at 6:50 AM, Stefan L. wrote:

It seems that the main argument against using Unicode strings
in Ruby is because Unicode doesn’t work well for eastern
countries.

Point of information: there are highly successful word-processing
products selling well in countries whose writing systems include Han
characters, which internally use Unicode. So while the Han-
unification problems have been much discussed and are regarded as
important by people who are not fools, in fact there is existence
proof that Unicode does work well enough for wide deployment in
commercial software.

If Unicode is chosen as character set, there is the
question which encoding to use internally. UTF-32 would be a
good choice with regards to simplicity in implementation,

UTF-32 has a practical problem in that in C code, you can’t use strcmp()
and friends because it’s full of null bytes. Of course if you’re
careful to code everything using wchar_t you’ll be OK, but lots of
code isn’t. (UTF-8 doesn’t have this problem and is much more compact).
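
The null bytes are easy to see from Ruby itself. A small sketch; `zero_bytes` is a hypothetical helper name.

```ruby
# Hypothetical helper: how many zero bytes does the text contain
# when stored in the given encoding?
def zero_bytes(text, encoding_name)
  text.encode(encoding_name).bytes.count(0)
end

# Every ASCII character carries three zero bytes in UTF-32,
# so a C function that stops at the first NUL sees at most one character.
p "abc".encode("UTF-32LE").bytes
p zero_bytes("abc", "UTF-32LE")

# UTF-8 produces a zero byte only for U+0000 itself.
p zero_bytes("abc", "UTF-8")
p zero_bytes("caf\u00E9", "UTF-8")
```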

Consider
indexing of Strings:
    "some string"[4]
If UTF-32 is used, this operation can internally be
implemented as a simple, constant array lookup. If UTF-16 or
UTF-8 is used, this is not possible to implement as an array

Correct. But in practice this seems not to be too huge a problem,
since in practice text is most often accessed sequentially. The
times that you really need true random access to the N’th character
are rare enough that for some problems, the advantages of UTF-8 are
big enough to compensate for this problem. Note that in a variable-
length character encoding, there’s no trouble whatever with a table
of pointers into text; the only problem is when you need to find
the Nth character cheaply.
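
A sketch of the "table of pointers" idea, assuming a UTF-8 string: one sequential pass records where each character starts, after which any character can be fetched by byte offset. `char_offsets` and `char_at` are hypothetical helper names.

```ruby
# Hypothetical helper: byte offset at which each character starts.
# Built in one sequential pass over the string.
def char_offsets(s)
  offsets = []
  pos = 0
  s.each_char do |ch|
    offsets << pos
    pos += ch.bytesize
  end
  offsets
end

# Hypothetical helper: fetch the n-th character through the offset table,
# without rescanning the string from the beginning.
def char_at(s, offsets, n)
  start = offsets[n]
  stop  = offsets[n + 1] || s.bytesize
  s.byteslice(start, stop - start)
end

s = "caf\u00E9!"
table = char_offsets(s)

p table                 # the accented letter takes two bytes
p char_at(s, table, 3)
p char_at(s, table, 4)
```

The table costs memory and has to be rebuilt when the string changes, which is the trade-off against a fixed-width encoding.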

An advantage of using UTF-8 would be that for pure ASCII files
no conversion would be necessary for IO.

Be careful. There are almost no pure ASCII files left. Café.
Ordoñez. “Smart quotes”

-Tim