Unicode roadmap?

On 15-jun-2006, at 3:50, Charles O Nutter wrote:

operations work correctly treating those characters as strings,
would that be a better ideal? Where are the breaking points in such
a design? What’s to stop the underlying implementation from
actually using a UTF-16 character, passing UTF-8 to libraries and
IO streams but still allowing you to access everything as UTF-16 or
your encoding of choice? (Of course this is somewhat rhetorical; we
do this currently with JRuby since Java’s strings are UTF-16…we
just don’t have any way to provide access to UTF-16 characters, and
we normalize everything to UTF-8 for Ruby’s sake…but what if we
didn’t normalize and adjusted string functions to compensate?)

This is more appropriate for ruby-talk


Julian ‘Julik’ Tarkhanov
please send all personal mail to
me at julik.nl

I agree it’s a very attractive solution. I have two related questions
(perhaps you are out there to answer, Julik):

  1. How does performance look with the unicode string add-on versus
    native strings?
  2. Is this the ideal way to support unicode strings in ruby?

And I explain the second as follows…if we could switch from treating a
string as an array of bytes to a list of characters of arbitrary width,
and have all existing string operations work correctly treating those
characters as strings, would that be a better ideal? Where are the
breaking points in such a design? What’s to stop the underlying
implementation from actually using a UTF-16 character, passing UTF-8 to
libraries and IO streams but still allowing you to access everything as
UTF-16 or your encoding of choice? (Of course this is somewhat
rhetorical; we do this currently with JRuby since Java’s strings are
UTF-16…we just don’t have any way to provide access to UTF-16
characters, and we normalize everything to UTF-8 for Ruby’s sake…but
what if we didn’t normalize and adjusted string functions to
compensate?)

I believe that Julik’s way of solving the unicode problem (String#u
providing access to a unicode helper) is very attractive. I have two
related questions, for Julik and the rest of the peanut gallery:

  1. How does performance look with the unicode string add-on versus
    native strings (or as compared to icu4r, which is C-based)?
  2. Is this the ideal way to support unicode strings in ruby?

And I explain the second as follows…if we could switch from treating a
string as an array of bytes to a list of characters of arbitrary width,
and have all existing string operations work correctly treating those
characters as indexed elements of that string, would that be a better
ideal? Where are the breaking points in such a design? What’s to stop
the underlying implementation from actually using a UTF-16 character,
passing UTF-8 to libraries and IO streams but still allowing you to
access everything as UTF-16 or your encoding of choice? Is it simply
libraries or core APIs that explicitly need byte counts? (Of course
this is somewhat rhetorical; we do this currently with JRuby since
Java’s strings are UTF-16…we just don’t have any uniform way to provide
access to UTF-16 character strings, and we normalize everything to
UTF-8 for Ruby’s sake…but what if we didn’t normalize and adjusted
string functions to compensate?)

Fair enough; redirected. If any other rails-core folks want to chime in,
please do so…I would expect unicode and multibyte are key issues for
worldwide rails deployments.

On 6/14/06, Charles O Nutter [email protected] wrote:

I believe that Julik’s way of solving the unicode problem (String#u
providing access to a unicode helper) is very attractive. I have two
related questions, for Julik and the rest of the peanut gallery:

  1. How does performance look with the unicode string add-on versus native
    strings (or as compared to icu4r, which is C-based)?
  2. Is this the ideal way to support unicode strings in ruby?

No. In fact, I believe that Matz has the right idea for M17N strings
in Ruby 2.0. The reality is that there’s a lot of data out there
that isn’t Unicode.

I would suggest that JRuby could offer a JavaString that acts in every
way like a String except that it provides access to the native UTF-16
implementation.
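
A minimal sketch of what such a JavaString might expose (purely
illustrative; this is not JRuby’s actual API, and the UTF-8 decoding via
unpack('U*') stands in for direct access to Java’s internal UTF-16
data):

    # illustrative sketch only, not JRuby's real JavaString
    class JavaString < String
      # Return the UTF-16 code units of this string as Fixnums, assuming
      # the underlying bytes are valid UTF-8. Codepoints above U+FFFF
      # become surrogate pairs, as in Java's char[] representation.
      def utf16_units
        unpack('U*').map { |cp|
          if cp < 0x10000
            cp
          else
            cp -= 0x10000
            [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
          end
        }.flatten
      end
    end

    JavaString.new("h\xC3\xA9llo").utf16_units # => [104, 233, 108, 108, 111]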

-austin

On 15-jun-2006, at 4:40, Austin Z. wrote:

No. In fact, I believe that Matz has the right idea for M17N strings
in Ruby 2.0. The reality is that there’s a lot of data out there
that isn’t Unicode.

It’s very difficult for me to understand the implementation. What if
we concat a Mojikyo string to a UTF8String? UnicodeDecodeError,
ordinal not in range?
I think Python folks proved that it’s terrible (it is).
Nothing is ideal.

I would suggest that JRuby could offer a JavaString that acts in every
way like a String except that it provides access to the native UTF-16
implementation.

Just what the ICU4R extension does. It’s unusable to the point that
you cannot concat a native string with a UString, to the point that
you have to use a special Regexp class for it. You end up having half
of your Ruby script doing typecasting from one to the other.

There is a lot of data that isn’t Unicode, indeed. It gets converted on
input and converted on output if necessary - just as in any other case
when the encoding of your system doesn’t match your input or output. I
don’t know if it is possible to have the “internal” encoding of a
system switchable (seems to me this is what Matz wants) - then you
can’t safely refer to anything other than bytes. And then you get
software that you can’t use, because it had a different assumption than
you had as to what encoding the user will be using.

On 6/14/06, Austin Z. [email protected] wrote:

in Ruby 2.0. The reality is that there’s a lot of data out there
that isn’t Unicode.

Yes, we all understand that Ruby 2.0 will be the coolest thing since
sliced bread, but those of us that are currently developing
international websites with Rails don’t have the luxury of waiting
until Christmas of 2007.

-PJ Hyett
http://pjhyett.com

On 6/14/06, PJ Hyett [email protected] wrote:

that isn’t Unicode.
Yes, we all understand that Ruby 2.0 will be the coolest thing since
sliced bread, but those of us that are currently developing
international websites with Rails don’t have the luxury of waiting
until Christmas of 2007.

*shrug*

As far as I can tell, there will be no implementation of Ruby before
then that has a “native” m17n string.

So whether you have the luxury of waiting or not, Ruby 1.8.x will not
ever have a “Unicode string”.

Adding a “Unicode string” would break behaviour, and no example is
better than the extension that was proposed which would change the
meaning of #size and #length to mean two different things.

So, there’s a point where patience is going to be necessary, whether
you “have the luxury” or not.

-austin

On 6/14/06, Austin Z. [email protected] wrote:

individual string level so that you could meaningfully have Unicode
(probably UTF-8) and ShiftJIS strings in the same data and still
meaningfully call #length on them.

You will always have to care about the encoding. As well as,
ultimately, your locale.

No. Since I have a locale, stdin can be marked with the proper encoding
information so that all strings originating there have the proper
encoding information.

The string methods should not just blindly operate on bytes but use
the encoding information to operate on characters rather than bytes.
Sure something like byte_length is needed when the string is stored
somewhere outside Ruby but standard string methods should work with
character offsets and characters, not byte offsets nor bytes.

Since my stdout can also be marked with the correct encoding, the
strings that are output there can be converted to that encoding. Even
if a string originates from a source file that happens to be in a
different encoding.
Hmm, perhaps it will be necessary to mark source files with encoding
tags as well. It could be quite tedious to assign the tag manually to
every string in a source file.
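
A sketch of the stdin/stdout marking just described, to make the data
flow concrete (the encoding attribute on IO and Encoding.from_locale
are assumed, hypothetical API, not real Ruby):

    # hypothetical API throughout this sketch
    STDIN.encoding  = Encoding.from_locale  # e.g. derived from $LANG
    line = STDIN.gets                       # arrives tagged with that encoding
    STDOUT.encoding = Encoding::ISO_8859_1
    STDOUT.puts line                        # converted on output as needed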

When strings are compared, concatenated, … the encoding is known so
the methods should do the right thing.

I do not have to care about encoding. You may make a string
implementation that forces me to care (such as the current one). But I
do not have to. I can always turn to perl if I get really desperate.

Thanks

Michal

IIRC, Matz has said that internally String won’t change, and I suspect
that a CharString class (or something like it) won’t ever be added.

Maybe just introducing a String#encoding flag and adding new methods to
String with prefixes, like char_array, char_slice, char_length,
char_index, char_downcase, char_strcoll, char_strip, etc., that will
internally look at the encoding flag and process the bytes in this
particular string accordingly, without conversion (just maybe some
hidden), and leaving old byte-processing methods intact, would be the
way to keep older code working and enjoy M17N?

Though, as for me, it is still unclear what should happen if one tries
to perform an operation on two strings with different String#encoding…
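
A minimal sketch of the prefixed-method idea in 1.8-era Ruby, assuming
a hypothetical encoding flag and handling only the UTF-8 case (the
method names follow the proposal above; everything else is an
illustrative assumption):

    class String
      attr_accessor :encoding   # the proposed flag (hypothetical), e.g. :utf8

      # character count when flagged UTF-8; plain #length stays byte-wise
      def char_length
        encoding == :utf8 ? unpack('U*').length : length
      end

      # character-wise slice for the UTF-8 case
      def char_slice(from, len)
        return self[from, len] unless encoding == :utf8
        unpack('U*')[from, len].pack('U*')
      end
    end

    s = "caf\xC3\xA9"      # "café" as UTF-8 bytes
    s.encoding = :utf8
    s.char_length          # => 4 characters
    s.length               # => 5 bytes, unchanged behaviour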

On 6/15/06, Julian ‘Julik’ Tarkhanov [email protected] wrote:

  1. Preferably separate (and strictly purposed) Bytestring that you
    get out of Sockets and use in Servers etc. - or the ability to
“force” all strings received from external resources to be flagged
    uniformly as being of a certain encoding in your program, not
    somewhere in someone’s library. If flags have to be set by libraries,
    they won’t be set because most developers sadly don’t care:

http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
unicodification

Where else should the strings be flagged? If you get a web page
through http request, and the library parses the response for you, it
should set encoding on the web page. You would never know since you
only received the page, not the header.

All of this can be controlled either per String (then 99 out of 100
libraries I use will be getting it wrong - see above) or by a global
setting such as $KCODE.
I do not see why libraries should always be wrong. After all, you can
always fix them. And setting the encoding globally is a bad thing. You
cannot have strings encoded in different encodings in one process then.
It looks quite limiting. For one, the web pages that you get from
various servers (and even the same server) can be in various encodings.

Thanks

Michal

On 15-jun-2006, at 13:21, Michal S. wrote:

http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
unicodification

Where else should the strings be flagged?
They should not be flagged, because some strings will be flagged and
some won’t, and exactly in the wrong places at the wrong time. See
is_utf8 in Perl to witness the terrible ugliness of this.

If you get a web page
through http request, and the library parses the response for you, it
should set encoding on the web page. You would never know since you
only received the page, not the header.

That’s why you should distinguish between a ByteArray and a String.

All of this can be controlled either per String (then 99 out of 100
libraries I use will be getting it wrong - see above) or by a global
setting such as $KCODE.

I do not see why libraries should always be wrong. After all, you can
always fix them. And setting the encoding globally is a bad thing. You
cannot have strings encoded in different encodings in one process then.
It looks quite limiting. For one, the web pages that you get from
various servers (and even the same server) can be in various encodings.

Of course they can (and will). When I have to approach this I usually
just sniff the encoding of the strings I received and then feed them to
iconv and friends before doing any processing. A library that downloads
stuff off the Internet should be (IMO) aware of the charset madness and
decode the strings for me.

Trust me, when multibyte/Unicode handling is optional, 80% of
libraries do it wrong. Re-read the links above if you don’t believe.

Actually it seems that the solution with an accessor is quite nice, but
that is something I had to figure out the hard way, after breaking the
String class with my hacks and seeing stuff collapse. Apparently the
poster of a parallel thread finds it inspiring to repeat my experiment
in vitro just for the academic sake of it.

On 6/15/06, Julian ‘Julik’ Tarkhanov [email protected] wrote:

somewhere in someone’s library. If flags have to be set by libraries,
they won’t be set because most developers sadly don’t care:

http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
unicodification

Where else should the strings be flagged?
They should not be flagged, because some strings will be flagged and
some won’t, and exactly in the wrong places at the wrong time. See
is_utf8 in Perl to witness the terrible ugliness of this.

You can certainly get things wrong. But if you get a string that is
wrongly flagged you have the choice to fix the code where the string
originates or work around it by flagging it right.
If you have code that gets the encoding wrong, and it tries to convert
the string to some ‘universal’ encoding you want to use everywhere in
your application, you get a broken string.

If you get a web page
through http request, and the library parses the response for you, it
should set encoding on the web page. You would never know since you
only received the page, not the header.

That’s why you should distinguish between a ByteArray and a String.

How does it help you here?

All of this can be controlled either per String (then 99 out of 100
libraries I use will be getting it wrong - see above) or by a global
setting such as $KCODE.

Of course they can (and will). When I have to approach this I usually
just sniff the encoding of the strings I received and then feed them to
iconv and friends before doing any processing. A library that downloads
stuff off the Internet should be (IMO) aware of the charset madness and
decode the strings for me.

If it can decode them, it can flag them. It has to be aware - that’s it.

Trust me, when multibyte/Unicode handling is optional, 80% of
libraries do it wrong. Re-read the links above if you don’t believe.

But they get the very foundation wrong. In Python, functions that take
multiple strings can only take them in one encoding. It is impossible
to concatenate differently encoded strings. Of course, this is bound
to fail.
In the other case they use a database with poor support for unicode,
and mysql, which does exactly the same thing ruby does right now - it
works with strings as arrays of bytes. Of course, this is going to
break.

Neither is the case when the strings carry information about their
encoding, and the string functions can handle strings encoded
differently.

The fact that there are libraries and languages with poor unicode
support does not mean it must be always poor.

Thanks

Michal

On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal S. wrote:

stream is encoded. This will sort of be like $KCODE but on an
individual string level so that you could meaningfully have Unicode
(probably UTF-8) and ShiftJIS strings in the same data and still
meaningfully call #length on them.

The string methods should not just blindly operate on bytes but use
the encoding information to operate on characters rather than bytes.
Sure something like byte_length is needed when the string is stored
somewhere outside Ruby but standard string methods should work with
character offsets and characters, not byte offsets nor bytes.

I emphatically agree. I’ll even repeat and propose a new Plan for
Unicode Strings in Ruby 2.0 in 10 points:

  1. Strings should deal in characters (code points in Unicode) and not
    in bytes, and the public interface should reflect this.

  2. Strings should neither have an internal encoding tag, nor an
    external one via $KCODE. The internal encoding should be encapsulated
    by the string class completely, except for a few related classes which
    may opt to work with the gory details for performance reasons.
    The internal encoding has to be decided, probably between UTF-8,
    UTF-16, and UTF-32 by the String class implementor.

  3. Whenever Strings are read or written to/from an external source,
    their data needs to be converted. The String class encapsulates the
    encoding framework, likely with additional helper Modules or Classes
    per external encoding. Some methods take an optional encoding
    parameter, like #char(index, encoding=:utf8), or
    #to_ary(encoding=:utf8), which can be used as helper Class or Module
    selector (a sketch of this interface follows the list).

  4. IO instances are associated with a (modifiable) encoding. For
    stdin, stdout this can be derived from the locale settings. String-IO
    operations work as expected.

  5. Since the String class is quite smart already, it can implement
    generally useful and hard (in the domain of Unicode) operations like
    case folding, sorting, comparing etc.

  6. More exotic operations can easily be provided by additional
    libraries because of Ruby’s open classes. Those operations may be
    coded depending on String’s public interface for simplicity, or
    work with the internal representation directly for performance.

  7. This approach leaves open the possibility of String subclasses
    implementing different internal encodings for performance/space
    tradeoff reasons which work transparently together (a bit like
    Fixnum and Bignum).

  8. Because Strings are tightly integrated into the language with the
    source reader and are used pervasively, much of this cannot be
    provided by add-on libraries, even with open classes. Therefore the
    need to have it in Ruby’s canonical String class. This will break some
    old uses of String, but now is the right time for that.

  9. The String class does not worry over character representation
    on-screen, the mapping to glyphs must be done by UI frameworks or the
    terminal attached to stdout.

  10. Be flexible.
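
For concreteness, here is the sketch of the point-3 interface promised
above, with a hard-coded UTF-8 internal representation standing in for
whatever the String implementor chooses (the signatures are from the
proposal; the bodies are illustrative assumptions only):

    class String
      # character at index, delivered in the requested external encoding
      # (bodies here are assumptions; only :utf8 is sketched)
      def char(index, encoding = :utf8)
        raise ArgumentError, 'only :utf8 sketched' unless encoding == :utf8
        [unpack('U*')[index]].pack('U')
      end

      # array of single-character strings in the given encoding
      def to_ary(encoding = :utf8)
        raise ArgumentError, 'only :utf8 sketched' unless encoding == :utf8
        unpack('U*').map { |cp| [cp].pack('U') }
      end
    end

    "caf\xC3\xA9".char(3) # => "é" (one character, two bytes)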

This approach has several advantages and a few disadvantages, and I’ll
try to bring in some new angles to this now too:

Advantages

-POL, Encapsulation-

All Strings behave exactly the same everywhere, are predictable,
and do the hard work for their users.

-Cross Library Transparency-

No String user needs to worry which Strings to pass to a library, or
worry which Strings he will get from a library. With Web-facing
libraries like rails returning encoding-tagged Strings, you would be
likely to get Strings of all possible encodings otherwise, and is the
String user prepared to deal with this properly? This is a big deal
IMNSHO.

-Limited Conversions-

Encoding conversions are limited to the time Strings are created or
written or explicitly transformed to an external representation.

-Correct String Operations-

Even basic String operations are very hard in the world of Unicode. If
we leave the String users to look at the encoding tags and sort it out
themselves, they are bound to make mistakes because they don’t care,
don’t know, or have no time. And these mistakes may be security
sensitive, since most often credentials are represented as Strings
too. There already have been exploits related to Unicode.

Disadvantages (with mitigating reasoning of course)

  • String users need to learn that #byte_length(encoding=:utf8) >=
    #size, but that’s not too hard, and applies everywhere. Users do not
    need to learn about an encoding tag, which is surely worse to handle
    for them.

  • Strings cannot be used as simple byte buffers any more. Either use
    an array of bytes, or an optimized ByteBuffer class. If you need
    regular expression support, RegExp can be extended for ByteBuffers or
    even more.

  • Some String operations may perform worse than might be expected by
    a naive user, in both the time and space domains. But we do this so
    the String user doesn’t need to himself, and we are probably better
    at it than the user too.

  • For very simple uses of String, there might be unnecessary
    conversions. If a String is just to be passed through somewhere,
    without inspecting or modifying it at all, inward and outward
    conversion will still take place. You could and should use a
    ByteBuffer to avoid this.

  • This ties Ruby’s String to Unicode. A safe choice IMHO, or would we
    really consider something else? Note that we don’t commit to a
    particular encoding of Unicode strongly.

  • More work and time to implement. Some could call it
    over-engineered. But it will save a lot of time and troubles when shit
    hits the fan and users really do get unexpected foreign characters in
    their Strings. I could offer help implementing it, although I have
    never looked at ruby’s source, C-extensions, or even done a lot of
    ruby programming yet.

Close to the start of this discussion Matz asked what the problem with
current strings really was for western users. Somewhere later he
concluded case folding. I think it is more than that: we are lazy and
expect character handling to be always as easy as with 7 bit ASCII, or
as close as possible. Fixed 8-bit codepages worked quite fine most of
the time in this regard, and breakage was limited to special
characters only.

Now let’s ask the question in reverse: are eastern programmers so used
to doing elaborate byte-stream-to-character handling by hand that they
don’t recognize how hard this is any more? Surely it is a target for
DRY if I ever saw one. Or are there actual problems not solvable this
way? I looked up the mentioned Han-Unification issue, and as far as I
understood it, this could be handled by future Unicode revisions
allocating more characters, outside of Ruby, but I don’t see how it
requires our Strings to stay dumb byte buffers.

Jürgen

On Saturday 17 June 2006 13:08, Juergen S. wrote:

On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal S. wrote:
[…]

The string methods should not just blindly operate on bytes but
use the encoding information to operate on characters rather than
bytes. Sure something like byte_length is needed when the string
is stored somewhere outside Ruby but standard string methods
should work with character offsets and characters, not byte
offsets nor bytes.

I emphatically agree. I’ll even repeat and propose a new Plan for
Unicode Strings in Ruby 2.0 in 10 points:

Juergen, I agree with most of what you have written. I will
add my thoughts.

  1. Strings should deal in characters (code points in Unicode) and
    not in bytes, and the public interface should reflect this.

  2. Strings should neither have an internal encoding tag, nor an
    external one via $KCODE. The internal encoding should be
    encapsulated by the string class completely, except for a few
    related classes which may opt to work with the gory details for
    performance reasons. The internal encoding has to be decided,
    probably between UTF-8, UTF-16, and UTF-32 by the String class
    implementor.

Full ACK. Ruby programs shouldn’t need to care about the
internal string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.

  3. Whenever Strings are read or written to/from an external source,
    their data needs to be converted. The String class encapsulates the
    encoding framework, likely with additional helper Modules or
    Classes per external encoding. Some methods take an optional
    encoding parameter, like #char(index, encoding=:utf8), or
    #to_ary(encoding=:utf8), which can be used as helper Class or
    Module selector.

I think the encoding/decoding API should be separated from the
String class. IMO, the most important change is to strictly
differentiate between arbitrary binary data and character
data. Character data is represented by an instance of the
String class.

I propose adding a new core class, maybe call it ByteString
(or ByteBuffer, or Buffer, whatever) to handle strings of
bytes.

Given a specific encoding, the encoding API converts
ByteStrings to Strings and vice versa.

This could look like:

my_character_str = Encoding::UTF8.encode(my_byte_buffer)
buffer = Encoding::UTF8.decode(my_character_str)

  4. IO instances are associated with a (modifiable) encoding. For
    stdin, stdout this can be derived from the locale settings.
    String-IO operations work as expected.

I propose one of:

  1. A low level IO API that reads/writes ByteBuffers. String IO
    can be implemented on top of this byte-oriented API.

    The basic binary IO methods could look like:

    binfile = BinaryIO.new("/some/file", "r")
    buffer = binfile.read_buffer(1024) # read 1K of binary data

    binfile = BinaryIO.new("/some/file", "w")
    binfile.write_buffer(buffer) # Write the byte buffer

    The standard File class (or IO module, whatever) has an
    encoding attribute. The default value is set by the
    constructor by querying OS settings (on my Linux system
    this could be $LANG):

    # read strings from /some/file, assuming it is encoded
    # in the system's default encoding
    text_file = File.new("/some/file", "r")
    contents = text_file.read

    # alternatively one can explicitly set an encoding
    # before the first read/write
    text_file = File.new("/some/file", "r")
    text_file.encoding = Encoding::UTF8

    The File class (or IO module) will probably use a BinaryIO
    instance internally.

  2. The File class/IO module as of current Ruby just gets
    additional methods for binary IO (through ByteBuffers) and
    an encoding attribute. The methods that do binary IO don’t
    need to care about the encoding attribute.

I think 1) is cleaner.

  5. Since the String class is quite smart already, it can implement
    generally useful and hard (in the domain of Unicode) operations
    like case folding, sorting, comparing etc.

If the strings are represented as a sequence of Unicode
codepoints, it is possible for external libraries to implement
more advanced Unicode operations.

Since IMO a new “character” class would be overkill, I propose
that the String class provides codepoint-wise iteration (and
indexing) by representing a codepoint as a Fixnum. AFAIK a
Fixnum consists of 31 bits on a 32 bit machine, which is
enough to represent the whole range of unicode codepoints.
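
A sketch of that codepoint-wise iteration in 1.8-era Ruby, assuming a
UTF-8 internal representation (each_codepoint is the proposed method
here, not an existing 1.8 API):

    class String
      # proposed method (hypothetical in 1.8): yield each character
      # as its Fixnum codepoint
      def each_codepoint
        unpack('U*').each { |cp| yield cp }
      end
    end

    "Fran\303\247ais".each_codepoint { |cp| print cp, ' ' }
    # prints: 70 114 97 110 231 97 105 115  ("ç" is the single codepoint 231)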

  6. More exotic operations can easily be provided by additional
    libraries because of Ruby’s open classes. Those operations may be
    coded depending on String’s public interface for simplicity,
    or work with the internal representation directly for performance.

  7. This approach leaves open the possibility of String subclasses
    implementing different internal encodings for performance/space
    tradeoff reasons which work transparently together (a bit like
    Fixnum and Bignum).

I think providing different internal String representations
would be too much work, especially for maintenance in the long
run.

  10. Be flexible.

The advantages of this proposal over the current situation and tagging
a string with an encoding are:
  • There is only one internal string (where string means a
    string of characters) representation. String operations
    don’t need to be written for different encodings.

  • No need for $KCODE.

  • Higher abstraction.

  • Separation of concerns. I always found it strange that most
    dynamic languages simply mix handling of character and
    arbitrary binary data (just think of pack/unpack).

  • Reading of character data in one encoding and representing
    it in other encoding(s) would be easy.

It seems that the main argument against using Unicode strings
in Ruby is because Unicode doesn’t work well for eastern
countries. Perhaps there is another character set that works
better that we could use instead of Unicode. The important
point here is that there is only one representation of
character data in Ruby.

If Unicode is chosen as the character set, there is the
question which encoding to use internally. UTF-32 would be a
good choice with regards to simplicity in implementation,
since each codepoint takes a fixed number of bytes. Consider
indexing of Strings:

    "some string"[4]

If UTF-32 is used, this operation can internally be
implemented as a simple, constant array lookup. If UTF-16 or
UTF-8 is used, this is not possible to implement as an array
lookup, since any codepoint before the fifth could occupy more
than one (8 bit or 16 bit) unit. Of course there is the
argument against UTF-32 that it takes too much memory. But I
think that most text-processing done in Ruby spends much more
memory on other data structures than on actual character data
(just consider an REXML document), but I haven’t measured that ;)
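
To make the indexing argument concrete: with a variable-width encoding,
finding where character n starts means walking the lead bytes, as in
this illustrative sketch (not how any real implementation must do it):

    # byte offset of character n in an array of valid UTF-8 byte values
    def utf8_byte_offset(bytes, n)
      i = 0
      n.times do
        lead = bytes[i]
        i += if lead < 0x80    # single-byte (ASCII)
               1
             elsif lead < 0xE0 # two-byte sequence
               2
             elsif lead < 0xF0 # three-byte sequence
               3
             else              # four-byte sequence
               4
             end
      end
      i # found in O(n) steps, versus a single array lookup for UTF-32
    end

    utf8_byte_offset("caf\xC3\xA9s".unpack('C*'), 4) # => 5 ("s" is byte 5)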

An advantage of using UTF-8 would be that for pure ASCII files
no conversion would be necessary for IO.

Thank you for reading so far. Just in case Matz decides to
implement something similar to this proposal, I am willing to
help with Ruby development (although I don’t know much about
Ruby’s internals and not too much about Unicode either).

I do not have a CS degree and I’m not a Unicode expert, so
perhaps the proposal is garbage, in this case please tell me
what is wrong about it or why it is not realistic to implement
it.

On 17-jun-2006, at 15:52, Austin Z. wrote:

  8. Because Strings are tightly integrated into the language with the
    source reader and are used pervasively, much of this cannot be
    provided by add-on libraries, even with open classes. Therefore the
    need to have it in Ruby’s canonical String class. This will break
    some old uses of String, but now is the right time for that.

“Now” isn’t; Ruby 2.0 is. Maybe Ruby 1.9.1.

Most probably wise, but I need casefolding and character classes to
work since yesteryear.
Oniguruma is there, but even if you compile with it (which is still
not the default) you don’t get char classes (AFAIK) and you don’t get
casefolding. Case-insensitive search/replace quickly becomes bondage.

I am maintaining a gem whose test fails due to different regexps in
Oniguruma, but I would be able to quickly fix it knowing that
Oniguruma is in stable now.

  10. Be flexible.

And little is more flexible than Matz’s m17n String.

I couldn’t find a proper description of that - as I said already, the
thing I’d least prefer would be

get a string from the database

p str + my_unicode_chars # Ok, bail out with an ugly exception
because the author of the DB adaptor didn’t care to send me proper
Strings…

If strings in the system are allowed to have varying encodings, I
don’t understand how the engine is going to upgrade/downgrade strings
automatically.
Especially remembering that the receiver is on the left, so I
actually might get different exceptions going as I do

p my_unicode_chars + mojikyo_str # who wins?

or

p mojikyo_str + my_unicode_chars # who wins?

or (especially)

p mojikyo_str +
  bytestring_that_i_just_grabbed_by_http_and_i_know_it_is_mojikyo_but_its_not # who wins?

On 6/17/06, Juergen S. [email protected] wrote:

I emphatically agree. I’ll even repeat and propose a new Plan for
Unicode Strings in Ruby 2.0 in 10 points:

  1. Strings should deal in characters (code points in Unicode) and not
    in bytes, and the public interface should reflect this.

Agree, mostly. Strings should have a way to indicate the buffer size of
the String.

  2. Strings should neither have an internal encoding tag, nor an
    external one via $KCODE. The internal encoding should be encapsulated
    by the string class completely, except for a few related classes which
    may opt to work with the gory details for performance reasons.
    The internal encoding has to be decided, probably between UTF-8,
    UTF-16, and UTF-32 by the String class implementor.

Completely disagree. Matz has the right choice on this one. You can’t
think in just terms of a pure Ruby implementation – you must think
in terms of the Ruby/C interface for extensions as well.

  3. Whenever Strings are read or written to/from an external source,
    their data needs to be converted. The String class encapsulates the
    encoding framework, likely with additional helper Modules or Classes
    per external encoding. Some methods take an optional encoding
    parameter, like #char(index, encoding=:utf8), or
    #to_ary(encoding=:utf8), which can be used as helper Class or Module
    selector.

Conversion should be possible at any time. An “external source” may be
an extension that your Ruby program can’t distinguish. Again, this point
fails because your #2 is unacceptable.

  4. IO instances are associated with a (modifiable) encoding. For
    stdin, stdout this can be derived from the locale settings. String-IO
    operations work as expected.

Agree, realising that the internal implementation of String must be
completely different than you’ve suggested. It is also important to
retain raw reading; a JPEG should not be interpreted as Unicode.

  5. Since the String class is quite smart already, it can implement
    generally useful and hard (in the domain of Unicode) operations like
    case folding, sorting, comparing etc.

Agreed, but this would be expected regardless of the actual encoding of
a String.

  6. More exotic operations can easily be provided by additional
    libraries because of Ruby’s open classes. Those operations may be
    coded depending on String’s public interface for simplicity, or
    work with the internal representation directly for performance.

Agreed.

  7. This approach leaves open the possibility of String subclasses
    implementing different internal encodings for performance/space
    tradeoff reasons which work transparently together (a bit like
    Fixnum and Bignum).

Um. Disagree. Matz’s proposed approach does this; yours does not. Yours,
in fact, makes things much harder.

  8. Because Strings are tightly integrated into the language with the
    source reader and are used pervasively, much of this cannot be
    provided by add-on libraries, even with open classes. Therefore the
    need to have it in Ruby’s canonical String class. This will break some
    old uses of String, but now is the right time for that.

“Now” isn’t; Ruby 2.0 is. Maybe Ruby 1.9.1.

  9. The String class does not worry over character representation
    on-screen, the mapping to glyphs must be done by UI frameworks or the
    terminal attached to stdout.

The String class doesn’t worry about that now.

  10. Be flexible.

And little is more flexible than Matz’s m17n String.

This approach has several advantages and a few disadvantages, and I’ll
try to bring in some new angles to this now too:

Advantages

-POL, Encapsulation-

All Strings behave exactly the same everywhere, are predictable,
and do the hard work for their users.

Remember: POLS is not an acceptable reason for anything. Matz’s m17n
Strings would be predictable, too. a + b would be possible if and only
if a and b are the same encoding or one of them is “raw” (which would
mean that the other is treated as the defined encoding) or there is a
built-in conversion for them.
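
Expressed as code, that rule might look like the following (a sketch of
the semantics just described, not Matz’s actual implementation; the
conversion table is an assumed example):

    # a + b is allowed iff encodings match, one side is raw, or a
    # built-in conversion between the two encodings exists
    # (assumed example table, not a real registry)
    CONVERSIONS = [[:shift_jis, :utf8], [:iso_8859_1, :utf8]]

    def concat_allowed?(a_enc, b_enc)
      a_enc == b_enc || a_enc == :raw || b_enc == :raw ||
        CONVERSIONS.include?([a_enc, b_enc]) ||
        CONVERSIONS.include?([b_enc, a_enc])
    end

    concat_allowed?(:utf8, :raw)     # => true, raw adopts the other encoding
    concat_allowed?(:mojikyo, :utf8) # => false, no conversion registered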

-Cross Library Transparency-
No String user needs to worry which Strings to pass to a library, or
worry which Strings he will get from a library. With Web-facing
libraries like rails returning encoding-tagged Strings, you would be
likely to get Strings of all possible encodings otherwise, and is the
String user prepared to deal with this properly? This is a big deal
IMNSHO.

This will be true with m17n strings. However, your proposal does not
work for Ruby/C interfaced items. Sorry.

-Limited Conversions-

Encoding conversions are limited to the time Strings are created or
written or explicitly transformed to an external representation.

This is a mistake. I may need to know the internal representation of a
particular encoding of a String inside of a program. Trust me on this
one: I have done some low-level encoding work. Additionally, even
though I might have marked a network object as “UTF-8”, I may not know
whether it’s actually UTF-8 or not until I get HTTP headers – or
worse, a <meta> tag. Assuming UTF-8 reading in today’s world
is doomed to failure.

-Correct String Operations-
Even basic String operations are very hard in the world of Unicode. If
we leave the String users to look at the encoding tags and sort it out
themselves, they are bound to make mistakes because they don’t care,
don’t know, or have no time. And these mistakes may be security
sensitive, since most often credentials are represented as Strings
too. There already have been exploits related to Unicode.

This is a misunderstanding on your part. Nothing about Matz’s m17n
Strings suggests that String users would have to look at the encoding
tags. Merely that they could. I suspect that there will be pragma-like
behaviours to enforce a particular internal representation at all
times.

Disadvantages (with mitigating reasoning of course)

  • String users need to learn that #byte_length(encoding=:utf8) >=
    #size, but that’s not too hard, and applies everywhere. Users do not
    need to learn about an encoding tag, which is surely worse to handle
    for them.

True, but the encoding tag is not worse. Anyone who assumes that
developers can ignore encoding at any time simply doesn’t know about
the level of problems that can be encountered.

  • Strings cannot be used as simple byte buffers any more. Either use
    an array of bytes, or an optimized ByteBuffer class. If you need
    regular expression support, RegExp can be extended for ByteBuffers or
    even more.

I see no reason for this.

  • Some String operations may perform worse than might be expected by
    a naive user, in both the time and space domains. But we do this so
    the String user doesn’t need to himself, and we are probably better
    at it than the user too.

This is a wash.

  • For very simple uses of String, there might be unnecessary
    conversions. If a String is just to be passed through somewhere,
    without inspecting or modifying it at all, inward and outward
    conversion will still take place. You could and should use a
    ByteBuffer to avoid this.

This is a wash.

  • This ties Ruby’s String to Unicode. A safe choice IMHO, or would we
    really consider something else? Note that we don’t commit to a
    particular encoding of Unicode strongly.

This is a wash. I think that it’s better to leave the options open.
After all, it is a hope of mine to have Ruby running on iSeries
(AS/400) and that still uses EBCDIC.

  • More work and time to implement. Some could call it over-engineered.
    But it will save a lot of time and troubles when shit hits the fan and
    users really do get unexpected foreign characters in their Strings. I
    could offer help implementing it, although I have never looked at
    ruby’s source, C-extensions, or even done a lot of ruby programming
    yet.

I would call it the amount of work necessary. But the work needs to be
done for a variety of encodings, and not just Unicode. Especially
because of C extensions.

Close to the start of this discussion Matz asked what the problem with
current strings really was for western users. Somewhere later he
concluded case folding. I think it is more than that: we are lazy and
expect character handling to be always as easy as with 7 bit ASCII, or
as close as possible. Fixed 8-bit codepages worked quite fine most of
the time in this regard, and breakage was limited to special
characters only.

Now let’s ask the question in reverse: are eastern programmers so used
to doing elaborate byte-stream-to-character handling by hand that they
don’t recognize how hard this is any more? Surely it is a target for
DRY if I ever saw one. Or are there actual problems not solvable this
way? I looked up the mentioned Han-Unification issue, and as far as I
understood it, this could be handled by future Unicode revisions
allocating more characters, outside of Ruby, but I don’t see how it
requires our Strings to stay dumb byte buffers.

No one has ever suggested that Ruby Strings stay byte buffers. However,
blindly choosing Unicode adds unnecessary complexity to the situation.

-austin

On 6/17/06, Stefan L. [email protected] wrote:

Full ACK. Ruby programs shouldn’t need to care about the
internal string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.

This is incorrect. Most Ruby programs won’t need to care about the
internal string encoding. Experience suggests, however, that it is only
most, definitely not all.

Given a specific encoding, the encoding API converts
ByteStrings to Strings and vice versa.

This could look like:

my_character_str = Encoding::UTF8.encode(my_byte_buffer)
buffer = Encoding::UTF8.decode(my_character_str)

Unnecessarily complex and inflexible. Before you go too much further, I
really suggest that you look in the archives and Google to find more
about Matz’s m17n String proposal. It’s a really good one, as it allows
developers (both pure Ruby and extension) to choose what is appropriate
with the ability to transparently convert as well.

  4. IO instances are associated with a (modifiable) encoding. For
    stdin, stdout this can be derived from the locale settings.
    String-IO operations work as expected.

I propose one of:

  1. A low level IO API that reads/writes ByteBuffers. String IO
    can be implemented on top of this byte-oriented API.

[…]

  2. The File class/IO module as of current Ruby just gets
    additional methods for binary IO (through ByteBuffers) and
    an encoding attribute. The methods that do binary IO don’t
    need to care about the encoding attribute.

I think 1) is cleaner.

I think neither is necessary and both would be a mistake. It is, as I
indicated to Juergen, sometimes impossible to determine the encoding
to be used for an IO until you have some data from the IO already.

  5. Since the String class is quite smart already, it can implement
    generally useful and hard (in the domain of Unicode) operations like
    case folding, sorting, comparing etc.

If the strings are represented as a sequence of Unicode codepoints, it
is possible for external libraries to implement more advanced Unicode
operations.

This would be true regardless of the encoding.

Since IMO a new “character” class would be overkill, I propose that
the String class provides codepoint-wise iteration (and indexing) by
representing a codepoint as a Fixnum. AFAIK a Fixnum consists of 31
bits on a 32 bit machine, which is enough to represent the whole range
of unicode codepoints.

This does not match what Matz will be doing.

str = "Fran\303\247ais"
str[5] # → "\303\247"

This is better than doing a Fixnum representation. It is character
iteration, but each character is, itself, a String.

  7. This approach leaves open the possibility of String subclasses
    implementing different internal encodings for performance/space
    tradeoff reasons which work transparently together (a bit like
    Fixnum and Bignum).

I think providing different internal String representations would be
too much work, especially for maintenance in the long run.

If you’re depending on classes to do that, especially given that Ruby’s
String, Array, and Hash classes don’t inherit well, you’re right.

The advantages of this proposal over the current situation and
tagging a string with an encoding are:

The problem, of course, is that this proposal – and your take on it –
don’t account for the m17n String that Matz has planned. The current
situation is a mess. But the current situation is not what is planned.
I’ve had to do some encoding work for work in the last two years, and
while I prefer a UTF-8/UTF-16 internal representation, I also know
that’s impossible in some situations and you have to be flexible. I
also know that POSIX handles this situation worse than any other
setup.

With the work that I’ve done on this, Matz is right about this, and
the people claiming that Unicode is the Only Way … are wrong. In an
ideal world, Unicode would be the correct and only way. In the real
world, however, it’s a lot messier, and Ruby has to be aware of that.

We can still make it as easy as possible for the common case (which
will be UTF-8 encoding data and filenames). But we shouldn’t make the
mistake of assuming that the common case is all that Ruby should handle.

  • There is only one internal string (where string means a
    string of characters) representation. String operations
    don’t need to be written for different encodings.

This is still (mostly) correct under the m17n String proposal.

  • No need for $KCODE.

This is true under the m17n String.

  • Higher abstraction.

This is true under the m17n String.

  • Separation of concerns. I always found it strange that most dynamic
    languages simply mix handling of character and arbitrary binary data
    (just think of pack/unpack).

The separation makes things harder most of the time.

  • Reading of character data in one encoding and representing it in
    other encoding(s) would be easy.

This is true under the m17n String.

It seems that the main argument against using Unicode strings in Ruby
is because Unicode doesn’t work well for eastern countries. Perhaps
there is another character set that works better that we could use
instead of Unicode. The important point here is that there is only
one representation of character data in Ruby.

This is a mistake.

If Unicode is chosen as the character set, there is the question which
encoding to use internally. UTF-32 would be a good choice with regards
to simplicity in implementation, since each codepoint takes a fixed
number of bytes. Consider indexing of Strings:

Yes, but this would be very hard on memory requirements. There are
people who are trying to get Ruby to fit into small-memory environments.
This would destroy any chance of that.

[…]

Thank you for reading so far. Just in case Matz decides to implement
something similar to this proposal, I am willing to help with Ruby
development (although I don’t know much about Ruby’s internals and not
too much about Unicode either).

I would suggest that you look for discussions about m17n Strings in
Ruby. Matz has this one right.

I do not have a CS degree and I’m not a Unicode expert, so perhaps the
proposal is garbage, in this case please tell me what is wrong about
it or why it is not realistic to implement it.

I don’t have a CS degree either, but I have been in the business for a
long time and I’ve been immersed in Unicode and encoding issues for
the last two years. If everyone used Unicode – and POSIX weren’t stupid
– your proposal would be much more realistic. I agree that Ruby
should encourage the use of Unicode as much as is practical. But it also
shouldn’t tie our hands like other programming languages do.

-austin

On 6/17/06, Julian ‘Julik’ Tarkhanov [email protected] wrote:

(AFAIK) and you don’t get casefolding. Case-insensitive search/replace
quickly becomes bondage.

I don’t disagree. But you’re not going to get those features, in all
likelihood, in a Ruby 1.8.x release. It would be a breaking release.
Oniguruma is the default for Ruby 1.9+. If there are things missing,
work with the developer.

I am maintaining a gem whose test fails due to different regexps in
Oniguruma, but I would be able to quickly fix it knowing that
Oniguruma is in stable now.

I don’t think that Oniguruma is in stable (1.8.x); I don’t think it
will be enabled as default in stable. Again, it’s a breaking change.

  10. Be flexible.

And little is more flexible than Matz’s m17n String.

I couldn’t find a proper description of that - as I said already, the
thing I’d least prefer would be

get a string from the database

p str + my_unicode_chars # Ok, bail out with an ugly exception
because the author of the DB adaptor didn’t care to send me proper
Strings…

The DB adaptor, of course, will have to look at the encoding that the DB
is using.

p mojikyo_str + my_unicode_chars # who wins?

or (especially)

p mojikyo_str +
  bytestring_that_i_just_grabbed_by_http_and_i_know_it_is_mojikyo_but_its_not # who wins?

Consider coercion in Numerics (ri Numeric#coerce). A similar framework
can be built for Strings.
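
A sketch of what that could look like for encoding-tagged strings;
EncodedString and its Iconv-based conversion are assumptions for
illustration, not a real or proposed API:

    require 'iconv' # 1.8-era stdlib converter

    # illustrative class, not a real API
    class EncodedString
      attr_reader :bytes, :encoding

      def initialize(bytes, encoding)
        @bytes, @encoding = bytes, encoding
      end

      # like Numeric#coerce: bring the other operand into the receiver's
      # encoding so the operation can then proceed uniformly
      def coerce(other)
        converted = Iconv.conv(encoding, other.encoding, other.bytes)
        [EncodedString.new(converted, encoding), self]
      end

      def +(other)
        other, _ = coerce(other) unless other.encoding == encoding
        EncodedString.new(bytes + other.bytes, encoding)
      end
    end

    utf8  = EncodedString.new("caf\xC3\xA9 ", "UTF-8")
    latin = EncodedString.new("cr\xEApe", "ISO-8859-1")
    (utf8 + latin).encoding # => "UTF-8"; the Latin-1 operand was converted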

-austin

On 17/06/06, Austin Z. [email protected] wrote:

  • This ties Ruby’s String to Unicode. A safe choice IMHO, or would we
    really consider something else? Note that we don’t commit to a
    particular encoding of Unicode strongly.

This is a wash. I think that it’s better to leave the options open.
After all, it is a hope of mine to have Ruby running on iSeries
(AS/400) and that still uses EBCDIC.

Not to mention that Matz has explicitly stated in the past that he
wants Ruby to support other encodings (TRON, Mojikyo, etc.) that
aren’t compatible with a Unicode internal representation.

Not tying String to Unicode is also the right thing to do: it allows
for future developments. Java’s weird encoding system is entirely down
to the fact that it standardised on UCS-2; when codepoints beyond
65535 arrived, they had to be shoehorned in via an ugly hack. As far
as possible, Ruby should avoid that trap.

Paul.