Re: Unicode roadmap?

Almost all typical Unicode tasks can be handled with the UTF-8 support in
Regexp, Iconv, jcode and $KCODE=u, plus the unicode[1] library (as used in
unicode_hacks[2]) :)
(One caveat: case-insensitive regexps don't work for non-ASCII characters in
Ruby 1.8; that can probably be solved using the latest Oniguruma.)
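
A quick sketch of that caveat (Ruby 1.8, assuming $KCODE is set to 'u'; the
/i flag only case-folds ASCII):

  $KCODE = 'u'
  "RUBY" =~ /ruby/i    # => 0    ASCII case folding works
  "ÜBER" =~ /über/i    # => nil  non-ASCII case folding is not applied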

But if you're looking for a deeper level of "Unicode support", e.g. as
described in the Unicode FAQ[3], those problems aren't about handling Unicode
per se; they are rather L10N/I18N problems, such as locale-dependent text
breaks, collation, formatting, etc.
To deal with them from Ruby, take a look at the somewhat broken wrappers
around the ICU library: icu4r[4], g11n[5] and Ruby/CLDR[6].

And if you want Unicode as the default String encoding, and want to use
national characters in the names of your vars/functions/classes in Ruby code,
I believe it will never happen. :)

Links:
[1] http://www.yoshidam.net/Ruby.html
[2] http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/
[3] FAQ - Basic Questions
[4] http://rubyforge.org/projects/icu4r
[5] http://rubyforge.org/projects/g11n
[6] http://www.yotabanana.com/hiki/ruby-cldr.html

From: Dmitry S. [mailto:[email protected]]
Sent: Wednesday, June 14, 2006 11:20 AM

described in the Unicode FAQ[3], those problems aren't about handling
Unicode per se; they are rather L10N/I18N problems, such as locale-dependent
text breaks, collation, formatting, etc.
To deal with them from Ruby, take a look at the somewhat broken wrappers
around the ICU library: icu4r[4], g11n[5] and Ruby/CLDR[6].

Thanks Dmitry!

And if you want Unicode as the default String encoding, and want to use
national characters in the names of your vars/functions/classes in Ruby code,
I believe it will never happen. :)

Hmmm… I thought Unicode IS the default String encoding when $KCODE=u.
No?

V.

On 6/14/06, Victor S. [email protected] wrote:

Hmmm… I thought Unicode IS the default String encoding when $KCODE=u.
No?

No. The current String implementation has no notion of "encoding" (a Ruby
String is just a sequence of bytes), and $KCODE is just a hint for methods
(e.g. in Regexp) to change their behaviour and treat those bytes as text
represented in some encoding.
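
For example (a rough sketch, Ruby 1.8, with a made-up string):

  $KCODE = 'u'
  s = "héllo"            # 5 characters, 6 bytes in UTF-8
  s.length               # => 6   String counts bytes
  s =~ /é/               # => 1   Regexp honours $KCODE and matches the 2-byte char
  s.scan(/./u).length    # => 5   a common workaround to count characters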

Matz,

thanks for taking part in this discussion. I (and probably all other non-US
citizens) would really appreciate an elegant Unicode solution in Ruby from
the master himself :)

In most cases I would be happy if at least these methods of class String had
a Unicode equivalent:

capitalize
upcase
downcase
reverse
slice
split
index

Maybe it's because I am no regexp guru, but I can't imagine a trivial
solution.
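
For reference, a rough sketch of the kind of regexp-based workaround that is
possible today (Ruby 1.8, /u-flagged regexps; case mapping still needs an
external library such as the unicode library mentioned earlier, so
upcase/downcase are left out):

  str   = "привет мир"
  chars = str.scan(/./mu)    # one element per character, not per byte
  chars.size                 # character-aware length
  chars.reverse.join         # character-aware reverse
  chars[0, 6].join           # character-aware slice ("привет")
  str.index(/мир/u)          # => 13, still a byte offset, though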

Another issue is that ActiveRecord (and other libraries) are not Unicode
aware, because there is no transparent Unicode support.

Just as an example,

functions like:

ActiveRecord::Validations::ClassMethods::validates_length_of

using parameters like

minimum - The minimum size of the attribute

maximum - The maximum size of the attribute

will most probably use String.size, which gives the byte length, not the
character length.
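
A hedged illustration (the model and column are made up; if, as suspected,
the validation counts with String#size, a 10-character Cyrillic title is
rejected in Ruby 1.8):

  class Article < ActiveRecord::Base
    validates_length_of :title, :maximum => 10
  end

  a = Article.new(:title => "Заголовок!")   # 10 characters, 19 bytes in UTF-8
  a.valid?                                  # => false, since "Заголовок!".size == 19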

The Ruby 2.0 solution I read about (each string carries its encoding inside)
sounds fantastic (not to mention bytecode execution). Could you imagine an
implementation of that before Ruby 2.0?

Best regards
Peter

-------- Original Message --------
Date: Wed, 14 Jun 2006 17:38:40 +0900
From: Dmitry S. [email protected]
To: [email protected]
Subject: Re: Unicode roadmap?

On Wednesday 14 June 2006 06:01 am, Juergen S. wrote:

For my personal vision of “proper” Unicode support, I’d like to have
UTF-8 the standard internal string format, and Unicode Points the
standard character code, and all String functions to just work
intuitively “right” on a character base rather than byte base. Thus
the internal String encoding is a technical matter only, as long as it
is capable of supporting all Unicode characters, and these internal
details are not exposed via public methods.

Maybe Juergen is saying the same thing I’m going to say, but since I
don’t
understand / recall what UTF-8 encoding is exactly:

I’m beginning to think (with a newbie sort of perspective) that Unicode
is too
complicated to deal with inside a program. My suggestion would be that
Unicode be an external format…

What I mean is, when you have a program that must handle international
text,
convert the Unicode to a fixed width representation for use by the
program.
Do the processing based on these fixed width characters. When it’s
complete,
convert it back to Unicode for output.

It seems to me that would make a lot of things easier.

Then I might have two basic "types" of programs: programs that can handle
any text (i.e., international), and programs that can handle only English (or
maybe only European languages that can work with an 8-bit byte). (I suggest
these two types of programs because I suspect those that have to handle the
international character set will be slower than those that don't.)

Aside: What would it take to handle all the characters / ideographs (is that
what they call them, the Japanese, Chinese, … characters) presently in use
in the world? IIRC, 16 bits (2**16) didn't cut it for Unicode; would 32 bits?

Randy K.

On Wed, Jun 14, 2006 at 05:26:58PM +0900, Victor S. wrote:

national characters in the names of your vars/functions/classes in Ruby
code, I believe it will never happen. :)

Hmmm… I thought Unicode IS the default String encoding when $KCODE=u.
No?

V.

Strictly speaking, Unicode is not an encoding, but UTF-8 is.

For my personal vision of “proper” Unicode support, I’d like to have
UTF-8 the standard internal string format, and Unicode Points the
standard character code, and all String functions to just work
intuitively “right” on a character base rather than byte base. Thus
the internal String encoding is a technical matter only, as long as it
is capable of supporting all Unicode characters, and these internal
details are not exposed via public methods.

I/O and String functions should be able to convert to and from different
external encodings, via plugin modules. Note I don't require non-Unicode
String classes, just the possibility to do I/O with foreign character sets,
or conversion to byte arrays. Strings should consist of characters, not just
be a sequence of bytes that is meaningless without external information about
their encoding.
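
A minimal sketch of that convert-at-the-boundary idea with what exists today,
using Iconv from the 1.8 standard library (the file names and encodings are
made up):

  require 'iconv'

  latin1 = File.read('legacy.txt')                            # bytes in ISO-8859-1
  utf8   = Iconv.iconv('UTF-8', 'ISO-8859-1', latin1).first   # convert on input
  # ... work on utf8 as the internal representation ...
  File.open('out.txt', 'w') do |f|
    f.write(Iconv.iconv('ISO-8859-1', 'UTF-8', utf8).first)   # convert back on output
  end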

No ruby apps or libraries should break because they are surprised by
(Unicode) Strings, or it should be obvious the fault is with them.

Optionally, additional String classes with different internal Unicode
encodings might be a boon for certain performance sensitive
applications, and they should all work together much like Numbers of
different kinds do.

While I want ruby source files to be UTF-8 encoded, in no way do I
want identifiers to consist of additional national characters. I like
names in APIs everyone can actually type, but literal Strings is a
different matter.

I know this is a bit vague on the one hand, and might demand intrusive
changes on the other. Java's history shows proper Unicode support is no
trivial matter, and I don't feel qualified to give advice on how to implement
this. It's just my vision of how Strings would ideally be.

And of course for my personal vision to become perfect, everyone
outside Ruby should adopt Unicode too.

Jürgen

On 6/15/06, Juergen S. [email protected] wrote:
[ snip essentially accurate information ]

UTF-8 encodes every Unicode code point as a variable length sequence
of 1 to 4 (I think) bytes.

It could be up to six bytes at one point. However, I think that there
is still support for surrogate characters meaning that a single glyph
might take as many as eight bytes to represent in the 1-4 byte
representation. Even with that, though, those are rare and usually
user-defined (private) ranges IIRC. This also doesn’t deal with
(de)composed glyphs/combining glyphs.

Currently Unicode requires 21 bit, but this has changed in the past.

Yes. Unicode went from 16-bit (I think) to 32-bit to 21-bit.

Java got bitten by that by defining the character type to 16 bit and
hardcoding this in their VM, and now they need some kludges.

Um. I think that the initial Java definition used UCS-2 (same as
Windows did for NTFS and VFS) but now uses UTF-16, which has surrogate
support (UCS-2 did not).

-austin

On Thu, Jun 15, 2006 at 06:34:11AM +0900, Randy K. wrote:

understand / recall what UTF-8 encoding is exactly:
Wikipedia has decent articles on Unicode; see
http://en.wikipedia.org/wiki/Unicode

Basically, Unicode gives every character worldwide a unique number, called a
code point. Since these numbers can be quite large (currently up to 21 bits),
and especially since Western users usually only use a tiny subset, different
encodings were created to save space or to remain backward compatible with
7-bit ASCII.

UTF-8 encodes every Unicode code point as a variable-length sequence of 1 to
4 (I think) bytes. Most Western symbols only require 1 or 2 bytes. This
encoding is space efficient, and ASCII compatible as long as only 7-bit
characters are used. Certain string operations are quite hard or inefficient,
since the position of characters, or even the length of a string, given a
byte stream, is uncertain without counting actual characters (no
pointer/index arithmetic!).

UTF-32 encodes every code point as a single 32 bit word. This enables
simple, efficient substring access, but wastes space.

Other encodings have yet different characteristics, but all deal with
encoding the same code points. A Unicode String class should expose
code points, or sequences of code points (characters), not the
internal encoding used to store them and that is the core of my
argument.
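
A tiny illustration of that byte-versus-code-point distinction in today's
Ruby (1.8), using the pack/unpack codes:

  s = "Grüße"
  s.unpack('C*')   # => [71, 114, 195, 188, 195, 159, 101]   seven UTF-8 bytes
  s.unpack('U*')   # => [71, 114, 252, 223, 101]             five code points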

I’m beginning to think (with a newbie sort of perspective) that Unicode is too
complicated to deal with inside a program. My suggestion would be that
Unicode be an external format…

What I mean is, when you have a program that must handle international text,
convert the Unicode to a fixed width representation for use by the program.
Do the processing based on these fixed width characters. When it’s complete,
convert it back to Unicode for output.

UTF-32 would be such an encoding. It uses quadruple space for simple 7
bit ASCII characters, but with such a dramatically larger total
character set, some tradeoffs are unavoidable.

in the world–iirc, 16 bits (2**16) didn’t cut it for Unicode–would 32 bits?
Randy K.

Currently Unicode requires 21 bits, but this has changed in the past. Java
got bitten by that by defining the character type as 16 bits and hardcoding
this in their VM, and now they need some kludges.

A split into simple and Unicode-aware programs will divide code into two
camps, which will remain slightly incompatible or require dirty hacks. I'd
rather prolong the status quo, where Strings can be seen to contain bytes in
whatever encoding the user sees fit, but might break if used with foreign
code which has other notions of encoding.

Jürgen

On Fri, Jun 16, 2006 at 03:39:00AM +0900, Austin Z. wrote:

user-defined (private) ranges IIRC. This also doesn’t deal with
(de)composed glyphs/combining glyphs.

No. According to Wikipedia, it is up to 4 bytes for plain UTF-8 for all
characters. Only Java may need more than that, because of their use of UTF-16
surrogates and special \0 handling in an intermediary step. See

support (UCS-2 did not).

-austin

Java has its own character type apart from String. Like C's char, only it is
16 bits wide, and it is not directly related to the internal string encoding.
Note that Java strings are more than a simple sequence of objects of the
character type. And 16 bits is not enough for some Unicode characters, which
leads to the weird situation of sometimes needing two character objects to
represent a single character (via surrogates).
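
For the record, the surrogate arithmetic being described, sketched in Ruby
for a code point above U+FFFF (U+1D11E, MUSICAL SYMBOL G CLEF):

  cp   = 0x1D11E
  v    = cp - 0x10000
  high = 0xD800 + (v >> 10)     # 0xD834 (55348)
  low  = 0xDC00 + (v & 0x3FF)   # 0xDD1E (56606)
  # In UTF-16, and hence in Java, this single character occupies two 16-bit
  # code units, i.e. two Java chars.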

Jürgen

On 6/16/06, Juergen S. [email protected] wrote:

representation. Even with that, though, those are rare and usually
user-defined (private) ranges IIRC. This also doesn’t deal with
(de)composed glyphs/combining glyphs.

No. According to Wikipedia, it is up to 4 bytes for plain UTF-8 for all
characters. Only Java may need more than that, because of their use of UTF-16
surrogates and special \0 handling in an intermediary step. See

Please, do not use Wikipedia as an argument. It can contain useful
information, but it may as well contain utter nonsense. I may just go there
and change that 4 to 32. Maybe somebody will notice and correct it, maybe
not. You never know.
When reading anything on Wikipedia you should verify it against other
sources. That applies to other websites as well, but with Wikipedia you have
no clue who wrote it.

If you want to get a better idea of the quality of some Wikipedia articles,
search for Wikipedia and Seigenthaler in your favorite search engine
(preferably non-Google :).

One of the many results returned:
http://www.usatoday.com/news/opinion/editorials/2005-11-29-wikipedia-edit_x.htm

Thanks

Michal

On Fri, Jun 16, 2006 at 05:27:04PM +0900, Paul B. wrote:

No. According to Wikipedia, it is up to 4 bytes for plain UTF-8
sequences.

Compare RFC 2279, "UTF-8, a transformation format of ISO 10646", from 1998
(six bytes) with RFC 3629, "UTF-8, a transformation format of ISO 10646",
from 2003 (four bytes).

I don’t care who is technically correct here, that’s not the point.

But when working on Unicode support for Ruby, I think it would be best
to focus on the new and current standard, before worrying if we should
support obsoleted RFCs. We might take care to be open to future
changes alongside old ones, but that’s hard to predict and I wouldn’t
waste time guessing. And Ruby is much more dynamic and less vulnerable
to such changes than, for example, Java.

Jürgen

On 17/06/06, Juergen S. [email protected] wrote:

I don’t care who is technically correct here, that’s not the point.

On the contrary: it’s exactly the point in a technical discussion of
the number of bytes taken by various encodings.

But when working on Unicode support for Ruby, I think it would be best
to focus on the new and current standard, before worrying if we should
support obsoleted RFCs.

No one suggested supporting obsolete RFCs. I compared the obsolete and
current RFCs precisely so that everyone could get a clearer idea of
what constitutes the current state of UTF-8 - which is what we should
support. I hope you agree that they are more reliable sources for
technical information than Wikipedia.

Paul.

On 15/06/06, Juergen S. [email protected] wrote:

On Fri, Jun 16, 2006 at 03:39:00AM +0900, Austin Z. wrote:

It could be up to six bytes at one point. However, I think that there
is still support for surrogate characters meaning that a single glyph
might take as many as eight bytes to represent in the 1-4 byte
representation. Even with that, though, those are rare and usually
user-defined (private) ranges IIRC. This also doesn’t deal with
(de)composed glyphs/combining glyphs.

No. According to Wikipedia, it is up to 4 bytes for plain UTF-8 for all
characters. Only Java may need more than that, because of their use of UTF-16
surrogates and special \0 handling in an intermediary step. See

Austin’s correct about six bytes, actually. The original UTF-8
specification was for up to six bytes:
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

However, no codepoints were ever defined in the upper part of the
range, and once Unicode was officially restricted to the range
1-0x10FFFF, there was no longer any need for the five- and six-byte
sequences.
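
A quick way to check those boundaries from Ruby 1.8 (a rough sketch, packing
code points into UTF-8 with Array#pack('U') and counting the resulting
bytes):

  [0x7F].pack('U').size       # => 1   last 1-byte code point
  [0x7FF].pack('U').size      # => 2   last 2-byte code point
  [0xFFFF].pack('U').size     # => 3   last 3-byte code point
  [0x10FFFF].pack('U').size   # => 4   top of the Unicode range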

Compare RFC 2279 from 1998 (six bytes)

and RFC 3629 from 2003 (four bytes)

That Java encoding (UTF-8-encoded UTF-16) isn’t really UTF-8, though,
so you’d never get eight bytes in valid UTF-8:

The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16
encoding form (as surrogate pairs) and do not directly represent
characters. (RFC 3629)

Paul.

On 6/16/06, Juergen S. [email protected] wrote:

representation. Even with that, though, those are rare and usually

Currently Unicode requires 21 bit, but this has changed in the past.

Yes. Unicode went from 16-bit (I think) to 32-bit to 21-bit.

Well, there is the official http://unicode.org/ site no one has
mentioned so far.

There’s all sorts of technical information on Unicode.
http://www.unicode.org/reports/index.html

Including the latest version:
http://www.unicode.org/versions/Unicode4.1.0/

On Sat, Jun 17, 2006 at 06:05:20PM +0900, Paul B. wrote:

On 17/06/06, Juergen S. [email protected] wrote:

I don’t care who is technically correct here, that’s not the point.

On the contrary: it’s exactly the point in a technical discussion of
the number of bytes taken by various encodings.

The discussion is about a Unicode Roadmap for Ruby. The number of
bytes per UTF-8 encoded character is tangential to this.

Paul.

If you can point to an official and current standard which proves me wrong
in my statement of 1-4 bytes per plain UTF-8 encoded character, I'll concede
my point. Please don't bring in combining characters; you know what I mean by
now. If you like, s/character/code point/g.

Merely pointing to a concise and actually well-written Wikipedia article
with a nice summary is way more informative than using obsoleted RFCs to
reinforce one's own argument. Besides, we all know how reliable Wikipedia is.
End of discussion from my side.

Jürgen

If you can point to an official and current standard which proves me wrong
in my statement of 1-4 bytes per plain UTF-8 encoded character, I'll concede
my point. Please don't bring in combining characters; you know what I mean by
now. If you like, s/character/code point/g.

You are correct. And why Wikipedia? www.unicode.org has it all: see
UTR #17, the Unicode Character Encoding Model, and the reports that follow
it.

Um, hi everyone. I’m a Rubie newby but very, very old hand at
Unicode & text processing. I wrote all those articles Charles Nutter
pointed to the other day. I spent years doing full-text search for
a living, and adapted a popular engine to handle Japanese text, and
co-edited the XML spec and helped work out its character-encoding
issues. Lots more war stories on request.

Anyhow, I have some ideas about what good ways to do text processing
in a language like Ruby might be, but I thought for the moment I’d
just watch this interesting debate go by and serve as an information
resource.

On Jun 15, 2006, at 11:17 AM, Juergen S. wrote:

UTF-8 encodes every Unicode code point as a variable length sequence
of 1 to 4 (I think) bytes.

UTF-8 can do the 1,114,112 Unicode codepoints in 4 bytes. We
probably don’t need any more codepoints until we meet alien
civilizations.

Most western symbols only require 1 or 2
bytes. This encoding is space efficient

UTF-8 is racist. The further East you go, the less efficient it is
to store text. Having said that, it has a lot of other advantages.
Also, when almost every storage device is increasingly being used for
audio and video, at megabytes per minute, it may be the case that the
efficiency of text storage is less likely to be a bottleneck.
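
To make that concrete, the UTF-8 byte counts for single characters from a few
scripts (Ruby 1.8, where String#size counts bytes; a rough sketch):

  "a".size                    # => 1   ASCII
  "é".size                    # => 2   Latin-1 range
  "あ".size                   # => 3   Japanese hiragana
  [0x20000].pack('U').size    # => 4   a CJK Extension B ideograph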

Java got bitten by that by defining the character type to 16 bit and
hardcoding this in their VM, and now they need some kludges.

Java screwed up, with the result that a Java (and C#) "char" represents a
UTF-16 code unit. Blecch.

-Tim

On 17/06/06, Dmitrii D. [email protected] wrote:

Well, there is the official http://unicode.org/ site no one has
mentioned so far.

There's all sorts of technical information on Unicode:
http://www.unicode.org/reports/index.html

Including the latest version:
http://www.unicode.org/versions/Unicode4.1.0/

Good point. Unfortunately, a lot of it is only available as PDFs of
each chapter. The bookmarks help:
http://www.unicode.org/versions/Unicode4.0.0/bookmarks.html

The technical reports are really useful:
http://www.unicode.org/reports/index.html

Paul.