State of Unicode support

cpeterson · July 28, 2006, 5:02pm

I’ve heard rumors that “oniguruma fixes everything”, and the like. I’m
sure that’s a touch of hyperbole, but in any case:

What’s the current state of Unicode support in Ruby? My recollection is
that Unicode support is somewhat lacking.

cpeterson · July 28, 2006, 6:10pm

Oh man, I really don’t have the energy for this thread again. Chad: if you
get a straight answer about this, let me know. Others: Is there a simple,
straightforward FAQ entry somewhere that says “to use Unicode you have the
following choices”? This keeps coming up.

cpeterson · July 28, 2006, 9:15pm

On Sat, Jul 29, 2006 at 01:08:06AM +0900, Charles O Nutter wrote:

Oh man, I really don’t have the energy for this thread again. Chad: if you
get a straight answer about this, let me know. Others: Is there a simple,
straightforward FAQ entry somewhere that says “to use Unicode you have the
following choices”? This keeps coming up.

This isn’t a complete answer, but it’s the best I can do to help Chad out.
If you really want to solve the question now, Chad, I’d read Julian
Tarkhanov’s UNICODE_PRIMER[1].

First, Oniguruma[2] is a regular expression engine. It supports Unicode
regular expressions under many encodings, it’s very handy. If all you want
to do is search strings for Unicode text, then great, use it.

Ruby’s strings are not Unicode-aware. There is a library called ‘jcode’,
which comes with Ruby and tries to help out, but it’s very simple, only
good for a few things like counting characters and iterating through
characters. Again, UTF-8 only.

Ruby itself also understands UTF-8 regular expressions to a degree, using
the ‘u’ modifier. Many Ruby-based UTF-8 hacks are based on the idea of
str.scan(/./u), which returns an array of strings, each string containing
a single (possibly multibyte) character. (Also: str.unpack('U*').)
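
For readers who want to see the two idioms side by side, here is a minimal sketch. The sample string is made up for illustration, and the snippet assumes the source and the string are UTF-8; output on a given Ruby version may be displayed differently.

```ruby
# "caf\u00e9" is "café": four characters, five bytes in UTF-8.
str = "caf\u00e9"

# Split into one-character strings using the 'u' (UTF-8) regex modifier.
chars = str.scan(/./u)

# Or decode the UTF-8 bytes into integer code points.
codepoints = str.unpack('U*')

p chars       # one string per code point
p codepoints  # one integer per code point
```

Note that `.` does not match a newline by default, so the scan idiom silently drops line breaks unless the `m` modifier is added.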

If you are using Unicode strings in Rails, check out Julian’s unicode_hacks
plugin:
http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/
They have a channel on irc.freenode.net: #multibyte_rails.

The unicode_hacks plugin is interesting in that it tries to load one of
several Ruby unicode extensions before falling back to str.unpack('U*')
mode.

Here are the extensions it prefers, in order:

icu4r: a Ruby extension to IBM’s ICU library. Adds UString, URegexp, etc.
classes for containing Unicode stuffs. (project page[3] and docs[4])

utf8proc: a small library for iterating through characters and converting
ints to code points. Adds String#utf8map and Integer#utf8, for example.
(download[5])

unicode: a little extension by Yoshida Masato which adds Unicode class
methods for strcmp, [de]compose, normalization and case conversion for
utf-8. (download[6] and readme[7])

So, many options, some massive, but most only partial and in their infancy.

The most recent entrant into this race, though, is Nikolai W.'s
ruby-character-encoding library, which aims to get complete multibyte
support into Ruby 1.8’s string class. If you use it, it will probably break
a lot of libraries which are used to strings acting the way they do now.
He is trying to emulate the Ruby 2.0 Unicode plans outlined by Matz.[8]

Nevertheless, it is a very promising library and Nikolai is working at
break-neck pace to appease the nations, all tongues and peoples.[9] And
discussion is here[10] with links to the mailing list and all that.

This might be a landslide of information, but it’s better than spending all
day Googling and extracting tarballs and poring through READMEs just to get
a picture of what’s happening these days.

Signed in elaborate calligraphy with a picture of grapes at the end,

_why

[1]

[2] http://www.geocities.jp/kosako3/oniguruma/
[3] http://rubyforge.org/projects/icu4r/
[4] http://icu4r.rubyforge.org/
[5] flexiguided.de
[6] http://www.yoshidam.net/Ruby.html
[7] http://www.yoshidam.net/unicode.txt
[8] http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html
[9] http://git.bitwi.se/?p=ruby-character-encodings.git;a=summary
[10] http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAllReady.html

cpeterson · July 28, 2006, 9:39pm

So, the problem with Unicode support in Ruby is that the code
currently assumes that each letter is one byte, instead of multiple?
This presumably includes search algorithms (for regexes, et al), then?

Or is my understanding warped and wrong?

_Why, et al, if you could break down the actual difficulties with
implementing Unicode support into Ruby 1.8, I think that might clear
up the questions we have as to whether a library eradicates all
problems (obviously, some problems can’t be fixed, but merely hacked
or worked around).

Cheers, folks; remember to be nice. We’re on the same team.

M.T.

cpeterson · July 28, 2006, 10:20pm

Very nice; it should be on a wiki somewhere under the bold, flashing
headline “WHAT’S UP WITH UNICODE IN RUBY”. Thank you!

cpeterson · July 28, 2006, 11:03pm

Spectacular summary. As a lurker on this thread,
I greatly appreciate it.

cpeterson · July 28, 2006, 9:24pm

On Sat, Jul 29, 2006 at 04:13:04AM +0900, why the lucky stiff wrote:

This might be a landslide of information, but it’s better than spending all day
Googling and extracting tarballs and poring through READMEs just to get a
picture of what’s happening these days.

That was most excellent. Thank you for your kind assistance: it answers
my question quite well, and I appreciate your effort.

Signed in elaborate calligraphy with a picture of grapes at the end,

. . . and as always, you manage to entertain in the process.

cpeterson · July 31, 2006, 12:31pm

On 7/28/06, Matt T. wrote:

So, the problem with Unicode support in Ruby is that the code
currently assumes that each letter is one byte, instead of multiple?
This presumably includes search algorithms (for regexes, et al), then?

Or is my understanding warped and wrong?

Regexes in 1.8 can do utf-8.

_Why, et al, if you could break down the actual difficulties with
implementing Unicode support into Ruby 1.8, I think that might clear
up the questions we have as to whether a library eradicates all
problems (obviously, some problems can’t be fixed, but merely hacked
or worked around).

The problem is with compatibility. In 1.8 it is expected that strings
are arrays of bytes. You can split them to characters with a regex or
convert into a sequence of codepoints. But no standard library or
function would understand that (except the single one that is there
for undoing the transformation).

So you have the choice to work with utf-8 strings and regexes, and
whenever you want characters convert the strings so that you get to
characters.

Or you can use a special unicode string class (such as from icu4r)
that no standard functions understand. Some may be able to do to_s but
you get a normal string then.

Or you can change the strings to handle utf-8 (or any other multibyte)
characters, and probably break most of the standard functions.

None of these is completely satisfactory because it is far from
transparent unicode support in the standard string class. That is
planned for 2.0.
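
To make the first option concrete (work with plain UTF-8 strings and convert to code points only when characters matter), here is a rough sketch. The sample string is invented for illustration; pack('U*') is used here as the way to turn code points back into a string, which is my reading of the "single one" mentioned above rather than something the post names.

```ruby
str = "caf\u00e9"   # "café"

# Counting raw bytes versus counting code points.
byte_count = str.unpack('C*').length
char_count = str.unpack('U*').length

# Operate on code points, then convert back to a UTF-8 string.
codepoints = str.unpack('U*')
reversed   = codepoints.reverse.pack('U*')

puts "bytes: #{byte_count}, characters: #{char_count}"
puts reversed
```

Reversing the code-point array keeps the two bytes of "é" together, which a byte-by-byte reversal would not.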

Thanks

Michal

cpeterson · July 31, 2006, 4:53pm

Tim B. wrote:

like \p{L}

Off topic, what does/would that do? Match a lower-case symbol?

cpeterson · July 31, 2006, 5:15pm

On Jul 31, 2006, at 7:52 AM, Alex Y. wrote:

First, Oniguruma[2] is a regular expression engine. It supports Unicode
regular expressions under many encodings, it’s very handy. If all you want
to do is search strings for Unicode text, then great, use it.

Er uh well it doesn’t do unicode properties so you can’t use things like
\p{L}

Off topic, what does/would that do? Match a lower-case symbol?

Unicode characters have named properties. “L” means it’s a letter.
There are sub-properties like Lu and Ll for upper and lower case.
There are lots more properties for things like being numbers, being
white-space, combining forms and particular properties of Asian
characters and so on. Tremendously useful in regexes, particularly
for those of us round-eye gringos who are prone to write [a-zA-Z] and
think we’re matching letters, which we’re not. If you don’t support
properties, you don’t support Unicode. -Tim
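
As an illustration of the difference, here is a small sketch. It assumes a Ruby whose regex engine does support Unicode properties, which per this thread the 1.8 engine does not; the sample words are made up.

```ruby
word = "na\u00efve"   # "naïve"

# [a-zA-Z] only knows ASCII letters, so the "ï" makes the match fail.
ascii_only = !(word =~ /\A[a-zA-Z]+\z/).nil?

# \p{L} matches any code point with the Letter property.
any_letter = !(word =~ /\A\p{L}+\z/u).nil?

# Sub-properties: Lu is uppercase letter, Ll is lowercase letter.
upper = "\u00c9cole".scan(/\p{Lu}/u)   # "École"
lower = "\u00c9cole".scan(/\p{Ll}/u)

p ascii_only
p any_letter
p upper
p lower
```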

cpeterson · July 31, 2006, 5:25pm

On 28-jul-2006, at 21:13, why the lucky stiff wrote:

Ruby itself also understands UTF-8 regular expressions to a degree. Using
the ‘u’ modifier. Many Ruby-based UTF-8 hacks are based on the idea of:
str.scan(/./u), which returns an array of strings, each string containing a
multibyte character. (Also: str.unpack('U*').)

Which is actually useless because this breaks your string between
codepoints, not between characters. ICU4R currently resolves this, as does
a library posted on ruby-talk a while ago (with proper text boundary
handling).
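
A short sketch of the codepoint-versus-character problem. The \X escape used for comparison is an assumption about the reader's Ruby: it needs a newer regex engine than the 1.8 one discussed here, and it is not the ICU4R API.

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points, but one character as far as a reader is concerned.
str = "e\u0301"

# Splitting per code point separates the accent from its base letter.
by_codepoint = str.scan(/./u)

# \X matches a whole grapheme cluster, keeping base letter and accent together.
by_cluster = str.scan(/\X/)

p by_codepoint.length
p by_cluster.length
```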

cpeterson · July 29, 2006, 10:36am

On Jul 28, 2006, at 12:13 PM, why the lucky stiff wrote:

First, Oniguruma[2] is a regular expression engine. It supports Unicode
regular expressions under many encodings, it’s very handy. If all you want
to do is search strings for Unicode text, then great, use it.

Er uh well it doesn’t do unicode properties so you can’t use things
like \p{L} which, once you’ve found them, quickly come to feel
essential. Anytime you write [a-zA-Z] in a regex, you’ve probably
just uttered a bug. So I would say that Oniguruma has holes.

Otherwise, a very useful landslide indeed. -Tim

cpeterson · July 31, 2006, 5:38pm

On 31-jul-2006, at 17:10, Tim B. wrote:

Unicode characters have named properties. “L” means it’s a
letter. There are sub-properties like Lu and Ll for upper and
lower case. There are lots more properties for things like being
numbers, being white-space, combining forms and particular
properties of Asian characters and so on. Tremendously useful in
regexes, particularly for those of us round-eye gringos who are
prone to write [a-zA-Z] and think we’re matching letters, which
we’re not. If you don’t support properties, you don’t support
Unicode.

That’s one of the reasons why you need tables when working with
Unicode, and you will spend memory on them. What Ruby does now is
nowhere near that, and Matz wrote that he didn’t include complete tables
for Oniguruma in 1.9 yet.

With proper regex support other funky things become possible, for
instance {all_cyrillic_letters} in a regex etc.
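
A minimal sketch of that idea, assuming a Ruby regex engine that ships script tables (which, per the above, is not what 1.8 offers). \p{Cyrillic} stands in for the {all_cyrillic_letters} placeholder, and the sample text is made up.

```ruby
text = "Привет, world"

# \p{Cyrillic} matches any code point in the Cyrillic script.
cyrillic_words = text.scan(/\p{Cyrillic}+/u)

# Everything that is a letter but not Cyrillic, for comparison.
other_words = text.scan(/[\p{L}&&\P{Cyrillic}]+/u)

p cyrillic_words
p other_words
```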

cpeterson · July 31, 2006, 5:28pm

Tim B. wrote:

we’re matching letters, which we’re not. If you don’t support
properties, you don’t support Unicode. -Tim

Gotcha. Thanks for that.

cpeterson · July 31, 2006, 6:06pm

On 31-jul-2006, at 17:48, Paul B. wrote:

Whilst it’s certainly useless for a lot of tasks, I’m not sure that
Ruby is any worse than other languages in this regard. As far as I’m
aware, most languages that ‘support’ Unicode don’t handle grapheme
clusters without using additional libraries.

AFAIK Python regexps do that properly, and ICU does for sure (both as
free iterators and regexps).

Actually, that’s a really good idea. Which languages/frameworks have
you found that actually do it right? We could learn from their
example.

To my knowledge you are intimately familiar with the subject so I
take it as sarcasm.

But if you really feel like being constructive you can update the
Unicode gem (which you promised about a month ago) :-))

cpeterson · July 31, 2006, 5:51pm

On 31/07/06, Julian ‘Julik’ Tarkhanov wrote:

on ruby-talk a while ago (with proper text boundary handling).

Whilst it’s certainly useless for a lot of tasks, I’m not sure that
Ruby is any worse than other languages in this regard. As far as I’m
aware, most languages that ‘support’ Unicode don’t handle grapheme
clusters without using additional libraries.

I, for one, am very saddened every time the topic comes up because I’m
sick of the brokenness (I actually start looking at these Other
Languages and Other Frameworks that take l10n and i18n seriously).

Actually, that’s a really good idea. Which languages/frameworks have
you found that actually do it right? We could learn from their
example.

Paul.

cpeterson · July 31, 2006, 6:54pm

On 31/07/06, Julian ‘Julik’ Tarkhanov wrote:

On 31-jul-2006, at 17:48, Paul B. wrote:

Whilst it’s certainly useless for a lot of tasks, I’m not sure that
Ruby is any worse than other languages in this regard. As far as I’m
aware, most languages that ‘support’ Unicode don’t handle grapheme
clusters without using additional libraries.

AFAIK Python regexps do that properly, and ICU does for sure (both as
free iterators and regexps).

That’s what I mean: ICU is a separate library, not part of a language
core. We can use ICU in Ruby too - it’s still pre-alpha and not
seamless, but the possibility exists. From what I’ve read, Python
doesn’t do the heavyweight stuff natively, either. (Please tell me if
I’m wrong - I don’t use Python.)

Actually, that’s a really good idea. Which languages/frameworks have
you found that actually do it right? We could learn from their
example.

To my knowledge you are intimately familiar with the subject so I
take it as sarcasm.

I’m not being sarcastic at all, though perhaps I could have phrased it
better. It’s just that all Unicode discussions in Ruby end up going
round and round in circles; if we as a community could identify some
first-class examples of Doing It Right, I think we’d have some useful
yardsticks. You are someone with particularly high expectations
(rightly so) of Unicode support in a language: have you found anything
that really impressed you?

But if you really feel like being constructive you can update the
Unicode gem (which you promised about a month ago) :-))

I promised I’d try. Thanks for the reminder, though! I’ll get on with
it.

Paul.

cpeterson · August 1, 2006, 11:09am

Paul B. wrote:

first-class examples of Doing It Right, I think we’d have some useful
yardsticks. You are someone with particularly high expectations
(rightly so) of Unicode support in a language: have you found anything
that really impressed you?

I second that. I see a lot of people asking for “transparent” unicode
support but I don’t see how that is possible. To me it’s like asking for a
language that has transparent bug recovery. I know that ruby has weaknesses
when it comes to multibyte encodings, but the main problem is human in
nature; too many people assume that char==byte, which results in bugs when
someone unexpectedly uses “weird” characters. IMHO no amount of
“transparent support” will change that. But I would love to be shown
otherwise with examples of languages that “do it right”.
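
One typical char==byte bug, sketched under assumptions: truncating a string by byte count. byteslice and valid_encoding? come from Ruby versions newer than the 1.8 discussed in this thread, and the sample string is made up.

```ruby
str = "caf\u00e9"   # "café": the "é" takes two bytes in UTF-8

# Truncating by bytes cuts through the middle of the two-byte "é".
cut_bytes = str.byteslice(0, 4)

# Truncating by code points keeps every character whole.
cut_chars = str.unpack('U*').first(4).pack('U*')

p cut_bytes.valid_encoding?
p cut_chars.valid_encoding?
```

The byte-based cut leaves a dangling lead byte, which is the kind of breakage that only shows up once someone enters a non-ASCII character.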

Daniel

cpeterson · July 31, 2006, 7:17pm

On 31-jul-2006, at 18:51, Paul B. wrote:

free iterators and regexps).

That’s what I mean: ICU is a separate library, not part of a language
core.

PHP took the best of both - they are integrating ICU into the core.
Although I always hated their tendency to bloat the core, this is one of
the cases of bloat that I would want to applaud as a gesture of sanity and
common sense.

We can use ICU in Ruby too - it’s still pre-alpha and not
seamless, but the possibility exists.

Except for the fact that the maintainer has abandoned it and nobody
stepped in. I don’t do C.

From what I’ve read, Python
doesn’t do the heavyweight stuff natively, either. (Please tell me if
I’m wrong - I don’t use Python.)

It depends on what you call “heavyweight”. For the purists out there,
I gather, even including a complete Unicode table with
codepoint properties might be “heavyweight”.

To my knowledge you are intimately familiar with the subject so I
take it as sarcasm.

I’m not being sarcastic at all, though perhaps I could have phrased it
better. It’s just that all Unicode discussions in Ruby end up going
round and round in circles; if we as a community could identify some
first-class examples of Doing It Right, I think we’d have some useful
yardsticks.

The problem being, my “Right Examples” are nowhere near others’
“Right Examples”, which in turn spurs flamewars.
My “right example” is simple - Unicode on no terms, no encoding
choice, characters only - but most already are dissatisfied with such
an attitude and the issue has been discussed in detail, with no
solution satisfying all parties being devised. Too much compromise.

You are someone with particularly high expectations
(rightly so) of Unicode support in a language: have you found anything
that really impressed you?

ICU in all its incarnations (Java and C), compulsory character-oriented
Strings without choice of encoding in Java and the upcoming Unicode support
in Python (again - compulsory Unicode for all strings, byte arrays for
everything else). Perl’s regex support. I know everyone will disagree (how
do I match a PNG header in a string???) but that’s what I consider good.

As to localization - resource bundles are good, and of course I
consider good all languages that did bother to print localized dates.
Shame on Ruby.

But if you really feel like being constructive you can update the
Unicode gem (which you promised about a month ago) :-))

I promised I’d try. Thanks for the reminder, though! I’ll get on
with it.

Gotcha

cpeterson · August 1, 2006, 12:05pm

On 7/31/06, Julian ‘Julik’ Tarkhanov wrote:

clusters without using additional libraries.
that I would want to applaud as a gesture
of sanity and common sense.

Last time I looked ICU was in C++. Requiring a C++ compiler and
runtime is quite a bit of bloat.

It depends on what you call “heavyweight”. For the purists out there,
I gather, even including a complete Unicode table with
codepoint properties might be “heavyweight”.

I am not sure how large that might be. But if it is about the size of
the interpreter including the rest of the standard libraries I would
consider it “heavyweight”. It would be a reason to start “optional
standard libraries” I guess

The problem being, my “Right Examples” are nowhere near others’
“Right Examples”, which in turn spurs flamewars.
My “right example” is simple - Unicode on no terms, no encoding
choice, characters only - but most already are dissatisfied with such
an attitude and the issue has been discussed in detail, with no
solution satisfying all parties being devised. Too much compromise.

It’s been also said that giving more options does not stop you from
using only unicode. If your “right example” is only about restricting
choice then there is really not much to it.

The “right examples” people were interested in are probably more like
the libraries/languages that implement enough functionality to give
you full unicode support for your definition of “full”.

Thanks

Michal