String#upcase/downcase with UTF-8 strings in Ruby 1.9

stefans · July 10, 2008, 12:14am

Hello,

in Ruby 1.9 I get the following behaviour:

>> "aoueäöüé".upcase
=> "AOUEäöüé"
>> "AOUEÄÖÜÉ".downcase
=> "aoueÄÖÜÉ"

However, I can’t find a corresponding entry in the bug tracking system.
Doesn’t this qualify as a bug?

Cheers, Stefan

stefans · July 10, 2008, 1:34am

Hi,

In message “Re: String#upcase/downcase with UTF-8 strings in Ruby 1.9”
on Thu, 10 Jul 2008 07:09:29 +0900, “Stefan S.” writes:

The document for String#upcase says:

call-seq:
str.upcase => new_str

Returns a copy of str with all lowercase letters replaced with their
uppercase counterparts. The operation is locale insensitive—only
characters "a" to "z" are affected.
Note: case replacement is effective only in ASCII region.

 "hEllO".upcase   #=> "HELLO"

See “Note:”. Tim B. has persuaded me to do so, since case
conversion outside of ASCII region is highly dependent on country,
language, culture and script.

          matz.
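
For illustration, the ASCII-only rule described in the documentation above can be reproduced with String#tr. This is only a minimal sketch of the observable behaviour, not a claim about how upcase is implemented internally; the helper names ascii_upcase and ascii_downcase are made up for this example.

```ruby
# encoding: utf-8

# ASCII-only case conversion: only "a".."z" / "A".."Z" are mapped,
# every other character is passed through unchanged.
# (ascii_upcase / ascii_downcase are hypothetical helper names.)
def ascii_upcase(str)
  str.tr("a-z", "A-Z")
end

def ascii_downcase(str)
  str.tr("A-Z", "a-z")
end

puts ascii_upcase("aoueäöüé")    # AOUEäöüé
puts ascii_downcase("AOUEÄÖÜÉ")  # aoueÄÖÜÉ
```

The output matches what the original post reports for upcase and downcase in Ruby 1.9.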

stefans · July 10, 2008, 3:22am

The document for String#upcase says:

Yes, sorry, I should have read the documentation

See “Note:”. Tim B. has persuaded me to do so, since case
conversion outside of ASCII region is highly dependent on country,
language, culture and script.

So basically the Python guys are going down the wrong route?

# -*- coding: utf-8 -*-

import string
print string.upper(u"aoueäöüé")
print string.lower(u"AOUEÄÖÜÉ")

works as expected.

Cheers, Stefan

stefans · July 10, 2008, 3:30am

On Jul 9, 2008, at 8:17 PM, Stefan S. wrote:

# -*- coding: utf-8 -*-

import string
print string.upper(u"aoueäöüé")
print string.lower(u"AOUEÄÖÜÉ")

works as expected.

Cheers, Stefan

No.
They’re going down a different route.
Seriously, the language handling is something that could easily be
handled by extensions. It does not need to be a core part of the
language.
Even operating systems handle these things with proprietary and very
sophisticated techniques based on the language in question.
In most cases, what you are expecting to be the correct upper case
characters may be ‘correct’ but it will ultimately depend on the
language and the context.
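
To make the language dependence concrete, here is a small sketch. It assumes Ruby 2.4 or later, which postdates this thread; in those versions String#upcase and String#downcase apply Unicode case mapping by default and accept options such as :ascii and :turkic. None of this is available in the Ruby 1.9 discussed here.

```ruby
# Assumes Ruby 2.4+ (not the Ruby 1.9 discussed in this thread).

# German: the uppercase form of "ß" is two characters,
# so the string length changes under case conversion.
puts "straße".upcase            # STRASSE

# Turkish: dotted and dotless i are distinct letters,
# so the "right" answer depends on the language.
puts "i".upcase                 # I  (default mapping)
puts "i".upcase(:turkic)        # İ  (Turkic mapping)
puts "I".downcase(:turkic)      # ı

# The ASCII-only behaviour from the documentation is still selectable.
puts "aoueäöüé".upcase(:ascii)  # AOUEäöüé
```

The same input character maps to different results depending on the option chosen, which is the kind of ambiguity the reply above is pointing at.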

stefans · July 10, 2008, 2:24am

On Jul 9, 2008, at 6:25 PM, Yukihiro M. wrote:


This leaves the perfect opening for people to contribute locale- or
language-specific extensions to String.
It would make a great gem with a plug-in architecture.
Just add options for the language you want to use.
In any case it can get very tricky to do character conversions with
different languages.
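
A rough sketch of what such a pluggable extension might look like follows. Everything in it is hypothetical: the module name LocaleCase, its register/upcase/downcase methods, and the tiny per-language tables are invented for illustration and do not correspond to any existing gem. It also only handles one-to-one character mappings, so cases like German "ß" are deliberately left out.

```ruby
# encoding: utf-8

# Hypothetical plug-in style case conversion: each language registers
# the extra lowercase letters it uses and their uppercase counterparts.
module LocaleCase
  # language => [lowercase letters, uppercase letters] (same order, same length)
  TABLES = {
    :de => ["äöü", "ÄÖÜ"]
  }

  # Add or replace the table for a language.
  def self.register(lang, lower, upper)
    TABLES[lang] = [lower, upper]
  end

  # Upcase ASCII letters plus the letters registered for +lang+.
  # Raises KeyError if the language has not been registered.
  def self.upcase(str, lang)
    lower, upper = TABLES.fetch(lang)
    str.tr("a-z" + lower, "A-Z" + upper)
  end

  # Downcase ASCII letters plus the letters registered for +lang+.
  def self.downcase(str, lang)
    lower, upper = TABLES.fetch(lang)
    str.tr("A-Z" + upper, "a-z" + lower)
  end
end

# A second "plug-in": a small, incomplete French table.
LocaleCase.register(:fr, "àâçéèêëîïôùû", "ÀÂÇÉÈÊËÎÏÔÙÛ")

puts LocaleCase.upcase("aoueäöüé", :de)  # AOUEÄÖÜé  ("é" is not in the :de table)
puts LocaleCase.upcase("été", :fr)       # ÉTÉ
```

Keeping the tables as plain data means a new language can be added without touching the conversion code, at the cost of not covering context-dependent or one-to-many mappings.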

stefans · July 10, 2008, 5:43pm

No.
They’re going down a different route.
Seriously, the language handling is something that could easily be
handled by extensions. It does not need to be a core part of the
language.

Is Nikolai W.'s Ruby Character Encodings Library [1] currently the
best way to go?

Stefan

[1] http://bitwi.se/software/ruby/character-encodings/

stefans · July 11, 2008, 7:34am

Seriously, the language handling is something that could easily be
handled by extensions. It does not need to be a core part of the
language.

Are there any working extensions for Ruby 1.9 that offer Unicode support
for String#downcase/upcase and/or Array#sort?

Stefan