Upcase in 1.9.2-preview1

Can someone confirm whether this is intentional or not?

RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

s = "über"
=> "über"

s.upcase
=> "üBER"

That is, a lower-case “ü” is not uppercased to “Ü”. And yet, “Ü” is
detected as an upper-case letter:

"ÜBER" =~ /[[:upper:]]/
=> 0

"ÜBER" =~ /[[:lower:]]/
=> nil

Thanks,

Brian.

Hi,

On Wednesday, 29 Jul 2009, 16:43:36 +0900, Brian C. wrote:

detected as an upper-case letter:

"ÜBER" =~ /[[:upper:]]/
=> 0
"ÜBER" =~ /[[:lower:]]/
=> nil

By the way: I noticed that there are some Unicode characters where
it is not clear whether they are upper- or lowercase. For example, the
DZ digraph has a version Dz that is the downcased version of DZ
and the upcased version of dz.

U+01F1 DZ
U+01F2 Dz
U+01F3 dz

Latin Extended-B - Wikipedia

Vim’s ~ operator cycles through the three values. How will or
should Ruby treat them?

Bertram

Bertram S. wrote:

Vim’s ~ operator cycles through the three values. How will or
should Ruby treat them?

I only have access to a slightly older 1.9.2 here, but:

RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-04-08 trunk 23158) [i686-linux]"

["\u01f1", "\u01f2", "\u01f3"].each { |c| puts c =~ /[[:lower:]]/ }

0
=> ["DZ", "Dz", "dz"]

["\u01f1", "\u01f2", "\u01f3"].each { |c| puts c =~ /[[:upper:]]/ }
0

=> ["DZ", "Dz", "dz"]

So the first is upper, the third is lower, and the second is neither :)

upcase/downcase does not affect any of them - but I’m not sure if the
current behaviour is correct, which is why I started this thread.
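
One possible reading of the "neither" result: Unicode puts U+01F2 in its own general category, titlecase letter (Lt), separate from uppercase (Lu) and lowercase (Ll). A minimal sketch that checks this with Unicode property classes; the helper name `case_category` is made up for illustration, and the source file is assumed to be UTF-8:

```ruby
# encoding: utf-8

# Hypothetical helper: classify one character by Unicode general category.
def case_category(ch)
  case ch
  when /\p{Lu}/ then :upper   # uppercase letter
  when /\p{Lt}/ then :title   # titlecase letter
  when /\p{Ll}/ then :lower   # lowercase letter
  else :other
  end
end

["\u01f1", "\u01f2", "\u01f3"].each do |c|
  puts format("U+%04X %s", c.ord, case_category(c))
end
```

If that reading is right, the three code points come out as upper, title and lower respectively, which matches the POSIX bracket results above.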

2009/7/29 Brian C.:

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.

Is it the correct approach?
For me it’s very clear that the upcase version of á is Á.

Brian C. wrote:

Can someone confirm whether this is intentional or not?

s = "über"
=> "über"

s.upcase
=> "üBER"

To answer my own question: looking at the source code, it looks like
this is intentional. From encoding.c:

int
rb_enc_toupper(int c, rb_encoding *enc)
{
    return (ONIGENC_IS_ASCII_CODE(c)?ONIGENC_ASCII_CODE_TO_UPPER_CASE(c):(c));
}

int
rb_enc_tolower(int c, rb_encoding *enc)
{
    return (ONIGENC_IS_ASCII_CODE(c)?ONIGENC_ASCII_CODE_TO_LOWER_CASE(c):(c));
}

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.
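
For illustration, the same ASCII-only rule can be mimicked in plain Ruby. This is only a sketch of the observable behaviour, not how String#upcase is implemented; `ascii_only_upcase` is a made-up name and the source file is assumed to be UTF-8:

```ruby
# encoding: utf-8

# Hypothetical helper mirroring the ASCII-only rule: a-z is mapped to A-Z,
# every other character passes through unchanged.
def ascii_only_upcase(str)
  str.tr("a-z", "A-Z")
end

puts ascii_only_upcase("über")   # "ü" is outside ASCII, so it is left alone
```

With this rule "über" becomes "üBER", the same result as the irb session at the top of the thread.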

Iñaki Baz C. wrote:

2009/7/29 Brian C.:

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.

Is it the correct approach?
For me it’s very clear that the upcase version of á is Á.

There are perfectly clear Unicode rules for case conversion, but they
are not simple. In some cases you need to replace one character by two
(e.g. ß to SS).
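
To make the one-to-two case concrete, here is a small sketch. The mapping table is hand-written and deliberately tiny (a real implementation would be driven by the Unicode case-mapping data), `sketch_upcase` is a made-up name, and the source file is assumed to be UTF-8:

```ruby
# encoding: utf-8

# Hypothetical helper: look each character up in a tiny special-case table,
# falling back to the ASCII-only rule for everything else.
def sketch_upcase(str)
  special = { "ß" => "SS", "ü" => "Ü", "á" => "Á" }
  str.each_char.map { |ch| special.fetch(ch) { ch.tr("a-z", "A-Z") } }.join
end

word  = "straße"
upper = sketch_upcase(word)
puts "#{word} (#{word.length}) -> #{upper} (#{upper.length})"
```

The result is one character longer than the input, so any code that assumes upcasing preserves string length would break under such rules.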

There is a useful discussion about this from Python’s point of view
here:

Hi,

In message “Re: upcase in 1.9.2-preview1”
on Thu, 30 Jul 2009 02:13:41 +0900, Iñaki Baz C.
writes:

|> That is: only ASCII characters (potentially encoded as UTF16 or
|> whatever) are eligible for case conversion.
|
|Is it the correct approach?
|For me it’s very clear that the upcase version of á is Á.

But it’s locale dependent. In some languages, upper/lower case
conversion is not one-to-one mapping.

          matz.
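
A well-known instance of this locale dependence is Turkish, where the uppercase of "i" is "İ" (dotted) and the lowercase of "I" is "ı" (dotless). A minimal sketch of per-locale overrides, assuming a made-up helper `locale_upcase_sketch`, a hand-written table that models only this one case, and a UTF-8 source file:

```ruby
# encoding: utf-8

# Hypothetical helper: apply per-locale overrides first, then fall back to
# the ASCII-only rule. Only the Turkish dotted/dotless i is modelled.
def locale_upcase_sketch(str, locale)
  overrides = { "tr" => { "i" => "İ", "ı" => "I" } }
  table = overrides.fetch(locale, {})
  str.each_char.map { |ch| table.fetch(ch) { ch.tr("a-z", "A-Z") } }.join
end

puts locale_upcase_sketch("istanbul", "en")
puts locale_upcase_sketch("istanbul", "tr")
```

The same input gives two different answers depending on the locale argument, which is the difficulty with picking a single default.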

Yukihiro M. wrote:

But it’s locale dependent. In some languages, upper/lower case
conversion is not one-to-one mapping.

99% of the time it’s locale independent. I don’t follow this logic of
“if we can’t make it work for everyone then it should stay broken for
everyone”. Following the Unicode rules would fix uppercasing for 95% of
those not using English. And for the 5% of those who have to deal with
those asymmetric upper/lower case rules… well, they’d have to deal
with it either way.

*percentages above are guesswork