Upcase in 1.9.2-preview1

stickstone · July 29, 2009, 9:41am

Can someone confirm whether this is intentional or not?

RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

s = "über"
=> "über"

s.upcase
=> "üBER"

That is, a lower-case “ü” is not uppercased to “Ü”. And yet, “Ü” is
detected as an upper-case letter:

"ÜBER" =~ /[[:upper:]]/
=> 0

"ÜBER" =~ /[[:lower:]]/
=> nil

Thanks,

Brian.

stickstone · July 29, 2009, 12:08pm

Hi,

Am Mittwoch, 29. Jul 2009, 16:43:36 +0900 schrieb Brian C.:

detected as an upper-case letter:

"ÜBER" =~ /[[:upper:]]/
=> 0
"ÜBER" =~ /[[:lower:]]/
=> nil

By the way: I detected that there are some unicode characters where
it is not clear whether they are up- or downcase. For example the
DZ digraph has a version Dz that is the downcased version of DZ
and the upcased version of dz.

U+01F1 Ǳ
U+01F2 ǲ
U+01F3 ǳ

Latin Extended-B - Wikipedia

Vim’s ~ operator cycles through the three values. How will or
should Ruby treat them?

Bertram

stickstone · July 29, 2009, 12:21pm

Bertram S. wrote:

Vim’s ~ operator cycles through the three values. How will or
should Ruby treat them?

I only have access to a slightly older 1.9.2 here, but:

RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-04-08 trunk 23158) [i686-linux]"

["\u01f1", "\u01f2", "\u01f3"].each { |c| puts c =~ /[[:lower:]]/ }

0
=> ["Ǳ", "ǲ", "ǳ"]

["\u01f1", "\u01f2", "\u01f3"].each { |c| puts c =~ /[[:upper:]]/ }
0

=> ["Ǳ", "ǲ", "ǳ"]

So the first is upper, the third is lower, and the second is neither.

upcase/downcase does not affect any of them - but I’m not sure if the
current behaviour is correct, which is why I started this thread.
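
The "neither" result for U+01F2 fits with Unicode placing that character in the titlecase-letter category (Lt) rather than Lu or Ll. A minimal sketch of checking this directly, assuming a Ruby whose regexp engine accepts general-category properties such as \p{Lt} on UTF-8 strings; the helper name letter_case is made up for illustration:

```ruby
# Hypothetical helper: classify one character by its Unicode general category.
def letter_case(ch)
  case ch
  when /\A\p{Lu}\z/ then :upper   # Uppercase_Letter
  when /\A\p{Lt}\z/ then :title   # Titlecase_Letter
  when /\A\p{Ll}\z/ then :lower   # Lowercase_Letter
  else :other
  end
end

# The three DZ digraph code points discussed in this thread.
["\u01f1", "\u01f2", "\u01f3"].each do |c|
  puts format("U+%04X %s", c.ord, letter_case(c))
end
```

Under those assumptions the middle code point should be reported as :title, which is why neither [[:upper:]] nor [[:lower:]] matches it.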

stickstone · July 29, 2009, 7:14pm

2009/7/29 Brian C.:

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.

Is it the correct approach?
For me it’s very clear that the upcase version of á is Á.

stickstone · July 29, 2009, 12:32pm

Brian C. wrote:

Can someone confirm whether this is intentional or not?

s = "über"
=> "über"

s.upcase
=> "üBER"

To answer my own question: looking at the source code, it looks like
this is intentional. From encoding.c:

int
rb_enc_toupper(int c, rb_encoding *enc)
{
    return (ONIGENC_IS_ASCII_CODE(c)?ONIGENC_ASCII_CODE_TO_UPPER_CASE(c):(c));
}

int
rb_enc_tolower(int c, rb_encoding *enc)
{
    return (ONIGENC_IS_ASCII_CODE(c)?ONIGENC_ASCII_CODE_TO_LOWER_CASE(c):(c));
}

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.
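
In Ruby terms, the observable effect of those two C functions can be approximated with tr over the ASCII letter range. This is only a sketch of the behaviour, not the interpreter's actual code path, and ascii_upcase is a made-up name:

```ruby
# Hypothetical helper mimicking ASCII-only case conversion:
# only a-z are mapped, every other character passes through unchanged.
def ascii_upcase(str)
  str.tr("a-z", "A-Z")
end

s = "\u00fcber"        # the word from the first post, written with an escape
puts ascii_upcase(s)   # the u-umlaut is left alone, the ASCII letters are upcased
```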

stickstone · July 29, 2009, 7:28pm

Iñaki Baz C. wrote:

2009/7/29 Brian C.:

That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.

Is it the correct approach?
For me it’s very clear that the upcase version of á is Á.

There are perfectly clear Unicode rules for case conversion, but they
are not simple. In some cases you need to replace one character by two
(e.g. ß to SS).
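
The one-to-two case is what rules out a plain character-for-character conversion. A toy sketch, assuming a hand-written table (SPECIAL_UPPER and table_upcase are illustrative names, and the table covers only three characters, nowhere near the full Unicode data):

```ruby
# Illustrative, deliberately tiny mapping; real Unicode data is far larger.
SPECIAL_UPPER = {
  "\u00fc" => "\u00dc", # u-umlaut -> U-umlaut
  "\u00e1" => "\u00c1", # a-acute  -> A-acute
  "\u00df" => "SS"      # sharp s  -> SS (one character becomes two)
}.freeze

# Look each character up in the table, falling back to ASCII-only upcasing.
def table_upcase(str)
  str.each_char.map { |ch| SPECIAL_UPPER.fetch(ch) { ch.tr("a-z", "A-Z") } }.join
end

word = "stra\u00dfe"                    # German word containing a sharp s
up   = table_upcase(word)
puts "#{word.length} -> #{up.length}"   # the result is one character longer
```

Because the output can be longer than the input, an in-place, fixed-width conversion is not enough once such mappings are included.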

There is a useful discussion about this from Python’s point of view
here:

stickstone · July 29, 2009, 7:33pm

Hi,

In message “Re: upcase in 1.9.2-preview1”
on Thu, 30 Jul 2009 02:13:41 +0900, Iñaki Baz C.
writes:

|> That is: only ASCII characters (potentially encoded as UTF16 or
|> whatever) are eligible for case conversion.
|
|Is it the correct approach?
|For me it’s very clear that the upcase version of á is Á.

But it’s locale dependent. In some languages, upper/lower case
conversion is not one-to-one mapping.

          matz.
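
Turkish is the usual illustration of the locale point: lower-case i upcases to a dotted capital I (U+0130), and the dotless lower-case i (U+0131) upcases to a plain I. A rough sketch of what a locale-aware conversion has to do, with LOCALE_UPPER and locale_upcase as purely hypothetical names and an ASCII-only default standing in for the real rules:

```ruby
# Hypothetical per-locale overrides, layered over an ASCII-only default.
LOCALE_UPPER = {
  :tr => { "i" => "\u0130", "\u0131" => "I" }  # Turkish dotted / dotless i
}.freeze

# Apply the overrides for the given locale, if any, then fall back to ASCII.
def locale_upcase(str, locale = nil)
  overrides = LOCALE_UPPER.fetch(locale, {})
  str.each_char.map { |ch| overrides.fetch(ch) { ch.tr("a-z", "A-Z") } }.join
end

puts locale_upcase("iris")        # default rule
puts locale_upcase("iris", :tr)   # Turkish rule gives a dotted capital I
```

The same input yields two different results depending on the locale argument, so the string alone does not determine the answer.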

stickstone · August 3, 2009, 2:54pm

Yukihiro M. wrote:

But it’s locale dependent. In some languages, upper/lower case
conversion is not one-to-one mapping.

99% of the time it’s locale independent. I don’t follow this logic of
“if we can’t make it work for everyone then it should stay broken for
everyone”. Following the Unicode rules would fix uppercasing for 95% of
those not using English. And for the 5% of those who have to deal with
those asymmetric upper/lower case rules… well, they’d have to deal
with it either way.

*percentages above are guesswork