Can someone confirm whether this is intentional or not?
RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"
s = "über"
=> "über"
s.upcase
=> "üBER"
That is, a lower-case “ü” is not uppercased to “Ü”. And yet, “Ü” is
detected as an upper-case letter:
"ÜBER" =~ /[[:upper:]]/
=> 0
"ÜBER" =~ /[[:lower:]]/
=> nil
Thanks,
Brian.
Hi,
On Wednesday, 29 Jul 2009, 16:43:36 +0900, Brian C. wrote:
detected as an upper-case letter:
"ÜBER" =~ /[[:upper:]]/
=> 0
"ÜBER" =~ /[[:lower:]]/
=> nil
By the way: I noticed that for some Unicode characters it is not
clear whether they are upper- or lowercase. For example, the
DZ digraph has a form Dz that is both the downcased version of DZ
and the upcased version of dz.
U+01F1 DZ
U+01F2 Dz
U+01F3 dz
Latin Extended-B - Wikipedia
Vim’s ~ operator cycles through the three values. How will or
should Ruby treat them?
Bertram
Bertram S. wrote:
Vim’s ~ operator cycles through the three values. How will or
should Ruby treat them?
I only have access to a slightly older 1.9.2 here, but:
RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-04-08 trunk 23158) [i686-linux]"
["\u01f1", "\u01f2", "\u01f3"].each { |c| puts c =~ /[[:lower:]]/ }
0
=> ["DZ", "Dz", "dz"]
["\u01f1", "\u01f2", "\u01f3"].each { |c| puts c =~ /[[:upper:]]/ }
0
=> ["DZ", "Dz", "dz"]
So the first is upper, the third is lower, and the second is neither.
upcase/downcase does not affect any of them, but I’m not sure whether the
current behaviour is correct, which is why I started this thread.
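A possible explanation for the "neither" result is that Unicode puts U+01F2 in a third general category, titlecase letter (Lt), alongside uppercase (Lu) and lowercase (Ll). The snippet below is only a sketch of how to check that with Ruby's `\p{...}` regexp properties; `case_category` is a made-up helper name, not part of Ruby.

```ruby
# The three case forms of the DZ digraph, U+01F1..U+01F3.
DIGRAPHS = ["\u01F1", "\u01F2", "\u01F3"]

# Hypothetical helper: classify a single character by Unicode general category.
def case_category(ch)
  case ch
  when /\A\p{Lu}\z/ then :upper   # Lu = uppercase letter
  when /\A\p{Lt}\z/ then :title   # Lt = titlecase letter
  when /\A\p{Ll}\z/ then :lower   # Ll = lowercase letter
  else :other
  end
end

DIGRAPHS.each { |ch| puts format("U+%04X %s", ch.ord, case_category(ch)) }
```

Under that reading, the digraph is not ambiguous so much as three-valued: one code point per case form.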
2009/7/29 Brian C.:
That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.
Is it the correct approach?
For me it’s very clear that the upcase version of á is Á.
Brian C. wrote:
Can someone confirm whether this is intentional or not?
s = "über"
=> "über"
s.upcase
=> "üBER"
To answer my own question: looking at the source code, it looks like
this is intentional. From encoding.c:
int
rb_enc_toupper(int c, rb_encoding *enc)
{
    return (ONIGENC_IS_ASCII_CODE(c)?ONIGENC_ASCII_CODE_TO_UPPER_CASE(c):(c));
}

int
rb_enc_tolower(int c, rb_encoding *enc)
{
    return (ONIGENC_IS_ASCII_CODE(c)?ONIGENC_ASCII_CODE_TO_LOWER_CASE(c):(c));
}
That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.
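To make the rule concrete, here is a rough Ruby-level sketch of the same behaviour. It is not the C implementation quoted above; `ascii_only_upcase` is a hypothetical name, and `String#tr` is used only because it touches nothing outside the listed range.

```ruby
# Hypothetical helper mirroring the ASCII-only rule:
# characters a-z map to A-Z, every other character passes through unchanged.
def ascii_only_upcase(str)
  str.tr("a-z", "A-Z")
end

puts ascii_only_upcase("\u00FCber")   # prints "üBER": the ü is left alone
```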
Iñaki Baz C. wrote:
2009/7/29 Brian C.:
That is: only ASCII characters (potentially encoded as UTF16 or
whatever) are eligible for case conversion.
Is it the correct approach?
For me it’s very clear that the upcase version of á is Á.
There are perfectly clear Unicode rules for case conversion, but they
are not simple. In some cases you need to replace one character by two
(e.g. ß to SS)
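As a sketch of why this is not a simple per-character swap, the following uses a tiny hand-written table (hypothetical, and nowhere near the full Unicode data) in which one entry expands to two characters. `UPCASE_SKETCH` and `upcase_sketch` are made-up names for illustration.

```ruby
# Hypothetical table with a handful of mappings; real Unicode case data is far larger.
UPCASE_SKETCH = {
  "\u00DF" => "SS",      # ß expands to two characters
  "\u00FC" => "\u00DC",  # ü -> Ü
  "\u00E1" => "\u00C1",  # á -> Á
}

# Look each character up in the table, falling back to the ASCII-only rule.
def upcase_sketch(str)
  str.each_char.map { |ch| UPCASE_SKETCH.fetch(ch) { ch.tr("a-z", "A-Z") } }.join
end

puts upcase_sketch("stra\u00DFe")   # prints "STRASSE", one character longer than the input
```

Because the result can be longer than the input, an in-place, character-for-character conversion is not enough for this case.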
There is a useful discussion about this from Python’s point of view
here:
Hi,
In message “Re: upcase in 1.9.2-preview1”
on Thu, 30 Jul 2009 02:13:41 +0900, Iñaki Baz C.
writes:
|> That is: only ASCII characters (potentially encoded as UTF16 or
|> whatever) are eligible for case conversion.
|
|Is it the correct approach?
|For me it’s very clear that the upcase version of á is Á.
But it’s locale dependent. In some languages, upper/lower case
conversion is not one-to-one mapping.
matz.
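One well-known instance of locale dependence (an assumption about the kind of case meant here) is Turkish, where "i" upcases to dotted "İ" (U+0130) and dotless "ı" (U+0131) upcases to "I". A minimal sketch with a hypothetical per-locale override table; `LOCALE_UPCASE` and `locale_upcase` are invented names:

```ruby
# Hypothetical per-locale overrides; only the Turkish i pair is shown.
LOCALE_UPCASE = {
  tr: { "i" => "\u0130", "\u0131" => "I" },  # i -> İ, ı -> I
}

# Apply the locale's overrides first, then fall back to the ASCII-only rule.
def locale_upcase(str, locale = nil)
  overrides = LOCALE_UPCASE.fetch(locale, {})
  str.each_char.map { |ch| overrides.fetch(ch) { ch.tr("a-z", "A-Z") } }.join
end

puts locale_upcase("kilim")        # prints "KILIM"
puts locale_upcase("kilim", :tr)   # prints "KİLİM"
```

The same input string gives two different results, so the caller has to say which rules apply.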
Yukihiro M. wrote:
But it’s locale dependent. In some languages, upper/lower case
conversion is not one-to-one mapping.
99% of the time it’s locale independent. I don’t follow this logic of
“if we can’t make it work for everyone then it should stay broken for
everyone”. Following the Unicode rules would fix uppercasing for 95% of
those not using English. And for the 5% of those who have to deal with
those asymmetric upper/lower case rules… well, they’d have to deal
with it either way.
*percentages above are guesswork