I found that, unlike in Ruby 1.8, the word character class (\w) in Ruby 1.9
regexes does not match German umlauts (or any letters other than
ASCII). According to the Oniguruma documentation, it should match
everything from the Unicode “letter” category, which includes umlauts.
test.rb (also attached):
# encoding: utf-8
$KCODE = 'u'
s = "ü"
puts s.match(/\w/u).inspect
Result with ruby 1.8:
#<MatchData "ü">
Result with ruby 1.9.2:
nil
Is that a bug, or is there any reason behind this behavior?
After some more googling I found this bug report (I don’t know why I
didn’t catch it earlier), which states that this is the desired behavior: http://redmine.ruby-lang.org/issues/show/3181
Still, I don’t understand the motivation for making this change.
“Basically at a certain patch level of 1.9.1, \w was set to no longer
match unicode characters, because the core developers were concerned
that this was not what people expected from \w.”
Well, 1.9.2 behaving differently than 1.9.1 and 1.8 is certainly less
expected.
Apparently in 1.9 \p{Word} can be used instead of \w to match Unicode
characters; however, I did not find any documentation for this (“Word”
is not a Unicode character category).
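For anyone who wants to check this locally, here is a minimal sketch comparing \w with the alternatives on Ruby 1.9 or later. The helper name word_class_results is made up for illustration, and \p{L} and [[:alpha:]] are included only as further Unicode-aware options, not as something the bug report recommends.

```ruby
# encoding: utf-8
# Sketch for Ruby 1.9 or later. word_class_results is a hypothetical helper,
# not part of any library.
def word_class_results(str)
  {
    '\w'          => !(str =~ /\w/).nil?,          # ASCII-only in 1.9+
    '\p{Word}'    => !(str =~ /\p{Word}/).nil?,    # Unicode-aware word property
    '\p{L}'       => !(str =~ /\p{L}/).nil?,       # Unicode "Letter" category
    '[[:alpha:]]' => !(str =~ /[[:alpha:]]/).nil?  # POSIX bracket expression
  }
end

# Print whether each pattern matches the umlaut.
word_class_results("ü").each do |pattern, matched|
  puts "#{pattern}: #{matched}"
end
```

On 1.9 and later, only the \w entry should come back false for "ü".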
So it’s intended. However, if you strongly dislike this, then complain
about it, since apparently it’s surprising to a number of people.
Well, 1.9.2 behaving differently than 1.9.1 and 1.8 is certainly less
expected.
Yeah. 1.9.1 behaving differently at a different patch level is
unexpected, too.
Apparently in 1.9 \p{Word} can be used instead of \w to match Unicode
characters; however, I did not find any documentation for this (“Word”
is not a Unicode character category).
It’s odd that there’s no standard. Maybe Ruby made this up on its
own, then?
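As far as I can tell, \p{Word} is not a single Unicode general category but a combined property covering letters, marks, numbers and connector punctuation such as the underscore. A small sketch to probe that, assuming Ruby 1.9 or later (all_word_chars? is a hypothetical helper name):

```ruby
# encoding: utf-8
# Hypothetical helper: true if every character of str matches \p{Word}.
def all_word_chars?(str)
  !(str =~ /\A\p{Word}+\z/).nil?
end

puts all_word_chars?("über_5")  # letters, an underscore and a digit
puts all_word_chars?("ü-5")     # contains a hyphen, which is not a word character
```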