First question: Why does the final statement return 2 instead of nil?
All characters in foo are alphabetic characters in this locale.
Then:
$ echo $LANG
nl_NL
$ cat ./foo
#!/usr/bin/ruby -w
foo = "préférées"
p foo =~ /[^[:alnum:]]/
p foo =~ /\W/
$ ./foo
2
2
Huh?
Second question: Why does the first regex match now return 2 instead of
nil?
To my way of thinking, both statements should always return nil, whether
or not they are typed into irb or run in a stand-alone script. At the
very least, both statements should return the same answer, regardless of
the context.
What am I missing here?
Maybe there is an initialization in .irbrc that leads to a changed
locale inside IRB. Or your IRB belongs to a different Ruby version on
that system.
Other than that, I guess you tripped into the wide and wild country of
i18n - many strange things can be found there. Maybe \w and \W only
treat ASCII [a-z] characters as word characters.
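[That hypothesis is easy to probe in a modern Ruby (1.9 or later), where strings carry an encoding. The sketch below is not the 1.8 behavior under discussion, but it does reproduce the nil/2 split, assuming a UTF-8 source file:]

```ruby
# Sketch in modern Ruby (1.9+), UTF-8 source encoding.
foo = "préférées"

# [[:alnum:]] is Unicode-aware: é counts as alphanumeric,
# so the negated class finds nothing.
p foo =~ /[^[:alnum:]]/   # => nil

# \w is ASCII-only ([a-zA-Z0-9_]), so \W matches the first é,
# at character index 2.
p foo =~ /\W/             # => 2
```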
On Wed 14 Feb 2007 at 06:45:08 +0900, Robert K. wrote:
Maybe there is an initialization in .irbrc that leads to a changed
locale inside IRB.
Nope; I had hoped it would be that easy, but as you can see from my
snippet of output, I started irb with -f, which bypasses ~/.irbrc.
ENV['LANG'] also prints nl_NL in irb, so that can’t be it.
Or your IRB belongs to a different Ruby version on that system.
I compiled it myself, so there has been no mix-and-matching.
Other than that, I guess you tripped into the wide and wild country of
i18n - many strange things can be found there. Maybe \w and \W only
treat ASCII [a-z] characters as word characters.
It does seem that way; Perl appears to treat them the same way.
However, I’m still puzzled why there’s a difference between irb and a
stand-alone script.
Not exactly what you had, but it probably has something to do with the
encoding of the é.
My editor is vim and I run it in the nl_NL locale, so it doesn’t start
in UTF-8 mode. To double-check:
:set encoding?
encoding=latin1
And if we dump my little script:
$ od -c foo
0000000 # ! / u s r / b i n / r u b y
0000020 - w \n \n f o o = " p r 351 f 351
0000040 r 351 e s " \n p f o o = ~ /
0000060 [ ^ [ : a l n u m : ] ] / \n p
0000100 f o o = ~ / \ W / \n
You can see that it is, indeed, saved as Latin-1, not UTF-8.
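[For reference, the two byte representations can be inspected directly in a modern Ruby; printing the bytes in octal matches the od -c notation above:]

```ruby
# Sketch (modern Ruby, 1.9+): the same é is one byte in Latin-1
# but two bytes in UTF-8.
e_utf8   = "é"                          # source file saved as UTF-8
e_latin1 = e_utf8.encode("ISO-8859-1")  # transcode to Latin-1

p e_latin1.bytes.map { |b| format("%o", b) }  # => ["351"], as in the od dump
p e_utf8.bytes.map   { |b| format("%o", b) }  # => ["303", "251"]
```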
I’m beginning to wonder if the original question is even accurate.
Doing nothing more than changing the encoding and re-saving the file
(where the value for foo was a cut-and-paste from the email), there
doesn’t seem to be any discrepancy between ruby and irb. (This
output is from ruby 1.8.5, but 1.8.2 behaved the same.)
I beg your pardon. I must have had the locale set incorrectly on that
run. It runs as if typed interactively into irb:
$ irb
irb(main):001:0> load 'foo'
nil
2
Phewsh. Combined with the behavior you reported for loading a global
and then matching in IRB, I had feared the world had gone insane. At
least it’s consistently weird, and the regexp match is, in fact, the
culprit.
Why don’t you just find out which characters are in the [:alnum:] and
\w sets?
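[Following that suggestion, here is one way to enumerate the two sets in a modern Ruby (1.9 or later); the character sample is arbitrary, and Regexp#match? needs Ruby 2.4+:]

```ruby
# Sketch: probe a handful of characters against \w and [[:alnum:]]
# to see where the two classes disagree (UTF-8 source).
probes  = %w[a Z 9 _ - é ü ß]
classes = { '\w' => /\w/, '[[:alnum:]]' => /[[:alnum:]]/ }

probes.each do |ch|
  verdicts = classes.map { |name, re| "#{name}=#{ch.match?(re)}" }
  puts "#{ch.inspect}  #{verdicts.join('  ')}"
end
# é, ü and ß match [[:alnum:]] but not \w; _ matches \w but not [[:alnum:]]
```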