Puzzling regex behaviour

Ian Macdonald (ianmacd)
on 2007-02-13 21:20
(Received via mailing list)
Hello,

Can anyone explain this to me?

$ echo $LANG
nl_NL
$ irb -f
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> nil
irb(main):003:0> foo =~ /\W/
=> 2

First question: Why does the final statement return 2 instead of nil?
All characters in foo are alphabetic characters in this locale.

Then:

$ echo $LANG
nl_NL
$ cat ./foo
#!/usr/bin/ruby -w

foo = "préférées"
p foo =~ /[^[:alnum:]]/
p foo =~ /\W/
$ ./foo
2
2

Huh?

Second question: Why does the first regex match now return 2 instead of
nil?

To my way of thinking, both statements should always return nil, whether
or not they are typed into irb or run in a stand-alone script. At the
very least, both statements should return the same answer, regardless of
the context.

What am I missing here?

Ian
Robert Klemme (Guest)
on 2007-02-13 22:46
(Received via mailing list)
On 13.02.2007 21:19, Ian Macdonald wrote:
> => nil
> $ cat ./foo
>
> Second question: Why does the first regex match now return 2 instead of
> nil?
>
> To my way of thinking, both statements should always return nil, whether
> or not they are typed into irb or run in a stand-alone script. At the
> very least, both statements should return the same answer, regardless of
> the context.
>
> What am I missing here?

Maybe there is an initialization in .irbrc that leads to a changed
locale inside IRB.  Or your IRB belongs to a different Ruby version on
that system.

Other than that, I guess you tripped into the wide and wild country of
i18n - many strange things can be found there.  Maybe \w and \W only
treat ASCII [a-z] characters as word characters.

Kind regards

  robert
Ian Macdonald (ianmacd)
on 2007-02-13 23:08
(Received via mailing list)
On Wed 14 Feb 2007 at 06:45:08 +0900, Robert Klemme wrote:

> Maybe there is an initialization in .irbrc that leads to a changed
> locale inside IRB.

Nope; I had hoped it would be that easy, but as you can see from my
snippet of output, I started irb with -f, which bypasses ~/.irbrc.
ENV['LANG'] also prints nl_NL in irb, so that can't be it.

> Or your IRB belongs to a different Ruby version on that system.

I compiled it myself, so there has been no mix-and-matching.

> Other than that, I guess you tripped into the wide and wild country of
> i18n - many strange things can be found there.  Maybe \w and \W only
> treat ASCII [a-z] characters as word characters.

It does seem that way, as Perl also appears to treat them this way.
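
For readers on a current Ruby, here is a minimal sketch of the same two patterns. It assumes Ruby 1.9 or later, where strings carry an encoding and regex matching no longer depends on the process locale, so it does not reproduce the 1.8 behaviour discussed in this thread. The helper name first_match is made up for the illustration.

```ruby
# "préférées", written with \u escapes so the source file's own
# encoding does not matter. The literal is a UTF-8 string.
WORD = "pr\u00e9f\u00e9r\u00e9es"

# Hypothetical helper: index of the first match, or nil if none.
def first_match(pattern, string = WORD)
  string =~ pattern
end

p first_match(/[^[:alnum:]]/)  # POSIX bracket, Unicode-aware on UTF-8 strings
p first_match(/\W/)            # \W means "not in [a-zA-Z0-9_]"
p first_match(/[^\p{Word}]/)   # Unicode-aware counterpart of \W
```

On such a Ruby the three calls print nil, 2 and nil: \w and \W stay ASCII-only, while the POSIX bracket and \p{Word} treat é as a letter.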

However, I'm still puzzled why there's a difference between irb and a
stand-alone script.

Ian
David Balmain (Guest)
on 2007-02-13 23:53
(Received via mailing list)
On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
> However, I'm still puzzled why there's a difference between irb and a
> stand-alone script.

Maybe your editor saves the script in UTF-8 format. The irb example
clearly encodes the string in ISO-8859-1. That could explain the
difference.
David Balmain (Guest)
on 2007-02-14 00:02
(Received via mailing list)
On 2/14/07, David Balmain <dbalmain.ml@gmail.com> wrote:
> On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
> > However, I'm still puzzled why there's a difference between irb and a
> > stand-alone script.
>
> Maybe your editor saves the script in UTF-8 format. The irb example
> clearly encodes the string in ISO-8859-1. That could explain the
> difference.

For example;

~$ echo $LANG
en_US.ISO-8859-1
~$ irb -f
irb(main):001:0> "pr\351f\351r\351es" =~ /[^[:alnum:]]/
=> nil
irb(main):002:0> "pr\303\251f\303\251r\303\251es" =~ /[^[:alnum:]]/
=> 3

Not exactly what you had, but it probably has something to do with the
encoding of the é.
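
The two byte sequences above can be regenerated on a current Ruby. This is only a sketch, assuming Ruby 1.9 or later; octal_dump is a hypothetical helper that imitates the octal escapes Ruby 1.8 printed, and the last line shows one plausible reading of the "=> 3" result, namely UTF-8 bytes being interpreted as Latin-1.

```ruby
WORD_UTF8   = "pr\u00e9f\u00e9r\u00e9es"
WORD_LATIN1 = WORD_UTF8.encode(Encoding::ISO_8859_1)

# Hypothetical helper: bytes below 128 are emitted as-is, all other
# bytes as octal escapes, similar to Ruby 1.8's String#inspect.
def octal_dump(string)
  string.bytes.map { |b| b < 128 ? b.chr : format("\\%o", b) }.join
end

puts octal_dump(WORD_LATIN1)  # one byte per é
puts octal_dump(WORD_UTF8)    # two bytes per é

# The UTF-8 bytes misread as Latin-1: each é turns into "Ã" followed by "©".
MISREAD = WORD_UTF8.b.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
p(MISREAD =~ /[^[:alnum:]]/)
```

The dumps come out as pr\351f\351r\351es and pr\303\251f\303\251r\303\251es. The final match lands on index 3 because "Ã" counts as a letter while "©" does not.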
Ian Macdonald (ianmacd)
on 2007-02-14 00:44
(Received via mailing list)
On Wed 14 Feb 2007 at 08:01:15 +0900, David Balmain wrote:

>
> ~$ echo $LANG
> en_US.ISO-8859-1
> ~$ irb -f
> irb(main):001:0> "pr\351f\351r\351es" =~ /[^[:alnum:]]/
> => nil
> irb(main):002:0> "pr\303\251f\303\251r\303\251es" =~ /[^[:alnum:]]/
> => 3
>
> Not exactly what you had but it probably has something to do with the
> encoding of the é.

My editor is vim and I run it in the nl_NL locale, so it doesn't start
in UTF-8 mode. To double-check:

:set encoding?
  encoding=latin1

And if we dump my little script:

$ od -c foo
0000000   #   !   /   u   s   r   /   b   i   n   /   r   u   b   y
0000020   -   w  \n  \n   f   o   o       =       "   p   r 351   f 351
0000040   r 351   e   s   "  \n   p       f   o   o       =   ~       /
0000060   [   ^   [   :   a   l   n   u   m   :   ]   ]   /  \n   p
0000100   f   o   o       =   ~       /   \   W   /  \n

You can see that it is, indeed, saved as Latin-1, not UTF-8.
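
The same check can be made from Ruby instead of od. A small sketch, assuming Ruby 2.0 or later (for String#b); valid_utf8? is a hypothetical helper and the two constants simply spell out the byte sequences seen in this thread.

```ruby
# Raw bytes as shown by od -c above, and the UTF-8 form for comparison.
LATIN1_BYTES = "pr\xE9f\xE9r\xE9es".b
UTF8_BYTES   = "pr\xC3\xA9f\xC3\xA9r\xC3\xA9es".b

# Hypothetical helper: true if the bytes form well-formed UTF-8.
# Works on a copy so the caller's string keeps its encoding.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

p valid_utf8?(LATIN1_BYTES)  # a lone 0xE9 followed by "f" is not valid UTF-8
p valid_utf8?(UTF8_BYTES)
```

A file whose contents fail this check cannot have been saved as UTF-8, which agrees with the dump.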

The mystery continues. ;-)

Ian
Ian Macdonald (ianmacd)
on 2007-02-14 01:00
(Received via mailing list)
On Wed 14 Feb 2007 at 08:43:06 +0900, Ian Macdonald wrote:

> The mystery continues. ;-)

I should have asked by now, but can anyone else reproduce this with
Ruby 1.8.5?

Ian
David Balmain (Guest)
on 2007-02-14 01:09
(Received via mailing list)
On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
> On Wed 14 Feb 2007 at 08:43:06 +0900, Ian Macdonald wrote:
>
> > The mystery continues. ;-)
>
> I should have asked by now, but can anyone else reproduce this with
> Ruby 1.8.5?

I can reproduce this with 1.8.4.
Ian Macdonald (ianmacd)
on 2007-02-14 01:14
(Received via mailing list)
On Wed 14 Feb 2007 at 09:08:17 +0900, David Balmain wrote:

> On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
> >
> >I should have asked by now, but can anyone else reproduce this with
> >Ruby 1.8.5?
>
> I can reproduce this with 1.8.4.

Just to be clear, you are confirming that the following code:

foo = "préférées"
p foo =~ /[^[:alnum:]]/

prints nil in irb and 2 in a stand-alone script when in both cases your
locale is preset to nl_NL?

Ian
David Balmain (Guest)
on 2007-02-14 01:19
(Received via mailing list)
On 2/14/07, Ian Macdonald <ian@caliban.org> wrote:
>
> foo = "préférées"
> p foo =~ /[^[:alnum:]]/
>
> prints nil in irb and 2 in a stand-alone script when in both cases your
> locale is preset to nl_NL?

Not nl_NL but en_US.ISO-8859-1. I get the same results as you.
Rob Biedenharn (Guest)
on 2007-02-14 02:26
(Received via mailing list)
On Feb 13, 2007, at 7:13 PM, Ian Macdonald wrote:

>
> ian@caliban.org             |
> http://www.caliban.org/     |

I'm beginning to wonder if the original question is even accurate.
Doing nothing more than changing the encoding and re-saving the file
(where the value for foo was a cut-n-paste from the email), there
doesn't seem to be any discrepancy between ruby and irb.  (This
output is from ruby 1.8.5, but 1.8.2 was the same)

rab:code/ruby $ file regexp_and_alnum_versus_w.rb
regexp_and_alnum_versus_w.rb: ISO-8859 text
rab:code/ruby $ cat regexp_and_alnum_versus_w.rb
foo = "pr?f?r?es"
alnum = /[^[:alnum:]]/
dubya = /\W/

puts "foo\n  => #{foo.inspect}"
[ alnum, dubya ].each do |re|
   puts "foo =~ #{re}\n  => #{foo =~ re}"
end
rab:code/ruby $ ruby regexp_and_alnum_versus_w.rb
foo
   => "pr\351f\351r\351es"
foo =~ (?-mix:[^[:alnum:]])
   => 2
foo =~ (?-mix:\W)
   => 2
rab:code/ruby $ irb -r regexp_and_alnum_versus_w.rb
foo
   => "pr\351f\351r\351es"
foo =~ (?-mix:[^[:alnum:]])
   => 2
foo =~ (?-mix:\W)
   => 2
 >> eixt
NameError: undefined local variable or method `eixt' for main:Object
         from (irb):1
 >> exit
rab:code/ruby $ file regexp_and_alnum_versus_w.rb
regexp_and_alnum_versus_w.rb: UTF-8 Unicode text
rab:code/ruby $ cat regexp_and_alnum_versus_w.rb
foo = "préférées"
alnum = /[^[:alnum:]]/
dubya = /\W/

puts "foo\n  => #{foo.inspect}"
[ alnum, dubya ].each do |re|
   puts "foo =~ #{re}\n  => #{foo =~ re}"
end
rab:code/ruby $ ruby regexp_and_alnum_versus_w.rb
foo
   => "pr\303\251f\303\251r\303\251es"
foo =~ (?-mix:[^[:alnum:]])
   => 2
foo =~ (?-mix:\W)
   => 2
rab:code/ruby $ irb -r regexp_and_alnum_versus_w.rb
foo
   => "pr\303\251f\303\251r\303\251es"
foo =~ (?-mix:[^[:alnum:]])
   => 2
foo =~ (?-mix:\W)
   => 2
 >> exit


-Rob

Rob Biedenharn    http://agileconsultingllc.com
Rob@AgileConsultingLLC.com
Robert Klemme (Guest)
on 2007-02-14 10:00
(Received via mailing list)
On 14.02.2007 01:13, Ian Macdonald wrote:
> p foo =~ /[^[:alnum:]]/
>
> prints nil in irb and 2 in a stand-alone script when in both cases your
> locale is preset to nl_NL?

Another idea: maybe the readline lib interferes with encodings somehow
in IRB?  What happens if you invoke your script from within IRB via
"load"?

Kind regards

  robert
Ian Macdonald (ianmacd)
on 2007-02-14 15:42
(Received via mailing list)
On Wed 14 Feb 2007 at 18:00:22 +0900, Robert Klemme wrote:

> Another idea: maybe the readline lib interferes with encodings somehow
> in IRB?  What happens if you invoke your script from within IRB via "load"?

It runs as if run from the command line:

irb(main):001:0> load 'foo'
2
2

Ian
Ian Macdonald (ianmacd)
on 2007-02-14 16:18
(Received via mailing list)
On Wed 14 Feb 2007 at 10:25:10 +0900, Rob Biedenharn wrote:

> alnum = /[^[:alnum:]]/
>   => 2
What is your locale? I strongly suspect it's either unset or set to C.
In those cases, I get the same results as you.

If you use en_US or nl_NL, you'll find (or at least, I find) that
'foo =~ /[^[:alnum:]]/' returns nil in irb and 2 from a stand-alone
script.

In fact, even irb returns a different value when it reads the same
statements from standard input instead of interactively. This is bizarre:

$ irb -f < foo2
foo = "préférées"
"pr\351f\351r\351es"
foo =~ /[^[:alnum:]]/
2
foo =~ /\W/
2

$ irb
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> nil
irb(main):003:0> foo =~ /\W/
=> 2

As you can see, interactively irb returns nil for that first regex
match.

Ian
Gavin Kistner (phrogz)
on 2007-02-14 16:40
(Received via mailing list)
On Feb 14, 8:17 am, Ian Macdonald <i...@caliban.org> wrote:
> As you can see, interactively irb returns nil for that first regex match.

Aren't you making the assumption that it's the regex at fault here,
and not the encoding of the string when you enter it in irb?

What if you do:

gavinkistner$ cat set_foo.rb
  $foo = "préférés"

gavinkistner$ irb
irb(main):001:0> load 'set_foo.rb'
=> true
irb(main):001:0> $foo =~ /[^[:alnum:]]/
???
Ian Macdonald (ianmacd)
on 2007-02-14 16:49
(Received via mailing list)
On Thu 15 Feb 2007 at 00:40:09 +0900, Phrogz wrote:

>
> gavinkistner$ irb
> irb(main):001:0> load 'set_foo.rb'
> => true
> irb(main):001:0> $foo =~ /[^[:alnum:]]/
> ???

$ irb
irb(main):001:0> load 'foo'
=> true
irb(main):002:0> $foo
=> "pr\351f\351r\351es"
irb(main):003:0> $foo =~ /[^[:alnum:]]/
=> nil

It's still nil, I'm afraid.

Ian
Ian Macdonald (ianmacd)
on 2007-02-14 16:52
(Received via mailing list)
On Wed 14 Feb 2007 at 23:42:10 +0900, Ian Macdonald wrote:

> On Wed 14 Feb 2007 at 18:00:22 +0900, Robert Klemme wrote:
>
> > Another idea: maybe the readline lib interferes with encodings somehow
> > in IRB?  What happens if you invoke your script from within IRB via "load"?
>
> It runs as if run from the command line:
>
> irb(main):001:0> load 'foo'
> 2
> 2

I beg your pardon. I must have had the locale set incorrectly on that
run. It runs as if typed interactively into irb:

$ irb
irb(main):001:0> load 'foo'
nil
2

Ian
Gavin Kistner (phrogz)
on 2007-02-14 18:06
(Received via mailing list)
On Feb 14, 8:51 am, Ian Macdonald <i...@caliban.org> wrote:
> > 2
> > 2
>
> I beg your pardon. I must have had the locale set incorrectly on that
> run. It runs as if typed interactively into irb:
>
> $ irb
> irb(main):001:0> load 'foo'
> nil
> 2

Phewsh. Combined with the behavior you reported for loading a global
and then matching in IRB, I had feared the world had gone insane. At
least it's consistently weird and the regexp match is, in fact, the
culprit.
Rob Biedenharn (Guest)
on 2007-02-14 19:17
(Received via mailing list)
On Feb 14, 2007, at 12:05 PM, Phrogz wrote:

>>> It runs as if run from the command line:
>> nil
>> 2
>
> Phewsh. Combined with the behavior you reported for loading a global
> and then matching in IRB, I had feared the world had gone insane. At
> least it's consistently weird and the regexp match is, in fact, the
> culprit.

Why don't you just find out which characters are in the [:alnum:] and
\w sets?

 >> alnums = (0..0377).select {|c| c.chr =~ /[[:alnum:]]/ }.map {|c| c.chr}.join
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
 >> dubyas = (0..0377).select {|c| c.chr =~ /\w/ }.map {|c|c.chr}.join
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

$ LANG=nl_NL irb
 >> alnums = (0..0377).select {|c| c.chr =~ /[[:alnum:]]/ }.map {|c| c.chr}.join
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\252
\265\272\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316
\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336\337\340
\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361
\362\363\364\365\366\370\371\372\373\374\375\376\377"
 >> dubyas = (0..0377).select {|c| c.chr =~ /\w/ }.map {|c|c.chr}.join
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

sheesh!
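
A comparable enumeration for a current Ruby, offered as a sketch only. It assumes Ruby 1.9 or later, where the result depends on the string's encoding rather than on LANG, so it illustrates the same split between [:alnum:] and \w without reproducing the 1.8 locale effect. All constant names are made up for the example.

```ruby
# Raw single bytes: 0..127 are US-ASCII strings, 128..255 binary strings.
BINARY_ALNUMS = (0..255).select { |c| c.chr =~ /[[:alnum:]]/ }
WORD_CHARS    = (0..255).select { |c| c.chr =~ /\w/ }

# The same values read as ISO-8859-1 characters and transcoded to UTF-8,
# where the POSIX bracket is Unicode-aware.
LATIN1_ALNUMS = (0..255).select do |c|
  c.chr(Encoding::ISO_8859_1).encode(Encoding::UTF_8) =~ /[[:alnum:]]/
end

# Characters that only count as alphanumeric under the Latin-1 reading.
EXTRA = LATIN1_ALNUMS - BINARY_ALNUMS

puts BINARY_ALNUMS.map(&:chr).join
puts WORD_CHARS.map(&:chr).join
puts EXTRA.map { |c| format("\\%o", c) }.join(" ")
```

The first two lines printed are the plain ASCII sets, differing only by the underscore. The third lists the accented letters in octal, including \351 for é and excluding the multiplication and division signs (\327 and \367).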

-Rob

Rob Biedenharn    http://agileconsultingllc.com
Rob@AgileConsultingLLC.com
Ian Macdonald (ianmacd)
on 2007-02-14 21:10
(Received via mailing list)
On Thu 15 Feb 2007 at 03:16:27 +0900, Rob Biedenharn wrote:

> \362\363\364\365\366\370\371\372\373\374\375\376\377"
> >> dubyas = (0..0377).select {|c| c.chr =~ /\w/ }.map {|c|c.chr}.join
> => "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

Yes, but all this really does is indicate that the irb behaviour is
the correct one.

When I run this in a stand-alone script, I get this:

$ LANG=nl_NL ./foo
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

It's almost as if the locale isn't being propagated to the process via
the environment. But...

$ LANG=nl_NL ruby -e "puts ENV['LANG']"
nl_NL

...it _is_ being propagated.

Is it the same for you?

Ian
Rob Biedenharn (Guest)
on 2007-02-15 04:40
(Received via mailing list)
On Feb 14, 2007, at 3:10 PM, Ian Macdonald wrote:

>> \265\272\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316
>
>
> Is it the same for you?
>
> Ian
> --
> Ian Macdonald               | When a man is tired of London, he is tired
> ian@caliban.org             | of life.   -- Samuel Johnson
> http://www.caliban.org/     |

Yes, the LANG is affecting the result in irb, but not ruby.

$ irb -v
irb 0.9.5(05/04/13)

Whether the irb behavior is "correct" or anomalous is probably a
question for the maintainers to debate.  The man page for ctype(3)
(on my Mac OS X 10.4.8) indicates that the macros are supposed to be
based on the locale and my copy of the pickaxe (p.71) says that the
character classes are based on the ctype macros of the same name.
However, a quick C program shows effectively the same behavior as
ruby (i.e., only the [0-9A-Za-z] satisfy isalnum() even for nl_NL).
I'm now more curious as to how irb is finding the character classes.

-Rob

Rob Biedenharn    http://agileconsultingllc.com
Rob@AgileConsultingLLC.com
Ian Macdonald (ianmacd)
on 2007-02-15 16:19
(Received via mailing list)
On Thu 15 Feb 2007 at 12:39:21 +0900, Rob Biedenharn wrote:

> However, a quick C program shows effectively the same behavior as
> ruby (i.e., only the [0-9A-Za-z] satisfy isalnum() even for nl_NL).
> I'm now more curious as to how irb is finding the character classes.

It turns out that the poster who mentioned possible interference from
the readline(3) library was right.

Look at this:

$ irb
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> nil

$ irb --noreadline
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> 2

This is _very_ unexpected and undesirable behaviour and, as such,
probably qualifies as a bug.

Interestingly, adding "require 'readline'" to the stand-alone script
does _not_ introduce this behaviour, so it must be something to do with
the initialisation that irb does.

Ian
Robert Klemme (Guest)
on 2007-02-15 16:41
(Received via mailing list)
On 15.02.2007 16:19, Ian Macdonald wrote:
>> based on the locale and my copy of the pickaxe (p.71) says that the
>> character classes are based on the ctype macros of the same name.
>> However, a quick C program shows effectively the same behavior as
>> ruby (i.e., only the [0-9A-Za-z] satisfy isalnum() even for nl_NL).
>> I'm now more curious as to how irb is finding the character classes.
>
> It turns out that the poster who mentioned possible interference from
> the readline(3) library was right.

That was me. :-)

> => "pr\351f\351r\351es"
> irb(main):002:0> foo =~ /[^[:alnum:]]/
> => 2
>
> This is _very_ unexpected and undesirable behaviour and, as such,
> probably qualifies as a bug.

Yeah, seems so.  Unless it's documented behavior. :-)

> Interestingly, adding "require 'readline'" to the stand-alone script
> does _not_ introduce this behaviour, so it must be something to do with
> the initialisation that irb does.

It's really strange as both print the same output.  How about doing this
- just to be sure that both strings contain the same sequence of bytes:

require 'enumerator'
foo.to_enum(:each_byte).to_a.join(", ")

Kind regards

  robert
Ian Macdonald (ianmacd)
on 2007-02-16 10:31
(Received via mailing list)
On Fri 16 Feb 2007 at 00:40:08 +0900, Robert Klemme wrote:

> >$ irb --noreadline
> >Interestingly, adding "require 'readline'" to the stand-alone script
> >does _not_ introduce this behaviour, so it must be something to do with
> >the initialisation that irb does.
>
> It's really strange as both print the same output.

You mean that both of them show foo to contain the same string of bytes?

> How about doing this
> - just to be sure that both strings contain the same sequence of bytes:
>
> require 'enumerator'
> foo.to_enum(:each_byte).to_a.join(", ")

In both cases:

=> "112, 114, 233, 102, 233, 114, 233, 101, 115"

Somehow, it is the regex that is being handled differently, not the
string.
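
For comparison, a sketch of the same byte check on a current Ruby, assuming Ruby 2.0 or later. There String#bytes replaces the enumerator idiom, and identical bytes can still match differently depending on the encoding attached to the string. The helpers byte_list and decode_latin1 are hypothetical names.

```ruby
# The nine bytes reported above, held as a binary string.
FOO_BYTES = "pr\xE9f\xE9r\xE9es".b

# Hypothetical helper: the byte values as a comma-separated list.
def byte_list(string)
  string.bytes.join(", ")
end

# Hypothetical helper: read the bytes as ISO-8859-1 and transcode a
# copy to UTF-8, leaving the original string untouched.
def decode_latin1(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

puts byte_list(FOO_BYTES)
p(FOO_BYTES =~ /[^[:alnum:]]/)                 # binary: 0xE9 is not alphanumeric
p(decode_latin1(FOO_BYTES) =~ /[^[:alnum:]]/)  # decoded: é is a letter
```

This prints the same byte list as above, then 2 and nil. On such a Ruby the difference comes from the string's encoding, which Ruby 1.8 strings did not have.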

Ian