Puzzling regex behaviour

ismoore · February 15, 2007, 4:40am

On Feb 14, 2007, at 3:10 PM, Ian M. wrote:

\265\272\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316

Is it the same for you?

Ian

Ian M. | When a man is tired of London, he is tired
[email protected] | of life. – Samuel Johnson
http://www.caliban.org/ |

Yes, the LANG is affecting the result in irb, but not ruby.

$ irb -v
irb 0.9.5(05/04/13)

Whether the irb behavior is “correct” or anomalous is probably a
question for the maintainers to debate. The man page for ctype(3)
(on my Mac OS X 10.4.8) indicates that the macros are supposed to be
based on the locale and my copy of the pickaxe (p.71) says that the
character classes are based on the ctype macros of the same name.
However, a quick C program shows effectively the same behavior as
ruby (i.e., only the [0-9A-Za-z] satisfy isalnum() even for nl_NL).
I’m now more curious as to how irb is finding the character classes.
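
For comparison with the C check, here is a rough Ruby sketch that asks the regex engine itself which single-byte values it accepts as [[:alnum:]]. The variable names are mine, not from any library. On a current Ruby, where one-byte strings built with pack("C") are tagged as binary, I would expect only 0-9, A-Z and a-z to come back. On the 1.8 irb discussed here the list may well differ depending on LANG, which is what makes it a useful thing to run in both places.

```ruby
# List every single-byte value that the regex engine accepts as [[:alnum:]].
# Each byte is wrapped in a one-byte string via Array#pack("C").
alnum_bytes = (0..255).select do |byte|
  [byte].pack("C") =~ /[[:alnum:]]/
end

# Pack the matching bytes back into one string; inspect keeps any
# non-printable or high bytes visible as escapes.
alnum_chars = alnum_bytes.pack("C*")
puts "#{alnum_bytes.size} bytes match: #{alnum_chars.inspect}"
```

Running the same lines under plain ruby, irb, and irb --noreadline and diffing the output should show exactly which bytes change class.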

-Rob

Rob B. http://agileconsultingllc.com
[email protected]

ismoore · February 15, 2007, 4:41pm

On 15.02.2007 16:19, Ian M. wrote:

based on the locale and my copy of the pickaxe (p.71) says that the
character classes are based on the ctype macros of the same name.
However, a quick C program shows effectively the same behavior as
ruby (i.e., only the [0-9A-Za-z] satisfy isalnum() even for nl_NL).
I’m now more curious as to how irb is finding the character classes.

It turns out that the poster who mentioned possible interference from
the readline(3) library was right.

That was me.

=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> 2

This is very unexpected and undesirable behaviour and, as such,
probably qualifies as a bug.

Yeah, seems so. Unless it’s documented behavior.

Interestingly, adding “require ‘readline’” to the stand-alone script
does not introduce this behaviour, so it must be something to do with
the initialisation that irb does.

It’s really strange as both print the same output. How about doing this,
just to be sure that both strings contain the same sequence of bytes:

require 'enumerator'
foo.to_enum(:each_byte).to_a.join(", ")
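
On newer Rubies the same dump can be had without the enumerator library. A minimal sketch, assuming Ruby 2.0 or later (for String#b) and assuming the string holds the Latin-1 bytes that irb echoed:

```ruby
# Octal escape \351 is byte 233, the Latin-1 code for "é".
# String#b returns a copy tagged as binary, so the raw bytes are kept as-is.
foo = "pr\351f\351r\351es".b

# String#bytes returns the byte values as an array of integers.
byte_dump = foo.bytes.join(", ")
puts byte_dump
```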

Kind regards

robert

ismoore · February 15, 2007, 4:19pm

On Thu 15 Feb 2007 at 12:39:21 +0900, Rob B. wrote:

However, a quick C program shows effectively the same behavior as
ruby (i.e., only the [0-9A-Za-z] satisfy isalnum() even for nl_NL).
I’m now more curious as to how irb is finding the character classes.

It turns out that the poster who mentioned possible interference from
the readline(3) library was right.

Look at this:

$ irb
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> nil

$ irb --noreadline
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> 2

This is very unexpected and undesirable behaviour and, as such,
probably qualifies as a bug.

Interestingly, adding “require ‘readline’” to the stand-alone script
does not introduce this behaviour, so it must be something to do with
the initialisation that irb does.

Ian

ismoore · February 16, 2007, 10:31am

On Fri 16 Feb 2007 at 00:40:08 +0900, Robert K. wrote:

$ irb --noreadline
Interestingly, adding “require ‘readline’” to the stand-alone script
does not introduce this behaviour, so it must be something to do with
the initialisation that irb does.

It’s really strange as both print the same output.

You mean that both of them show foo to contain the same string of bytes?

How about doing this, just to be sure that both strings contain the same
sequence of bytes:

require 'enumerator'
foo.to_enum(:each_byte).to_a.join(", ")

In both cases:

=> "112, 114, 233, 102, 233, 114, 233, 101, 115"

Somehow, it is the regex that is being handled differently, not the
string.
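
For anyone reading this on a current Ruby: the symptom (identical bytes, different answer from /[^[:alnum:]]/) can be reproduced there too, though by a different route. This is only an analogy, not a claim about what irb or readline did in this thread; the Ruby used here has no per-string encodings. The sketch assumes Ruby 2.0 or later, and the variable names are just illustrative.

```ruby
# "\u00E9" is "é"; the \u escape makes this literal UTF-8.
utf8   = "pr\u00E9f\u00E9r\u00E9es"
binary = utf8.b   # same bytes, but tagged as binary (ASCII-8BIT)

same_bytes   = utf8.bytes == binary.bytes
utf8_match   = utf8   =~ /[^[:alnum:]]/   # character class resolved for UTF-8
binary_match = binary =~ /[^[:alnum:]]/   # character class resolved for raw bytes

puts "same bytes: #{same_bytes}"
p utf8_match
p binary_match
```

Here the string data is byte-for-byte the same and only the rule used to classify bytes changes, which is the same shape as the problem above: the string is fine, the character class is what moves.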

Ian