Forum: Ruby-core Regex matching errors when using \W character class and /i option

Posted by ben_h (Ben Hoskings) (Guest)
on 2012-12-19 00:15
(Received via mailing list)
Issue #4044 has been updated by ben_h (Ben Hoskings).


Hi all, long time no see :)

naruse (Yui NARUSE) wrote:
>  \W includes U+212A and U+00DF
>  /i adds U+006B (k) and U+0073 (s) to [\W]
>  ^ reverses the class; it doesn't include k & s.

I think I see the misunderstanding: there are multiple characters that 
render as 'k' and 's'.

K, S, k, s are basic word characters, and so [^\W] should match them 
(along with all A-Z and a-z):
0x004B (Latin capital letter K)
0x0053 (Latin capital letter S)
0x006B (Latin small letter k)
0x0073 (Latin small letter s)

But, I'm not sure how [^\W] should treat these characters:
0x00DF (Latin small letter sharp s)
0x017F (Latin small letter long s)
0x212A (Kelvin sign)
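
To keep the look-alikes apart, here is a rough sketch that prints each of 
the seven code points above with its U+XXXX form and whether it is plain 
ASCII. The labels and the codepoint helper are only names I made up for 
illustration, and ASCII-ness is just one possible criterion, not 
necessarily the one the regex engine uses:

```ruby
# -*- coding: utf-8 -*-
# The seven code points discussed above, keyed by an illustrative label.
# Escapes are used so that K (U+004B) and the Kelvin sign (U+212A) cannot
# be confused in the source.
CANDIDATES = {
  "capital K"   => "\u004B",
  "capital S"   => "\u0053",
  "small k"     => "\u006B",
  "small s"     => "\u0073",
  "sharp s"     => "\u00DF",
  "long s"      => "\u017F",
  "Kelvin sign" => "\u212A"
}

# Hypothetical helper: format a one-character string as U+XXXX.
def codepoint(ch)
  format("U+%04X", ch.ord)
end

# Print label, code point, and whether the character is plain ASCII.
CANDIDATES.each do |label, ch|
  puts [label, codepoint(ch), ch.ascii_only?].inspect
end
```

The first four entries are ASCII and the last three are not, which is the 
split between the two lists above.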


The important thing is that all the characters in A-Z (0x41-0x5A) & a-z 
(0x61-0x7A) are word characters, so [^\W] should match all of them.

Cheers,
Ben

----------------------------------------
Bug #4044: Regex matching errors when using \W character class and /i 
option
https://bugs.ruby-lang.org/issues/4044#change-34835

Author: ben_h (Ben Hoskings)
Status: Feedback
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: core
Target version: 1.9.2
ruby -v: ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]


=begin
 Hi all,

 Josh Bassett and I just discovered an issue with regex matches on 
ruby-1.9.2p0. (We reduced it while we were hacking on gemcutter.)

 The case-insensitive (/i) option combined with the non-word character 
class (\W) matches inconsistently against the alphabet. Specifically, the 
regex fails to match the letters 'k' and 's'.

 The following expression demonstrates the problem in irb:

     puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/i] ].inspect }

 As a reference, the following two expressions are working properly:

     puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/] ].inspect }
     puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[\w]/i] ].inspect }
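
 A shorter way to compare the three patterns is to list only the letters 
 each one fails to match. This is just a sketch; unmatched_letters is a 
 hypothetical helper name, not anything from core:

```ruby
# Hypothetical helper: return the letters in a-z that `pattern` does not match.
def unmatched_letters(pattern)
  ('a'..'z').reject { |c| c =~ pattern }
end

# The three patterns from the report, keyed by their source text for printing.
{
  '/[^\W]/i' => /[^\W]/i,
  '/[^\W]/'  => /[^\W]/,
  '/[\w]/i'  => /[\w]/i
}.each do |source, pattern|
  puts "#{source}: #{unmatched_letters(pattern).inspect}"
end
```

 On the 1.9.2p0 build above we would expect the first line to list k and s 
 and the other two to be empty; other versions may behave differently.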

 Cheers
 Ben Hoskings & Josh Bassett
=end