Forum: Ruby-core [ruby-trunk - Bug #7501][Open] \w in a regular expression doesn't match international characters

Posted by eltomito (Tomas Partl) (Guest)
on 2012-12-03 10:22
(Received via mailing list)
Issue #7501 has been reported by eltomito (Tomas Partl).

----------------------------------------
Bug #7501: \w in a regular expression doesn't match international 
characters
https://bugs.ruby-lang.org/issues/7501

Author: eltomito (Tomas Partl)
Status: Open
Priority: Normal
Assignee:
Category: core
Target version:
ruby -v: ruby 1.9.3p0 (2011-10-30 revision 33570) [i686-linux]


When using regexp matching, \w doesn't match characters which are not in 
the English alphabet.
For example, the characters "žščřďťňaáéíóůúý" should all be matched by 
\w but aren't.

This program demonstrates the bug:

--------------------------------------------------------
# encoding: utf-8
match = /\w+/.match( "abcdefghijklmnopqrstuvwxyz" )
puts match.to_s

match = /\w+/.match( "áéíóůúýžščřďťň" ) #some Czech characters
puts match.to_s

match = /\w+/.match( "üäö" )  #some German characters
puts match.to_s
----------------------------------------------------------

Expected output:
----------------------------------------------------------
abcdefghijklmnopqrstuvwxyz
áéíóůúýžščřďťň
üäö
Posted by charliesome (Charlie Somerville) (Guest)
on 2012-12-03 13:27
(Received via mailing list)
Issue #7501 has been updated by charliesome (Charlie Somerville).


/[[:alpha:]]+/ should behave as you expect
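
For reference, here is a minimal sketch of that suggestion applied to the strings from the report. The helper name first_word is made up for illustration, and the \p{L} pattern is an extra alternative not mentioned above; this assumes Ruby 1.9 or later with UTF-8 source encoding.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of Ruby): returns the first match as a
# String, or nil when the pattern does not match at all.
def first_word(text, pattern)
  match = pattern.match(text)
  match ? match.to_s : nil
end

czech  = "áéíóůúýžščřďťň"  # the Czech sample from the report
german = "üäö"             # the German sample from the report

ascii_word     = /\w+/           # \w only covers a-z, A-Z, 0-9 and _
posix_alpha    = /[[:alpha:]]+/  # POSIX bracket, Unicode-aware on UTF-8 strings
unicode_letter = /\p{L}+/        # Unicode general category "Letter"

puts first_word(czech, ascii_word).inspect
puts first_word(czech, posix_alpha).inspect
puts first_word(german, posix_alpha).inspect
puts first_word(german, unicode_letter).inspect
```

With \w the Czech string produces no match (nil), while both [[:alpha:]] and \p{L} return the whole string. Note that [[:alpha:]] matches letters only, so unlike \w it does not cover digits or the underscore.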
Posted by shyouhei (Shyouhei Urabe) (Guest)
on 2012-12-03 19:44
(Received via mailing list)
Issue #7501 has been updated by shyouhei (Shyouhei Urabe).

Status changed from Open to Rejected

If I remember correctly, this is intentional.  As Unicode versions 
evolve, the definition of which characters count as word characters 
changes from time to time, and it is hard for us to keep up with 
that.
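
To illustrate the difference, a small sketch (assuming Ruby 1.9 or later and a UTF-8 string; the sample text is made up) comparing the ASCII-only \w with the \p{Word} property, which is the explicitly Unicode-aware counterpart. What \p{Word} matches depends on the Unicode data bundled with the interpreter, which is the kind of moving target described above.

```ruby
# encoding: utf-8

text = "kůň_42 běží"  # made-up sample mixing Czech letters, "_" and digits

# \w is fixed to [a-zA-Z0-9_], so non-ASCII letters split the words.
ascii_words = text.scan(/\w+/)

# \p{Word} covers Unicode letters, marks, numbers and connector
# punctuation (which includes "_").
unicode_words = text.scan(/\p{Word}+/)

puts ascii_words.inspect    # => ["k", "_42", "b"]
puts unicode_words.inspect  # => ["kůň_42", "běží"]
```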