Win32 ruby1.9 regexp and cyrillic string

karevn · April 27, 2010, 4:29pm

#coding: utf-8
str2 = “asdfÐœÐ¸ÐºÐ¸Ð¼Ð°ÑƒÑ”
p str2.encoding #Encoding:UTF-8
p str2.scan /\p{Cyrillic}/ #found all cyrillic charachters
str2.gsub!(/\w/u,‘’) #removes only latin characters
puts str2

The question is why /\w/ ignore cyrillic characters?

I have installed latest ruby package from http://rubyinstaller.org/.
Here is my output of ruby -v
ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

karevn · April 27, 2010, 6:55pm

str2.gsub!(/\w/u, '')         # removes only the Latin characters

The question is: why does /\w/ ignore Cyrillic characters?

Are Cyrillic characters supposed to count as “word characters” (\w)?
If so, then it looks like a bug to me. Ping core.
-rp

karevn · April 27, 2010, 7:20pm

Nikolay K. wrote:

#coding: utf-8
str2 = “asdfÐœÐ¸ÐºÐ¸Ð¼Ð°ÑƒÑ”
p str2.encoding #Encoding:UTF-8
p str2.scan /\p{Cyrillic}/ #found all cyrillic charachters
str2.gsub!(/\w/u,‘’) #removes only latin characters
puts str2

The question is why /\w/ ignore cyrillic characters?

I have installed latest ruby package from http://rubyinstaller.org/.
Here is my output of ruby -v
ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

http://redmine.ruby-lang.org/issues/show/3181
http://redmine.ruby-lang.org/issues/show/3202

might be related. If you think it’s wrong then bring it up on core.
-rp

karevn · May 10, 2010, 7:59pm

Caleb C. wrote:

On 4/27/10, Nikolay K. [email protected] wrote:

#coding: utf-8
str2 = “asdfÐœÐ¸ÐºÐ¸Ð¼Ð°ÑƒÑ”
p str2.encoding #Encoding:UTF-8
p str2.scan /\p{Cyrillic}/ #found all cyrillic charachters
str2.gsub!(/\w/u,‘’) #removes only latin characters
puts str2

The question is why /\w/ ignore cyrillic characters?

I think that \w (and similar shortcuts) is supposed to match ascii
characters only… thus it’s equivalent to [a-zA-Z].

Isn’t there some kind of unicode character class you can use?

Actually “asdfÐœÐ¸ÐºÐ¸Ð¼Ð°ÑƒÑ”.gsub!(/\w/u,‘’) returns “” on linux so the
problem is from the windows package.

you can use “asdfÐœÐ¸ÐºÐ¸Ð¼Ð°ÑƒÑ”.gsub!(/\p{L}/,‘’) to remove letters thought
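
For comparison, here is a minimal sketch of the Unicode-aware classes that can stand in for \w. The names SAMPLE, PATTERNS and strip_matches are made up for this illustration, and the sample string is spelled with \u escapes so that it does not depend on the encoding of the file. What \w leaves behind depends on the Ruby version and build, which is exactly what this thread is about, so nothing is claimed for that row.

```ruby
# coding: utf-8
# "asdf" followed by the Cyrillic word from the thread, spelled with \u
# escapes so the sample does not depend on how this file is saved.
SAMPLE = "asdf\u041C\u0438\u043A\u0438\u043C\u0430\u0443\u0441"

# Hypothetical helper: returns what is left of `text` once every match of
# `pattern` has been removed. Uses gsub (not gsub!) so SAMPLE is untouched.
def strip_matches(text, pattern)
  text.gsub(pattern, '')
end

# The /u flag pins each regexp to UTF-8, as in the original post.
PATTERNS = {
  '\w'           => /\w/u,            # result varies by Ruby version/build
  '\p{L}'        => /\p{L}/u,         # any Unicode letter
  '\p{Word}'     => /\p{Word}/u,      # Unicode letters, marks, numbers, connectors
  '[[:alpha:]]'  => /[[:alpha:]]/u,   # POSIX bracket class for alphabetic characters
  '\p{Cyrillic}' => /\p{Cyrillic}/u   # Cyrillic script only
}

PATTERNS.each do |label, pattern|
  puts "#{label.ljust(13)} leaves #{strip_matches(SAMPLE, pattern).inspect}"
end
```

On a build where the property classes behave as documented, \p{Cyrillic} should leave only "asdf", while \p{L}, \p{Word} and [[:alpha:]] should leave an empty string.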

karevn · May 10, 2010, 8:01pm

Actually “asdfÐœÐ¸ÐºÐ¸Ð¼Ð°ÑƒÑ”.gsub!(/\w/u,’’) returns “” on linux so the
problem is from the windows package.

you can use “asdfÐœÐ¸ÐºÐ¸Ð¼Ð°ÑƒÑ”.gsub!(/\p{L}/,’’) to remove letters thought

If they’re the same version then it might be a window bug. Try it with
trunk and if it still fails then submit a bug report to the tracker…
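
When comparing the two platforms or filing the report, it helps to show the interpreter, the platform and both encodings next to the result. A rough sketch, assuming a hypothetical helper named regexp_report (it is not part of Ruby or of any tracker tooling):

```ruby
# coding: utf-8
# Hypothetical helper: gathers the facts worth pasting into a bug report
# about a regexp that behaves differently across platforms.
def regexp_report(text, pattern)
  {
    :ruby_version    => RUBY_VERSION,
    :platform        => RUBY_PLATFORM,
    :string_encoding => text.encoding.name,
    :regexp_encoding => pattern.encoding.name,
    :matches         => text.scan(pattern)
  }
end

# Same string as in the thread, spelled with \u escapes so the report
# does not depend on the encoding of this file.
sample = "asdf\u041C\u0438\u043A\u0438\u043C\u0430\u0443\u0441"

report = regexp_report(sample, /\w/u)
report.each { |key, value| puts "#{key}: #{value.inspect}" }
```

Running the same script on the Windows and Linux installations and diffing the two outputs shows whether only the matches differ or whether the encodings differ as well.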

karevn · April 28, 2010, 4:21pm

On 4/27/10, Nikolay K. [email protected] wrote:

#coding: utf-8
str2 = “asdfÐœÐ¸ÐºÐ¸Ð¼Ð°ÑƒÑ”
p str2.encoding #Encoding:UTF-8
p str2.scan /\p{Cyrillic}/ #found all cyrillic charachters
str2.gsub!(/\w/u,‘’) #removes only latin characters
puts str2

The question is why /\w/ ignore cyrillic characters?

I think that \w (and similar shortcuts) is supposed to match ascii
characters only… thus it’s equivalent to [a-zA-Z].

Isn’t there some kind of unicode character class you can use?

karevn · May 10, 2010, 8:33pm

Roger P. wrote:

Actually “asdfÐœÐ¸ÐºÐ¸Ð¼Ð°ÑƒÑ”.gsub!(/\w/u,‘’) returns “” on linux so the
problem is from the windows package.

Here’s a copy of trunk if that would be useful:

http://rubydoc.ruby-forum.com/ruby_distros/ruby_trunk_no_patches_installed.7z

GL.
-rp