Forum: Ruby win32 ruby1.9 regexp and cyrillic string

Nikolay Khodyunya (stdcall)
on 2010-04-27 16:29
# coding: utf-8
str2 = "asdfМикимаус"
p str2.encoding                # => #<Encoding:UTF-8>
p str2.scan(/\p{Cyrillic}/)    # finds all Cyrillic characters
str2.gsub!(/\w/u, '')          # removes only Latin characters
puts str2

The question is: why does /\w/ ignore Cyrillic characters?

I have installed the latest Ruby package from http://rubyinstaller.org/.
Here is the output of ruby -v:
ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]
Roger Pack (rogerdpack)
on 2010-04-27 18:55
> str2.gsub!(/\w/u,'') #removes only latin characters

> The question is why /\w/ ignore cyrillic characters?

Are Cyrillic characters supposed to count as "word characters" (\w)?
If so, then it looks like a bug to me. Ping core.
-rp
Roger Pack (rogerdpack)
on 2010-04-27 19:20
Nikolay Khodyunya wrote:
> #coding: utf-8
> str2 = "asdfМикимаус"
> p str2.encoding #<Encoding:UTF-8>
> p str2.scan /\p{Cyrillic}/ #found all cyrillic charachters
> str2.gsub!(/\w/u,'') #removes only latin characters
> puts str2
>
> The question is why /\w/ ignore cyrillic characters?
>
> I have installed latest ruby package from http://rubyinstaller.org/.
> Here is my output of ruby -v
> ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

http://redmine.ruby-lang.org/issues/show/3181
http://redmine.ruby-lang.org/issues/show/3202

might be related. If you think the behaviour is wrong, then bring it up on core.
-rp
Caleb Clausen (Guest)
on 2010-04-28 16:21
(Received via mailing list)
On 4/27/10, Nikolay Khodyunya <nickolayho@gmail.com> wrote:
> #coding: utf-8
> str2 = "asdfМикимаус"
> p str2.encoding #<Encoding:UTF-8>
> p str2.scan /\p{Cyrillic}/ #found all cyrillic charachters
> str2.gsub!(/\w/u,'') #removes only latin characters
> puts str2
>
> The question is why /\w/ ignore cyrillic characters?

I think that \w (and similar shortcuts) is supposed to match ASCII
characters only... thus it's equivalent to [a-zA-Z0-9_].

Isn't there some kind of Unicode character class you can use?
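
For what it's worth, the Ruby Regexp documentation describes \w as [a-zA-Z0-9_], so on ASCII input it also covers digits and the underscore, not just letters. A minimal sketch of that difference, using a made-up sample string (ASCII only; as this thread shows, how \w treats non-ASCII text is less clear-cut):

```ruby
# Hypothetical sample string, ASCII only.
sample = "ab_12-cd"

# \w removes letters, digits and the underscore.
without_word = sample.gsub(/\w/, '')

# [a-zA-Z] removes letters only, so "_12-" is left behind.
without_letters = sample.gsub(/[a-zA-Z]/, '')

p without_word     # "-"
p without_letters  # "_12-"
```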
Dominic Rose (dominicr)
on 2010-05-10 19:59
Caleb Clausen wrote:
> On 4/27/10, Nikolay Khodyunya <nickolayho@gmail.com> wrote:
>> #coding: utf-8
>> str2 = "asdfМикимаус"
>> p str2.encoding #<Encoding:UTF-8>
>> p str2.scan /\p{Cyrillic}/ #found all cyrillic charachters
>> str2.gsub!(/\w/u,'') #removes only latin characters
>> puts str2
>>
>> The question is why /\w/ ignore cyrillic characters?
>
> I think that \w (and similar shortcuts) is supposed to match ASCII
> characters only... thus it's equivalent to [a-zA-Z0-9_].
>
> Isn't there some kind of unicode character class you can use?

Actually, "asdfМикимаус".gsub!(/\w/u,'') returns "" on Linux, so the
problem seems to come from the Windows package.

You can use "asdfМикимаус".gsub!(/\p{L}/,'') to remove letters, though.
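
A minimal sketch of a few Unicode-aware alternatives to \w, assuming a UTF-8 source file and a Ruby with Unicode property support in regexps (1.9 or later). The variable names are only for illustration:

```ruby
# coding: utf-8
str = "asdfМикимаус"

# Script property: matches only Cyrillic characters.
cyrillic_only = str.scan(/\p{Cyrillic}/).join

# General category Letter: matches letters in any script.
no_letters = str.gsub(/\p{L}/, '')

# POSIX bracket expression: alphabetic characters, Unicode-aware on UTF-8 strings.
no_alpha = str.gsub(/[[:alpha:]]/, '')

# Keep only the Latin part by deleting the Cyrillic letters.
latin_only = str.gsub(/\p{Cyrillic}/, '')

p cyrillic_only  # "Микимаус"
p no_letters     # ""
p no_alpha       # ""
p latin_only     # "asdf"
```

Note that gsub! returns nil when nothing was replaced, so the non-bang gsub is the safer choice when the return value is used directly.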
Roger Pack (rogerdpack)
on 2010-05-10 20:01
> Actually, "asdfМикимаус".gsub!(/\w/u,'') returns "" on Linux, so the
> problem seems to come from the Windows package.
>
> You can use "asdfМикимаус".gsub!(/\p{L}/,'') to remove letters, though.

If they're the same version, then it might be a Windows bug. Try it with
trunk, and if it still fails, submit a bug report to the tracker...
Roger Pack (rogerdpack)
on 2010-05-10 20:33
Roger Pack wrote:
>
>> Actually "asdfМикимаус".gsub!(/\w/u,'') returns "" on linux so the
>> problem is from the windows package.

Here's a copy of trunk if that would be useful:

http://rubydoc.ruby-forum.com/ruby_distros/ruby_tr...

GL.
-rp