Forum: Ruby regexp problem /[éê]/ || /é|ê/

Jonatas P. (Guest)
on 2009-03-06 14:33
Hi, I have a problem trying to replace accented characters like:

>irb
irb(main):001:0>
irb(main):002:0* name = "Fênix"
=> "F\303\252nix"
irb(main):003:0> name.gsub(/[éê]/,'e')
=> "Feenix"
irb(main):004:0> name.gsub(/é|ê/,'e')
=> "Fenix"

What's the difference between /[éê]/ and /é|ê/ ?

ps: ruby -v
ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux]
lasitha (Guest)
on 2009-03-06 16:15
(Received via mailing list)
On Fri, Mar 6, 2009 at 6:02 PM, Jonatas P. wrote:
> Hi, I have a problem trying to replace accented characters like:
>
> irb(main):002:0* name = "Fênix"
> => "F\303\252nix"
> irb(main):003:0> name.gsub(/[éê]/,'e')
> => "Feenix"
> irb(main):004:0> name.gsub(/é|ê/,'e')
> => "Fenix"

Looks to me like an encoding problem.  What source encoding are you
working in?

If you set $KCODE = 'UTF-8' or append /u to the regex literals does it
resolve the inconsistency?
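
For instance, a minimal sketch of the /u variant, assuming the script is saved as UTF-8 (the variable name `fixed` is only for illustration):

```ruby
# encoding: utf-8
# Assumes this file is saved as UTF-8.
name = "Fênix"

# /u asks the regexp engine to read the pattern as UTF-8, so the class
# [éê] holds two characters rather than four separate bytes.
fixed = name.gsub(/[éê]/u, 'e')
puts fixed  # prints Fenix
```

On 1.8, setting $KCODE = 'UTF-8' before the gsub call should have the same effect without the /u flag.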


> What's the difference between /[éê]/ and /é|ê/ ?

In that context there shouldn't be any difference.  The union, |, can
be used for patterns longer than a single character, but the specific
patterns above look equivalent to me.  But if the encoding isn't set
appropriately all bets are off!

> ps: ruby -v
> ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux]

PS: Unicode support has apparently been much improved in 1.9.

Cheers,
lasitha
Leo (Guest)
on 2009-03-06 17:37
(Received via mailing list)
> > What's the difference between /[éê]/ and /é|ê/ ?
>
> In that context there shouldn't be any difference

If the source is in UTF-8 and neither $KCODE nor /u is set, Ruby 1.8
interprets [éê] as a choice among 4 bytes: [195, 169, 195, 170]

Fênix is seen as:
[70, 195, 170, 110, 105, 120]

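Bytes 195 and 170 each match the class and are each replaced with "e",
hence Feenix. With /é|ê/, each alternative is a whole two-byte sequence
(195 169 or 195 170), so "ê" is replaced once, hence Fenix.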
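
The byte-wise behaviour can be imitated on Ruby 1.9 or later, which matches by character by default. This is only a sketch that mimics what 1.8 does, not what the 1.8 engine actually runs; it assumes a UTF-8 source file with precomposed é and ê, and class_bytes, name_bytes, byte_wise and char_wise are made-up names:

```ruby
# encoding: utf-8
# Assumes this file is saved as UTF-8 with precomposed accented characters.
name = "Fênix"

class_bytes = "éê".bytes.to_a   # [195, 169, 195, 170]
name_bytes  = name.bytes.to_a   # [70, 195, 170, 110, 105, 120]

# Imitate byte-wise matching: every byte that appears in the class
# is replaced by "e" on its own.
byte_wise = name_bytes.map { |b| class_bytes.include?(b) ? "e".ord : b }.pack("C*")

# Character-aware matching, as 1.9+ does by default (or 1.8 with /u).
char_wise = name.gsub(/[éê]/, "e")

puts byte_wise  # prints Feenix
puts char_wise  # prints Fenix
```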
Jonatas P. (Guest)
on 2009-03-06 19:29
>
> If you set $KCODE = 'UTF-8' or append /u to the regex literals does it
> resolve the inconsistency?

It WORKS, both setting $KCODE and using /u!

interesting!!!

Thanks VERY MUCH!