Forum: Ruby regexp problem /[éê]/ || /é|ê/

Jonatas Paganini (jonatas)
on 2009-03-06 13:33
Hi, I have a problem when trying to replace accented characters, like this:

>irb
irb(main):001:0>
irb(main):002:0* name = "Fênix"
=> "F\303\252nix"
irb(main):003:0> name.gsub(/[éê]/,'e')
=> "Feenix"
irb(main):004:0> name.gsub(/é|ê/,'e')
=> "Fenix"

What's the difference between /[éê]/ and /é|ê/ ?

ps: ruby -v
ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux]
lasitha (Guest)
on 2009-03-06 15:15
(Received via mailing list)
On Fri, Mar 6, 2009 at 6:02 PM, Jonatas Paganini <jonatasdp@gmail.com>
wrote:
> Hi, I got a problem try to replace accentuated characters like:
>
> irb(main):002:0* name = "Fênix"
> => "F\303\252nix"
> irb(main):003:0> name.gsub(/[éê]/,'e')
> => "Feenix"
> irb(main):004:0> name.gsub(/é|ê/,'e')
> => "Fenix"

Looks to me like an encoding problem.  What source encoding are you
working in?

If you set $KCODE = 'UTF-8' or append /u to the regex literals does it
resolve the inconsistency?


> What's the difference between /[éê]/ and /é|ê/ ?

In that context there shouldn't be any difference.  The union, |, can
be used for patterns longer than a single character, but the specific
patterns above look equivalent to me.  But if the encoding isn't set
appropriately all bets are off!

> ps: ruby -v
> ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux]

ps: Unicode support has apparently been much improved in 1.9.

Cheers,
lasitha
Leo (Guest)
on 2009-03-06 16:37
(Received via mailing list)
> > What's the difference between /[éê]/ and /é|ê/ ?
>
> In that context there shouldn't be any difference

If the source is in UTF-8, then Ruby 1.8 interprets [éê] as a choice
among 4 bytes: [195, 169, 195, 170]

Fênix is seen as:
[70, 195, 170, 110, 105, 120]

Bytes 195 and 170 each get replaced with "e", hence Feenix.
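
The byte-level explanation above can be reproduced on a current Ruby with a small sketch. It assumes Ruby 1.9 or later and a UTF-8 source file; the two helper names are hypothetical and only imitate what Ruby 1.8 does when neither $KCODE nor /u is set, they are not part of Ruby itself.

```ruby
# Hypothetical helper: imitates a byte-wise character class such as /[éê]/
# on Ruby 1.8, where every byte of the class is a separate alternative.
def bytewise_class_gsub(text, class_chars, replacement)
  class_bytes = class_chars.bytes
  out = text.bytes.flat_map do |b|
    class_bytes.include?(b) ? replacement.bytes : [b]
  end
  out.pack("C*").force_encoding(text.encoding)
end

# Hypothetical helper: imitates byte-wise alternation such as /é|ê/,
# where each alternative is matched as a whole byte sequence.
def bytewise_alternation_gsub(text, alternatives, replacement)
  result = alternatives.inject(text.b) do |acc, alt|
    acc.gsub(alt.b, replacement)
  end
  result.force_encoding(text.encoding)
end

name = "Fênix"
p name.bytes   # the six bytes listed above
p "éê".bytes   # the four bytes that make up the character class

puts bytewise_class_gsub(name, "éê", "e")               # Feenix
puts bytewise_alternation_gsub(name, ["é", "ê"], "e")   # Fenix
```

The class version replaces 195 and 170 separately, which yields two "e" characters, while the alternation version only matches the complete two-byte sequence.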
Jonatas Paganini (jonatas)
on 2009-03-06 18:29
>
> If you set $KCODE = 'UTF-8' or append /u to the regex literals does it
> resolve the inconsistency?

WORKS! setting $KCODE or using /u

interesting!!!

Thanks VERY MUCH!
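
For reference, the fix confirmed here can be sketched as follows. This is a minimal example, assuming a UTF-8 source file; the variable names are only illustrative. On Ruby 1.8 the global alternative is to set $KCODE = 'UTF-8' before the regexp is used, while on Ruby 1.9 and later $KCODE is no longer effective and regexp literals take the source encoding by default.

```ruby
name = "Fênix"

# The /u flag makes the regexp UTF-8 aware, so [éê] is a class of two
# characters instead of four separate bytes.
with_class = name.gsub(/[éê]/u, "e")

# Alternation gives the same result once the encoding is set.
with_alternation = name.gsub(/é|ê/u, "e")

puts with_class        # Fenix
puts with_alternation  # Fenix
```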