Regexp problem /[éê]/ || /é|ê/

jonatas · March 6, 2009, 1:33pm

Hi, I have a problem trying to replace accented characters, like this:

irb
irb(main):001:0>
irb(main):002:0* name = "Fênix"
=> "F\303\252nix"
irb(main):003:0> name.gsub(/[éê]/,'e')
=> "Feenix"
irb(main):004:0> name.gsub(/é|ê/,'e')
=> "Fenix"

What’s the difference between /[éê]/ and /é|ê/ ?

ps: ruby -v
ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux]

jonatas · March 6, 2009, 4:37pm

What’s the difference between /[éê]/ and /é|ê/ ?

In that context there shouldn’t be any difference

If the source is in UTF-8, then Ruby 1.8 interprets [éê] as a character
class of 4 single bytes: [195, 169, 195, 170]

Fênix is seen as:
[70, 195, 170, 110, 105, 120]

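Bytes 195 and 170 each match the class and are each replaced with "e",
hence Feenix. The alternation /é|ê/ only matches the whole two-byte
sequences 195 169 or 195 170, so ê is replaced once, hence Fenix.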
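
The byte-level explanation above can be reproduced with a small sketch. It does not call the Ruby 1.8 regex engine; it simulates byte-wise matching with arrays of byte values, and the variable names are only illustrative. It assumes the file is saved as UTF-8.

```ruby
# encoding: utf-8
name = "Fênix"

# Raw bytes of the string and of the characters inside the class.
name_bytes  = name.unpack("C*")        # [70, 195, 170, 110, 105, 120]
class_bytes = "éê".unpack("C*").uniq   # [195, 169, 170]

# Byte-wise character class: every single byte found in the class
# is replaced by "e" (byte 101), so the two bytes of ê become "ee".
bytewise = name_bytes.map { |b| class_bytes.include?(b) ? 101 : b }.pack("C*")

# Alternation: each branch is a whole two-byte sequence, so the
# pair 195 170 is replaced once.
raw     = name_bytes.pack("C*")
e_acute = [195, 169].pack("C*")
e_circ  = [195, 170].pack("C*")
sequencewise = raw.gsub(e_acute, "e").gsub(e_circ, "e")

puts bytewise       # Feenix
puts sequencewise   # Fenix
```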

jonatas · March 6, 2009, 3:15pm

On Fri, Mar 6, 2009 at 6:02 PM, Jonatas P. wrote:

Hi, I have a problem trying to replace accented characters, like this:

irb(main):002:0* name = "Fênix"
=> "F\303\252nix"
irb(main):003:0> name.gsub(/[éê]/,'e')
=> "Feenix"
irb(main):004:0> name.gsub(/é|ê/,'e')
=> "Fenix"

Looks to me like an encoding problem. What source encoding are you
working in?

If you set $KCODE = 'UTF-8' or append /u to the regex literals, does it
resolve the inconsistency?

What’s the difference between /[éê]/ and /é|ê/ ?

In that context there shouldn’t be any difference. The union, |, can
be used for patterns longer than a single character, but the specific
patterns above look equivalent to me. But if the encoding isn’t set
appropriately all bets are off!
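
As a minimal sketch of the suggestion above, the /u flag tells the regex to treat the pattern as UTF-8, so the character class and the alternation agree. The "São Paulo" line is an added illustration of an alternation with a branch longer than one character; it is not from the original post. The file is assumed to be saved as UTF-8. On Ruby 1.8, setting $KCODE = 'UTF-8' applies the same behaviour to every regex; on 1.9 and later $KCODE no longer has any effect, so it is left out here.

```ruby
# encoding: utf-8
name = "Fênix"

# With /u both forms see é and ê as whole characters.
with_class = name.gsub(/[éê]/u, "e")
with_union = name.gsub(/é|ê/u, "e")

# Alternation can also hold branches longer than one character,
# which a character class cannot express.
longer = "São Paulo".gsub(/ão|ê/u, "ao")

puts with_class   # Fenix
puts with_union   # Fenix
puts longer       # Sao Paulo
```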

ps: ruby -v
ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux]

ps: the unicode support has apparently been much improved in 1.9.

Cheers,
lasitha

jonatas · March 6, 2009, 6:29pm

If you set $KCODE = 'UTF-8' or append /u to the regex literals, does it
resolve the inconsistency?

It WORKS! Both setting $KCODE and using /u fix it.

interesting!!!

Thanks VERY MUCH!