Accents and String#tr

I wrote this method

def self.normalize_for_sorting(s)
  return nil unless s
  norm = s.downcase
  norm.tr!('ÁÉÍÓÚ', 'aeiou')
  norm.tr!('ÀÈÌÒÙ', 'aeiou')
  norm.tr!('ÄËÏÖÜ', 'aeiou')
  norm.tr!('ÂÊÎÔÛ', 'aeiou')
  norm.tr!('áéíóú', 'aeiou')
  norm.tr!('àèìòù', 'aeiou')
  norm.tr!('äëïöü', 'aeiou')
  norm.tr!('âêîôû', 'aeiou')
  norm
end

to normalize strings for sorting. This script is UTF-8, everything is
UTF-8 in my application, and $KCODE is 'u'.

But it does not work. Examples:

Andrés -> andruos
López  -> luupez
Pérez  -> puorez

I tried to “force” it with Iconv.conv('UTF-8', 'ASCII', 'aeiou') to
no avail. Any ideas?

– fxn

On Tuesday 21 February 2006 00:28, Xavier N. wrote:

norm.tr!('àèìòù', 'aeiou')
Andrés -> andruos
López  -> luupez
Pérez  -> puorez

Apparently, not all String methods were created equal:

david@chello082119107152:~$ irb
irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> require 'jcode'
=> true
irb(main):003:0> "Andrés".tr("áéíóú", "aeiou")
=> "Andres"

jcode to the rescue!
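
For anyone reading this on a newer Ruby: as far as I know, from Ruby 1.9 on strings carry their own encoding and String#tr works on characters rather than bytes, so neither $KCODE nor jcode should be needed (jcode is no longer shipped). A minimal sketch, assuming Ruby 2.0 or later, where a source file is read as UTF-8 by default:

```ruby
# Assumes Ruby 2.0+ and a UTF-8 source file; no $KCODE, no jcode.
# String#tr maps character to character here, not byte to byte.
puts "Andrés".tr("áéíóú", "aeiou")          # prints Andres
puts "López".downcase.tr("áéíóú", "aeiou")  # prints lopez
```

On the Ruby of this thread the irb session above is still the way to go.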

David V.

Xavier N. wrote:

norm.tr!('àèìòù', 'aeiou')

Andrés -> andruos
López -> luupez
Pérez -> puorez

Hi,

My guess is that the #tr method treats its arguments as strings of
bytes. Because accented characters need more than one byte in
UTF-8, #tr doesn’t do what you would expect. (It’s not even #tr’s
fault: how is it supposed to know that two bytes actually represent a
single character?)
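
That guess can be checked roughly like this. It is only a sketch, assuming Ruby 2.0 or later (for String#b) and a UTF-8 source file. bytewise_tr is a hypothetical helper, not something from this thread: it forces every string to binary so that #tr sees individual bytes, which is what I suspect happens in the original method.

```ruby
# Each accented vowel takes two bytes in UTF-8, each plain vowel one.
accented = 'áéíóú'
plain    = 'aeiou'
puts accented.bytesize  # 10 bytes for 5 characters
puts plain.bytesize     # 5 bytes for 5 characters
p 'é'.bytes             # the two bytes behind a single character

# Hypothetical helper: run #tr on binary copies, so it pairs up bytes
# instead of characters (an imitation of the suspected behaviour).
def bytewise_tr(str, from, to)
  str.b.tr(from.b, to.b)
end

p bytewise_tr('andrés', accented, plain)
```

With ten bytes on one side and five characters on the other, the pairs no longer line up, and the last call should come out as the same "andruos" reported in the original post.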

The solution is not to use #tr!, but #gsub!. It isn’t as short, but at
least it’s right ;-)

norm.gsub!('ä', 'a')
norm.gsub!('ë', 'e')

and so on…

And because that is against DRY (Don’t Repeat Yourself), I would
recommend storing the mapping as a hash:

accents = { 'ä' => 'a', 'ë' => 'e', … }
accents.each do |accent, replacement|
  norm.gsub!(accent, replacement)
end
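
Spelled out a bit further, the hash idea could look something like the sketch below. Treat it as an illustration only: accent_map is a hypothetical helper, only the 'ä' and 'ë' entries come from the snippet above, the other entries are my assumption about which characters matter, and the method is written at top level rather than as a class method. It also assumes a Ruby where gsub matches whole UTF-8 characters (1.9 or later).

```ruby
# Hypothetical helper: builds the accent => plain letter table.
# Only 'ä' and 'ë' are from the post; the rest is assumed for illustration.
def accent_map
  groups = {
    'a' => 'áàäâÁÀÄÂ',
    'e' => 'éèëêÉÈËÊ',
    'i' => 'íìïîÍÌÏÎ',
    'o' => 'óòöôÓÒÖÔ',
    'u' => 'úùüûÚÙÜÛ'
  }
  map = {}
  groups.each do |plain_letter, accented|
    accented.each_char { |c| map[c] = plain_letter }
  end
  map
end

def normalize_for_sorting(s, map = accent_map)
  return nil unless s
  # One pass over the string: every accented character is looked up in
  # the table, then the remaining letters are downcased.
  s.gsub(Regexp.union(map.keys)) { |c| map[c] }.downcase
end

puts normalize_for_sorting('Andrés')  # prints andres
puts normalize_for_sorting('López')   # prints lopez
```

Using a single gsub with a block means the string is scanned once instead of once per table entry, and adding a character is a one-line change to the table.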

Regards,
Robin S.