Forum: Ruby accents and String#tr

Xavier Noria (Guest)
on 2006-02-21 00:28
(Received via mailing list)
I wrote this method

   def self.normalize_for_sorting(s)
     return nil unless s
     norm = s.downcase
     norm.tr!('ÁÉÍÓÚ', 'aeiou')
     norm.tr!('ÀÈÌÒÙ', 'aeiou')
     norm.tr!('ÄËÏÖÜ', 'aeiou')
     norm.tr!('ÂÊÎÔÛ', 'aeiou')
     norm.tr!('áéíóú', 'aeiou')
     norm.tr!('àèìòù', 'aeiou')
     norm.tr!('äëïöü', 'aeiou')
     norm.tr!('âêîôû', 'aeiou')
     norm
   end

to normalize strings for sorting. This script is UTF-8, everything is
UTF-8 in my application, $KCODE is 'u'.

But it does not work, examples:

    Andrés -> andruos
    López  -> luupez
    Pérez  -> puorez

I tried to "force" it with Iconv.conv('UTF-8', 'ASCII', 'aeiou') to
no avail. Any ideas?

-- fxn
David Vallner (Guest)
on 2006-02-21 03:54
(Received via mailing list)
On Tuesday 21 February 2006 00:28, Xavier Noria wrote:
>      norm.tr!('àèìòù', 'aeiou')
>     Andrés -> andruos
>     López  -> luupez
>     Pérez  -> puorez
>
> I tried to "force" it with Iconv.conv('UTF-8', 'ASCII', 'aeiou') to
> no avail. Any ideas?
>
> -- fxn

Apparently, not all String methods were created equal:

david@chello082119107152:~$ irb
irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> require 'jcode'
=> true
irb(main):003:0> "Andrés".tr("áéíóú", "aeiou")
=> "Andres"


jcode to the rescue!
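
A note for anyone reading this on a newer Ruby: jcode and $KCODE belong to Ruby 1.8. From Ruby 1.9 on, strings carry an encoding and String#tr works on characters, so no extra require is needed. A minimal sketch, assuming Ruby 1.9 or later and a source file saved as UTF-8; the method name strip_vowel_accents is made up for illustration and covers only the vowels from the original post:

```ruby
# Hypothetical helper, assuming Ruby 1.9+ and a UTF-8 source file.
# Each of the 20 accented vowels in the first argument lines up with
# the plain vowel at the same position in the second argument.
def strip_vowel_accents(s)
  s.tr('áéíóúàèìòùäëïöüâêîôû', 'aeiouaeiouaeiouaeiou')
end

puts strip_vowel_accents('Andrés')
puts strip_vowel_accents('López')
```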

David Vallner
Robin Stocker (Guest)
on 2006-02-21 14:55
(Received via mailing list)
Xavier Noria wrote:
>     norm.tr!('àèìòù', 'aeiou')
>    Andrés -> andruos
>    López  -> luupez
>    Pérez  -> puorez
>
> I tried to "force" it with Iconv.conv('UTF-8', 'ASCII', 'aeiou') to no
> avail. Any ideas?
>
> -- fxn

Hi,

My guess is that the "tr" method treats its arguments as a string of
bytes. And because characters with accents need more than 1 byte in
UTF-8, #tr doesn't do what you would expect it to. (It's not even tr's
fault, how is it supposed to know that two bytes actually represent a
single character?)
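
The byte view also accounts for the exact output reported above. Here is a rough sketch that imitates a tr which only sees bytes. The helper name bytewise_tr is made up for illustration, and the sketch assumes a UTF-8 source file and Ruby 1.9 or later (for force_encoding):

```ruby
# Hypothetical helper: imitates a tr that works on bytes, not characters.
def bytewise_tr(str, from, to)
  to_bytes = to.bytes.to_a
  table = {}
  from.bytes.each_with_index do |byte, i|
    # A shorter replacement list is padded with its last byte, as tr does.
    # A byte listed more than once keeps its last mapping.
    table[byte] = to_bytes[i] || to_bytes.last
  end
  # Translate every byte of the input, leaving unlisted bytes unchanged.
  str.bytes.map { |b| table.fetch(b, b) }.pack('C*').force_encoding('UTF-8')
end

puts bytewise_tr('andrés', 'áéíóú', 'aeiou')
puts bytewise_tr('lópez',  'áéíóú', 'aeiou')
```

In UTF-8 each of these accented vowels is the byte 0xC3 followed by a second byte. 0xC3 shows up five times in the "from" list and ends up mapped to 'u', while the second byte of 'é' (0xA9) sits in the fourth position and maps to 'o'. That is how 'é' turns into 'uo'.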

The solution is not to use #tr!, but #gsub!. It isn't as short, but at
least it's right ;)

   norm.gsub!('ä', 'a')
   norm.gsub!('ë', 'e')
   # and so on...

And because that is against DRY (Don't Repeat Yourself), I would
recommend storing the mapping as a hash:

   accents = { 'ä' => 'a', 'ë' => 'e', ... }
   accents.each do |accent, replacement|
     norm.gsub!(accent, replacement)
   end
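
For completeness, a fuller sketch of the hash idea. It assumes Ruby 1.9 or later, where gsub accepts a hash as the replacement, and a UTF-8 source file. The constant names are made up, and the table covers only the vowels from the original post:

```ruby
# Illustrative mapping (an assumption): only the accented vowels from
# the original post, each mapped to its plain lowercase vowel.
ACCENT_MAP = {}
%w[áéíóú àèìòù äëïöü âêîôû ÁÉÍÓÚ ÀÈÌÒÙ ÄËÏÖÜ ÂÊÎÔÛ].each do |group|
  group.chars.zip('aeiou'.chars) do |accented, plain|
    ACCENT_MAP[accented] = plain
  end
end

# One pattern that matches any character in the table.
ACCENT_PATTERN = Regexp.union(ACCENT_MAP.keys)

def normalize_for_sorting(s)
  return nil unless s
  # Each match is looked up in the hash; downcase handles the rest.
  s.gsub(ACCENT_PATTERN, ACCENT_MAP).downcase
end

puts normalize_for_sorting('Andrés')
puts normalize_for_sorting('López')
```

This makes a single pass over the string instead of one gsub! per accent, and adding more characters only means extending the table.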

Regards,
   Robin Stocker