Character substitution using tr()

cmaxvv · April 22, 2008, 11:46am

I’m using a method that I found on the acts_as_ferret site:

http://projects.jkraemer.net/acts_as_ferret/#UTF-8support

which is intended to strip accents out of strings, turning for example
“La Bohème” into “La Boheme”. Here’s the method:

def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end

However, it’s breaking for me: è is turned into “yy”. I think this is
to do with the number of bytes used: the first string passed to tr()
uses 2 bytes per character while the second uses 1 byte per character:

"ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ".size
=> 110

"AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy".size
=> 55

Assuming this is the problem, can anyone tell me how to get around it?
I know next to nothing about character encoding: I tried converting both
translation strings to utf8 with String#toutf8, but that didn’t make any
difference.

thanks, max

cmaxvv · April 22, 2008, 6:45pm

On Tuesday 22 April 2008 11:46:23 Max W. wrote:

     "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
  gsub(/Æ/, "AE").
  gsub(/Ð/, "Eth").
  gsub(/Þ/, "THORN").
  gsub(/ß/, "ss").
  gsub(/æ/, "ae").
  gsub(/ð/, "eth").
  gsub(/þ/, "thorn")

end

With Ruby 1.9 your code works fine without modifications. With Ruby 1.8
and its support for Unicode (or lack thereof), it might be quite a
problem to get it working.

Assuming this is the problem, can anyone tell me how to get around it?
I know next to nothing about character encoding: I tried converting both
translation strings to utf8 with String#toutf8, but that didn’t make any
difference.

UTF-8 is a variable-length encoding: the ASCII range (code points 0-127,
which includes a-z and A-Z) is stored unchanged as 1 byte per character,
and everything else is encoded as 2-4 bytes per character. Both of the
strings are therefore valid UTF-8, but Ruby 1.8’s tr can’t operate at the
character level, only at the byte level.

Jan
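
To make the byte-versus-character point concrete, here is a minimal sketch that runs on a modern Ruby (2.0 or later) and assumes a UTF-8 source file. String#b is not part of the code discussed in this thread; it is used here only to imitate the byte-level view that Ruby 1.8's tr had, so treat the second result as an approximation of the 1.8 behaviour rather than a reproduction of it.

```ruby
from = "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ"
to   = "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy"

# Character count and byte count differ for the accented string only.
from_chars = from.length    # => 55
from_bytes = from.bytesize  # => 110
to_bytes   = to.bytesize    # => 55

# Character-level tr (Ruby 1.9+): the 35th character maps to the 35th.
char_level = "è".tr(from, to)      # => "e"

# Byte-level tr (imitating 1.8): "è" is the two bytes 0xC3 0xA8, and both
# land past the end of the 55-byte replacement string, which tr pads with
# its last character, "y".
byte_level = "è".b.tr(from.b, to)  # => "yy"
```

The byte-level result matches the “yy” reported in the original question, which suggests the diagnosis (2 bytes per character on one side, 1 byte on the other) was on the right track.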

cmaxvv · April 22, 2008, 7:05pm

Jan D. wrote:

With Ruby 1.9 your code works fine without modifications. With Ruby 1.8
and its support for Unicode (or lack thereof), it might be quite a
problem to get it working.

Ah… I’m a bit scared to change our project over to Ruby 1.9 (I didn’t
know there was a 1.9) to solve this problem. I ended up just picking
the most commonly used accents and doing individual gsubs on the strings
to swap them out. Feels dirty but it works.

Thanks a lot for the info!
max
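
For readers wondering what the "individual gsubs" workaround might look like, here is a rough sketch. The thread does not say which accents were picked, so the COMMON_ACCENTS table and the strip_common_accents name are purely hypothetical. Because gsub matches the whole multi-byte sequence of each accented character, this approach should not depend on tr being character-aware.

```ruby
# Hypothetical subset of accents; the actual list used is not given in the thread.
COMMON_ACCENTS = {
  "é" => "e", "è" => "e", "ê" => "e",
  "à" => "a", "â" => "a",
  "ç" => "c", "ô" => "o", "ü" => "u"
}

# Apply one gsub per table entry, threading the result through each step.
def strip_common_accents(s)
  COMMON_ACCENTS.inject(s) do |result, (accented, plain)|
    result.gsub(accented, plain)
  end
end

strip_common_accents("La Bohème")  # => "La Boheme"
```

The obvious trade-off is coverage: any accented character missing from the table passes through unchanged.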

cmaxvv · April 22, 2008, 7:28pm

Max W. wrote:

However, it’s breaking for me: è is turned into “yy”.

It works if you require ‘jcode’ first.

HTH,
Sebastian
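
A sketch of how the jcode suggestion could be wired in, for anyone following along. It assumes $KCODE must be set to UTF-8 for jcode's tr to treat strings as multi-byte; as far as I know Rails already sets this, which may be why the require alone was enough in this case. The version guard is my own addition so the same file also loads on Ruby 1.9 and later, where jcode no longer exists and tr is already character-based.

```ruby
# Ruby 1.8 only: switch tr and related String methods to multi-byte-aware
# versions. Skipped entirely on 1.9+.
if RUBY_VERSION.to_f < 1.9
  $KCODE = "u"     # assumption: strings are UTF-8
  require "jcode"
end

def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end

strip_diacritics("La Bohème")  # => "La Boheme"
```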

cmaxvv · April 23, 2008, 10:48am

Sebastian H. wrote:

Max W. wrote:

However, it’s breaking for me: è is turned into “yy”.

It works if you require ‘jcode’ first.

HTH,
Sebastian

Perfect, thanks! That’s much more palatable than upgrading ruby.

cheers
max