Character substitution using tr()

I’m using a method that I found on the acts_as_ferret site:

http://projects.jkraemer.net/acts_as_ferret/#UTF-8support

which is intended to strip accents out of strings, turning, for example,
“La Bohème” into “La Boheme”. Here’s the method:

def strip_diacritics(s)
  # latin1 subset only
  s.tr("ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
       "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
    gsub(/Æ/, "AE").
    gsub(/Ð/, "Eth").
    gsub(/Þ/, "THORN").
    gsub(/ß/, "ss").
    gsub(/æ/, "ae").
    gsub(/ð/, "eth").
    gsub(/þ/, "thorn")
end

However, it’s breaking for me: è is turned into “yy”. I think this has
to do with the number of bytes used: the first string passed to tr()
uses 2 bytes per character while the second uses 1 byte per character:

"ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ".size
=> 110

"AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy".size
=> 55

Assuming this is the problem, can anyone tell me how to get around it?
I know next to nothing about character encoding: I tried converting both
translation strings to UTF-8 with String#toutf8, but that didn’t make any
difference.

thanks, max

On Tuesday 22 April 2008 11:46:23 Max W. wrote:

>      "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy").
>   gsub(/Æ/, "AE").
>   gsub(/Ð/, "Eth").
>   gsub(/Þ/, "THORN").
>   gsub(/ß/, "ss").
>   gsub(/æ/, "ae").
>   gsub(/ð/, "eth").
>   gsub(/þ/, "thorn")
> end

With Ruby 1.9 your code works fine without modifications. With Ruby 1.8
and its support for Unicode (or lack thereof), it might be quite a
problem to get it working.

> Assuming this is the problem, can anyone tell me how to get around it?
> I know next to nothing about character encoding: I tried converting both
> translation strings to UTF-8 with String#toutf8, but that didn’t make any
> difference.

UTF-8 is a variable-length encoding: ASCII characters (which include
a-z and A-Z) take a single byte each, and everything else is encoded as
a 2- to 4-byte sequence. Both strings are therefore valid UTF-8, but
Ruby 1.8’s tr can’t operate at the character level, only at the byte
level.

Jan
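
For anyone reading along, here is a minimal sketch of the byte/character distinction described above. It assumes Ruby 2.0 or later (String#b is not available before that), and the variable names are only illustrative. The last line imitates a byte-level tr by forcing both the input and the source set to binary, which appears to reproduce the “yy” result from the original post.

```ruby
from = "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ"
to   = "AAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy"

# Characters vs. bytes: each accented letter here needs 2 bytes in UTF-8.
char_count = from.length    # 55 characters
byte_count = from.bytesize  # 110 bytes

# Character-aware tr (Ruby 1.9+): position 34 in `from` maps to position 34 in `to`.
stripped = "La Bohème".tr(from, to)  # "La Boheme"

# Byte-level tr, imitated with binary strings: `from` is now 110 one-byte
# "characters", so both bytes of "è" fall past the end of `to` and get its
# last character.
bytewise = "è".b.tr(from.b, to)      # "yy"
```

Because the replacement string is shorter than the byte-wise source set, tr pads it with its final character, which is why every accented letter would collapse to “yy” rather than to garbage.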

Jan D. wrote:

> With Ruby 1.9 your code works fine without modifications. With Ruby 1.8
> and its support for Unicode (or lack thereof), it might be quite a
> problem to get it working.

Ah… I’m a bit scared to switch our project over to Ruby 1.9 (I didn’t
know there was a 1.9) just to solve this problem. I ended up picking
the most commonly used accents and doing individual gsubs on the strings
to swap them out. Feels dirty, but it works.

Thanks a lot for the info!
max
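
The individual-gsub workaround described above might look roughly like the sketch below. The table contents and the names COMMON_ACCENTS and strip_common_accents are hypothetical, chosen for illustration; the actual list used in the project was not posted.

```ruby
# Hypothetical table of the accents that matter most; extend as needed.
COMMON_ACCENTS = {
  "à" => "a", "â" => "a", "ç" => "c",
  "é" => "e", "è" => "e", "ê" => "e", "ë" => "e",
  "î" => "i", "ï" => "i", "ô" => "o",
  "ù" => "u", "û" => "u", "ü" => "u"
}

# Apply one gsub per table entry, threading the result through inject.
def strip_common_accents(s)
  COMMON_ACCENTS.inject(s) do |result, (accented, plain)|
    result.gsub(accented, plain)
  end
end

puts strip_common_accents("La Bohème")
```

A string pattern passed to gsub matches the whole multi-byte sequence, so this approach should not depend on tr being character-aware.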

Max W. wrote:

> However, it’s breaking for me: è is turned into “yy”.

It works if you require ‘jcode’ first.

HTH,
Sebastian
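
A rough sketch of the jcode suggestion, guarded so that it also runs on newer Rubies where jcode no longer exists. Setting $KCODE to "u" is an assumption here: as far as I recall, jcode only switches tr to multi-byte mode when $KCODE is set, and Rails of that era set it by default. The method name strip_accents and the shortened lowercase-only mapping are illustrative.

```ruby
# Ruby 1.8 only: jcode was dropped from the standard library in 1.9,
# where String#tr is already character-aware.
if RUBY_VERSION < "1.9"
  $KCODE = "u"     # assumption: treat string contents as UTF-8
  require "jcode"  # redefines String#tr and friends to handle multi-byte chars
end

# Shortened lowercase-only mapping, 27 characters on each side.
def strip_accents(s)
  s.tr("àáâãäåçèéêëìíîïñòóôõöùúûüýÿ",
       "aaaaaaceeeeiiiinooooouuuuyy")
end

puts strip_accents("La Bohème")
```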

Sebastian H. wrote:

> Max W. wrote:
>
>> However, it’s breaking for me: è is turned into “yy”.
>
> It works if you require ‘jcode’ first.
>
> HTH,
> Sebastian

Perfect, thanks! That’s much more palatable than upgrading ruby.

cheers
max
