Removing diacritical marks


#1

Hello Rubyists,

I would like to remove the accent marks (a.k.a. diacritical marks) from a
String. Assuming “line” is a String, this gets most of them:

line.gsub!(/[ÀÁÂÃÄ]/,"A")
line.gsub!(/[âãäàá]/,"a")
line.gsub!(/[ÈÉÊË]/,"E")
line.gsub!(/[êëèé]/,"e")
line.gsub!(/[ÌÍÎÏ]/,"I")
line.gsub!(/[îïìí]/,"i")
line.gsub!(/[ÒÓÔÕÖ]/,"O")
line.gsub!(/[ôõöòó]/,"o")
line.gsub!(/[ÙÚÛÜ]/,"U")
line.gsub!(/[ûüùú]/,"u")
line.gsub!(/Ý/,"Y")
line.gsub!(/ý/,"y")
line.gsub!(/ñ/,"n")

Is there an easier/better way to do this?


#2

Paul B. wrote:

I would like to remove the accent marks (a.k.a. diacritical marks) from a
String. Assuming “line” is a String, this gets most of them:

line.gsub!(/[ÀÁÂÃÄ]/,"A")


Is there an easier/better way to do this?

Yes. There’s a potential problem with your way: if the accented characters
are encoded as more than one byte (as they are in UTF-8) and the regexp is
not multibyte-aware, each byte will be replaced with an A: “À” => “AA”.

This is safer: line.gsub!(/À|Á|Â|Ã|Ä/,"A")
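
To make the byte-versus-character point concrete, here is a minimal sketch. It assumes Ruby 1.9 or later with UTF-8 source encoding, where regexps already match whole characters, so the byte-wise behaviour is simulated by hand; the variable names are only illustrative.

```ruby
# "À" is one character but two bytes in UTF-8 (0xC3 0x80).
s = "À"
char_count = s.length     # counts characters
byte_count = s.bytesize   # counts bytes

# Simulated byte-oriented replacement: every non-ASCII byte becomes "A",
# which is roughly what a regexp that is not multibyte-aware would do.
bytewise = s.bytes.map { |b| b >= 0x80 ? "A" : b.chr }.join

# Character-oriented replacement: the regexp sees whole characters.
charwise = s.gsub(/[ÀÁÂÃÄ]/, "A")

puts "#{char_count} character, #{byte_count} bytes"
puts "byte-wise: #{bytewise}"
puts "char-wise: #{charwise}"
```

With a multibyte-aware regexp the character class and the alternation form should behave the same; the alternation is mainly a safeguard for byte-oriented matching.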

I translated a method to do this from PHP earlier this year:
http://tinyurl.com/q8hlg [Google G.]

Cheers,
Dave


#3

I translated a method to do this from PHP earlier this year:
http://tinyurl.com/q8hlg [Google G.]

Here’s a simpler version (hard-coded for UTF-8; it would need some
tweaking for other encodings). It has a side effect of transliterating
punctuation to ASCII as well, which may or may not be desirable.

Paul


$KCODE = 'u'
require 'iconv'

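class String
  # Transliterates each non-ASCII character to US-ASCII with Iconv, then
  # drops any leading accent mark that Iconv emits in front of a letter.
  def strip_diacritics
    self.gsub(/[^\x20-\x7f]/) {
      Iconv.iconv('us-ascii//IGNORE//TRANSLIT', 'utf-8',
                  $&)[0].sub(/^[\^`'"~](?=[a-z])/i, '')
    }
  end
end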

require 'test/unit'
class TestStripDiacritics < Test::Unit::TestCase

  def test_upper_case
    assert_equal('AAAAA', 'ÀÁÂÃÄ'.strip_diacritics)
    assert_equal('EEEE', 'ÈÉÊË'.strip_diacritics)
    assert_equal('IIII', 'ÌÍÎÏ'.strip_diacritics)
    assert_equal('OOOOO', 'ÒÓÔÕÖ'.strip_diacritics)
    assert_equal('UUUU', 'ÙÚÛÜ'.strip_diacritics)
    assert_equal('Y', 'Ý'.strip_diacritics)
    assert_equal('N', 'Ñ'.strip_diacritics)
  end

  def test_lower_case
    assert_equal('aaaaa', 'âãäàá'.strip_diacritics)
    assert_equal('eeee', 'êëèé'.strip_diacritics)
    assert_equal('iiii', 'îïìí'.strip_diacritics)
    assert_equal('ooooo', 'ôõöòó'.strip_diacritics)
    assert_equal('uuuu', 'ûüùú'.strip_diacritics)
    assert_equal('y', 'ý'.strip_diacritics)
    assert_equal('n', 'ñ'.strip_diacritics)
  end

  def test_words
    assert_equal('Internationalizaetion',
                 'Iñtërnâtiônàlizætiøn'.strip_diacritics)
  end

  def test_punctuation
    assert_equal('-', '–'.strip_diacritics)
    assert_equal("''", "‘’".strip_diacritics)
  end
end
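
For readers on newer Rubies, where `$KCODE` no longer has any effect and `iconv` is no longer part of the standard library, a roughly comparable result can be sketched with `String#unicode_normalize` (Ruby 2.2 and later). This is a different technique from the Iconv approach above: it decomposes each character and removes the combining marks instead of transliterating. The helper name `strip_diacritics_nfd` is made up for this example.

```ruby
# Hypothetical helper, not from the thread: decompose, then drop the marks.
def strip_diacritics_nfd(str)
  # NFD splits "é" into "e" followed by U+0301 (combining acute accent).
  decomposed = str.unicode_normalize(:nfd)
  # \p{Mn} matches nonspacing marks, i.e. the combining accents.
  decomposed.gsub(/\p{Mn}/, "")
end

puts strip_diacritics_nfd("ÀÁÂÃÄ")
puts strip_diacritics_nfd("âãäàá")
puts strip_diacritics_nfd("Iñtërnâtiônàlizætiøn")
```

Unlike the Iconv version, this leaves letters with no canonical decomposition (such as æ and ø) untouched, and it does not transliterate punctuation, so the last call yields "Internationalizætiøn" rather than "Internationalizaetion".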