Forum: Ruby removing diacritical marks

Paul Barry (Guest)
on 2006-03-17 04:45
(Received via mailing list)
Hello Rubyists,

I would like to remove the accent marks (a.k.a. diacritical marks) from a
String.  Assuming "line" is a String, this gets most of them:

    line.gsub!(/[ÀÁÂÃÄ]/,"A")
    line.gsub!(/[âãäàá]/,"a")
    line.gsub!(/[ÈÉÊË]/,"E")
    line.gsub!(/[êëèé]/,"e")
    line.gsub!(/[ÌÍÎÏ]/,"I")
    line.gsub!(/[îïìí]/,"i")
    line.gsub!(/[ÒÓÔÕÖ]/,"O")
    line.gsub!(/[ôõöòó]/,"o")
    line.gsub!(/[ÙÚÛÜ]/,"U")
    line.gsub!(/[ûüùú]/,"u")
    line.gsub!(/Ý/,"Y")
    line.gsub!(/ý/,"y")
    line.gsub!(/ñ/,"n")

Is there an easier/better way to do this?
Dave Burt (Guest)
on 2006-03-17 05:10
(Received via mailing list)
Paul Barry wrote:
> I would like to remove the accents marks (a.k.a diacritical marks) from a
> String.  Assuming "line" is a String, this gets most of them:
>
>     line.gsub!(/[ÀÁÂÃÄ]/,"A")
> ...
> Is there an easier/better way to do this?

Yes. There's a potential problem with your way: if the accented characters
are more than one byte each (as they are in a multibyte encoding such as
UTF-8), each byte will be replaced with an A:  "À" => "AA".

This is safer: line.gsub!(/À|Á|Â|Ã|Ä/,"A")
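
To make the byte-versus-character point concrete, here is a minimal sketch. It assumes Ruby 2.0 or later (encoding-aware strings, UTF-8 source files, and `String#b`). The `.b` call and the `/n` regexp flag are used only to imitate byte-wise matching; they are an illustration and not something from the posts in this thread.

```ruby
accented = "À"

# In UTF-8 this single character occupies two bytes.
byte_count = accented.bytes.length
char_count = accented.chars.length

# Imitate byte-wise matching: treat the string as raw bytes and replace
# every non-ASCII byte. Each byte of the character becomes its own "A".
bytewise = accented.b.gsub(/[\x80-\xFF]/n, "A")

# Character-wise matching replaces the whole character once.
charwise = accented.gsub(/[ÀÁÂÃÄ]/, "A")

puts byte_count, char_count, bytewise, charwise
```

On these newer Ruby versions a character class over a UTF-8 string already matches whole characters, so the doubling only appears when the string is handled as raw bytes.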

I translated a method to do this from PHP earlier this year:
http://tinyurl.com/q8hlg [Google Groups]

Cheers,
Dave
Paul Battley (Guest)
on 2006-03-17 10:50
(Received via mailing list)
> I translated a method to do this from PHP earlier this year:
> http://tinyurl.com/q8hlg [Google Groups]

Here's a simpler version (hard-coded for UTF-8; it would need some
tweaking for other encodings). It has a side effect of transliterating
punctuation to ASCII as well, which may or may not be desirable.

Paul

----

$KCODE = 'u'
require 'iconv'

class String
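  def strip_diacritics
    # Transliterate every character outside printable ASCII.
    self.gsub(/[^\x20-\x7f]/){
      # iconv may emit the accent as a separate leading ASCII character;
      # drop it when it is followed by a letter.
      Iconv.iconv('us-ascii//IGNORE//TRANSLIT', 'utf-8', $&)[0].
        sub(/^[\^`'"~](?=[a-z])/i, '')
    }
  end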
end

require 'test/unit'
class TestStripDiacritics < Test::Unit::TestCase

  def test_upper_case
    assert_equal('AAAAA', 'ÀÁÂÃÄ'.strip_diacritics)
    assert_equal('EEEE', 'ÈÉÊË'.strip_diacritics)
    assert_equal('IIII', 'ÌÍÎÏ'.strip_diacritics)
    assert_equal('OOOOO', 'ÒÓÔÕÖ'.strip_diacritics)
    assert_equal('UUUU', 'ÙÚÛÜ'.strip_diacritics)
    assert_equal('Y', 'Ý'.strip_diacritics)
    assert_equal('N', 'Ñ'.strip_diacritics)
  end

  def test_lower_case
    assert_equal('aaaaa', 'âãäàá'.strip_diacritics)
    assert_equal('eeee', 'êëèé'.strip_diacritics)
    assert_equal('iiii', 'îïìí'.strip_diacritics)
    assert_equal('ooooo', 'ôõöòó'.strip_diacritics)
    assert_equal('uuuu', 'ûüùú'.strip_diacritics)
    assert_equal('y', 'ý'.strip_diacritics)
    assert_equal('n', 'ñ'.strip_diacritics)
  end

  def test_words
    assert_equal('Internationalizaetion',
                 'Iñtërnâtiônàlizætiøn'.strip_diacritics)
  end

  def test_punctuation
    # The input character was garbled to '?' in the archive; an en dash is assumed.
    assert_equal('-', '–'.strip_diacritics)
    assert_equal("''", "''".strip_diacritics)
  end
end
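
On current Ruby versions the code above will not run unmodified: `Iconv` was removed from the standard library in Ruby 2.0, and `$KCODE` is obsolete. A minimal sketch of an alternative, assuming Ruby 2.2 or later for `String#unicode_normalize`, is to decompose each character and drop the combining marks. The method name `strip_diacritics_nfd` is hypothetical, and this is not a drop-in replacement: letters without a canonical decomposition (such as æ and ø) and punctuation are left as they are.

```ruby
class String
  # Hypothetical helper: NFD splits "é" into "e" plus a combining accent,
  # and \p{Mn} matches those combining (nonspacing) marks.
  def strip_diacritics_nfd
    unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
  end
end

puts 'ÀÁÂÃÄ'.strip_diacritics_nfd
puts 'Iñtërnâtiônàlizætiøn'.strip_diacritics_nfd
```

If ASCII output is required for letters such as æ and ø, an explicit mapping for them would still be needed on top of this.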