Forum: Ruby Non-English characters

Thiago Arrais (Guest)
on 2007-02-13 19:01
(Received via mailing list)
Has anyone seen a non-english characters library for Ruby walking
around? For now, I need to remove letter decorators (in other words,
'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
characters ('!etter' becomes 'etter').

Those are some pretty simple functions that I could write myself
(actually I already have), but it would be nice to use some better
tested code.

Cheers,

Thiago Arrais
Carlos (Guest)
on 2007-02-13 19:35
(Received via mailing list)
Thiago Arrais wrote:
> Has anyone seen a non-english characters library for Ruby walking
> around? For now, I need to remove letter decorators (in other words,
> 'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
> characters ('!etter' becomes 'etter').
>
> Those are some pretty simple functions that I could write myself
> (actually I already have), but it would be nice to use some better
> tested code.

You could use the "unicode" library by Yoshida Masato.
   http://www.yoshidam.net/Ruby.html

Example:

$ cat uni.rb
require 'unicode'
txt = 'ñ÷åòôùõéïðÁÓÄÆÇÈÊËÌ!@#*$%^&'
puts Unicode.decompose(txt).delete('^0-9A-Za-z')

$ ruby uni.rb
naoouoeiAOACEEEI


Good luck.
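
The same decompose-then-delete idea can be sketched without the extension. This is a minimal sketch, assuming Ruby 2.2 or later, where String#unicode_normalize is built in; the method name strip_accents is hypothetical and not part of the unicode library above.

```ruby
# Hypothetical helper; assumes Ruby >= 2.2 and a UTF-8 source file.
def strip_accents(text)
  text.unicode_normalize(:nfd)   # canonical decomposition: 'ñ' becomes 'n' + U+0303
      .delete('^0-9A-Za-z')      # keep only ASCII letters and digits
end

puts strip_accents('ñ÷åòôùõéïðÁÓÄÆÇÈÊËÌ!@#*$%^&')
# prints naoouoeiAOACEEEI
```

As in the output above, characters with no canonical decomposition (such as 'ð' and 'Æ') are simply dropped.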
Xavier Noria (Guest)
on 2007-02-13 19:59
(Received via mailing list)
On Feb 13, 2007, at 7:00 PM, Thiago Arrais wrote:

> Has anyone seen a non-english characters library for Ruby walking
> around? For now, I need to remove letter decorators (in other words,
> 'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
> characters ('!etter' becomes 'etter').
>
> Those are some pretty simple functions that I could write myself
> (actually I already have), but it would be nice to use some better
> tested code.

The best approach I've seen[*] is to decompose and map to ASCII:

   Iconv.iconv('ascii//ignore//translit', 'utf-8', str)

and then sanitize.

I think this is better than the technique that passes through Unicode
decomposition because it also handles ß (ss), € (EUR), æ (ae), œ
(oe), etc.

-- fxn

[*] Seen in the source of the Rails plugin acts_as_friendly_param,
which in turn takes the idea from Mephisto.
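
Iconv was removed from Ruby's standard library in 2.0, so on newer versions the call above needs the iconv gem. A rough sketch of the same goal, assuming Ruby 2.2 or later: a small hand-written table covers characters that have no canonical decomposition, and String#unicode_normalize handles the accents. The table contents and the name to_ascii are illustrative assumptions, not what iconv does internally.

```ruby
# Illustrative table only: characters that Unicode decomposition leaves alone.
EXTRA_TRANSLIT = {
  'ß' => 'ss', '€' => 'EUR', 'æ' => 'ae', 'œ' => 'oe'
}.freeze

# Hypothetical helper: map to ASCII first, then sanitize.
def to_ascii(text)
  mapped = text.gsub(Regexp.union(EXTRA_TRANSLIT.keys), EXTRA_TRANSLIT)
  mapped.unicode_normalize(:nfd)    # split accented letters into base + mark
        .delete('^0-9A-Za-z ')      # sanitize: keep ASCII letters, digits, spaces
end

puts to_ascii('Straße cœur 10€')
# prints Strasse coeur 10EUR
```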
Paul Duncan (Guest)
on 2007-02-13 20:21
(Received via mailing list)
* Thiago Arrais (thiago.arrais@gmail.com) wrote:
> Has anyone seen a non-english characters library for Ruby walking
> around? For now, I need to remove letter decorators (in other words,
> 'ñ' becomes n and 'â' becomes a) and drop non-alphanumeric
> characters ('!etter' becomes 'etter').

I thought iconv transliteration might do this, but it doesn't:

  require 'iconv'
  i = Iconv.new('ascii//TRANSLIT//IGNORE', 'iso-8859-1')
  i.iconv('ñ')
  => "?"

A bit of googling turns up the following:

* Text::Unaccent, a Perl module available via CPAN
  (http://search.cpan.org/~ldachary/Text-Unaccent-1.0...)
* unac, a GNU utility (and library) that removes accents from
  characters. (http://home.gna.org/unac/unac-man3.en.html)

Both work roughly the same way; they use iconv to convert the source
string to UTF-16BE, followed by a mapping table to map accented
characters to their non-accented equivalents.

The unac link above has a bit more information about how these mapping
tables are generated; basically they have a script that parses a unicode
data file at build time and generates the mapping table.
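
To make the table-generation idea concrete, here is a minimal sketch in Ruby. It is not unac's actual script: the helper names are hypothetical, and the sample rows are trimmed to the first six fields of the UnicodeData.txt layout, where field 0 is the code point and field 5 is the decomposition mapping.

```ruby
# Illustrative rows only, trimmed to six semicolon-separated fields.
SAMPLE_ROWS = <<~ROWS
  0041;LATIN CAPITAL LETTER A;Lu;0;L;
  00E2;LATIN SMALL LETTER A WITH CIRCUMFLEX;Ll;0;L;0061 0302
  00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303
ROWS

# Hypothetical helper: maps each decomposable character to the first
# (base) code point of its decomposition.
def build_unaccent_table(rows)
  rows.each_line.with_object({}) do |line, table|
    fields = line.chomp.split(';', -1)
    decomposition = fields[5].to_s
    # Skip rows without a mapping and tagged (compatibility) mappings.
    next if decomposition.empty? || decomposition.start_with?('<')

    accented = [fields[0].hex].pack('U')
    base     = [decomposition.split.first.hex].pack('U')
    table[accented] = base
  end
end

# Hypothetical helper: replace each character found in the table.
def unaccent(text, table)
  text.each_char.map { |ch| table.fetch(ch, ch) }.join
end

TABLE = build_unaccent_table(SAMPLE_ROWS)
puts unaccent('ñâA', TABLE)
# prints naA
```

A single lookup is enough for these rows; a character whose base is itself accented would need the lookup repeated until nothing changes.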

The mapping table is available here:
  http://www.unicode.org/Public/3.2-Update/UnicodeDa...

So anyway, the answer to your question appears to be:

* If you're just converting a couple of characters one time, just
  use a regular expression.
* If you're looking to convert an arbitrary number of characters one
  time and have access to a machine with GNU tools, just use unac.
* If you're not particular about the language, use the Perl library.
* If you don't mind installing the unac library, I wrote a quick wrapper
  for it.  See below for more.

This has been discussed on ruby-talk before:

  http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/...

I wrote a quick binding for the unac library; you can grab it from here:

  http://pablotron.org/files/unac-ruby-0.1.0.tar.gz (tarball)
  http://pablotron.org/files/unac-ruby-0.1.0.tar.gz.asc (PGP Signature)
  http://hg.pablotron.org/unac-ruby (Mercurial Repository)

If you're interested, I can probably write a pure-Ruby version
relatively quickly too.
Xavier Noria (Guest)
on 2007-02-13 21:05
(Received via mailing list)
On Feb 13, 2007, at 8:21 PM, Paul Duncan wrote:

>   i.iconv('ñ')
>   => "?"

It looks like your source file was not saved as iso-8859-1, because the same conversion works here:

   require 'iconv'
   puts Iconv.iconv('ascii//ignore//translit', 'iso-8859-1', "ñ")   # => ~n

It works in UTF8 as well:

   $KCODE = 'u'
   require 'iconv'

   puts Iconv.iconv('ascii//ignore//translit', 'utf-8', "ñ")   # => ~n

-- fxn
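
One plausible reason the declared source encoding matters: the same character is stored as different bytes in each encoding, so a converter told the wrong one sees different characters. A small sketch, assuming Ruby 1.9 or later (where strings carry an encoding) and a UTF-8 source file; the variable names are illustrative.

```ruby
utf8   = 'ñ'                          # UTF-8 string literal
latin1 = utf8.encode('ISO-8859-1')    # same character, re-encoded

p utf8.bytes     # prints [195, 177]
p latin1.bytes   # prints [241]

# Treating the UTF-8 bytes as ISO-8859-1 yields two unrelated characters.
mislabeled = utf8.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts mislabeled  # prints Ã±
```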