Has anyone seen a library for non-English characters in Ruby floating
around? For now, I need to remove letter decorations (in other words,
‘ñ’ becomes ‘n’ and ‘â’ becomes ‘a’) and drop non-alphanumeric
characters (‘!etter’ becomes ‘etter’).
Those are pretty simple functions that I could write myself
(actually I already have), but it would be nice to use some
better-tested code.
Cheers,
Thiago A.
Thiago A. wrote:
> Has anyone seen a library for non-English characters in Ruby floating
> around? For now, I need to remove letter decorations (in other words,
> ‘ñ’ becomes ‘n’ and ‘â’ becomes ‘a’) and drop non-alphanumeric
> characters (‘!etter’ becomes ‘etter’).
> Those are pretty simple functions that I could write myself
> (actually I already have), but it would be nice to use some
> better-tested code.
You could use the “unicode” library, by Yoshida Masato.
http://www.yoshidam.net/Ruby.html
Example:
$ cat uni.rb
require 'unicode'
txt = 'ñ÷åòôùõéïðÁÓÄÆÇÈÊËÌ!@#*$%^&'
puts Unicode.decompose(txt).delete('^0-9A-Za-z')
$ ruby uni.rb
naoouoeiAOACEEEI
Good luck.
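For readers without the "unicode" library installed, the same decompose-then-delete idea can be sketched with the standard library alone. This assumes Ruby 2.2 or later, where String#unicode_normalize is built in; the method name strip_decorations is made up for illustration and is not part of any library.

```ruby
# Assumption: Ruby 2.2+, which provides String#unicode_normalize.
# It stands in here for Unicode.decompose from the "unicode" library.
# strip_decorations is a hypothetical helper name.
def strip_decorations(text)
  # NFD splits an accented letter into a base letter plus combining marks,
  # e.g. "ñ" becomes "n" followed by U+0303 (combining tilde).
  decomposed = text.unicode_normalize(:nfd)
  # Keep ASCII letters and digits only; marks and punctuation are dropped.
  decomposed.delete('^0-9A-Za-z')
end

puts strip_decorations('ñ÷åòôùõéïðÁÓÄÆÇÈÊËÌ!@#*$%^&')
puts strip_decorations('!etter')
```

Characters that have no decomposition (÷, ð and Æ in the sample string) are simply dropped, which matches the output shown above.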
On Feb 13, 2007, at 7:00 PM, Thiago A. wrote:
> Has anyone seen a library for non-English characters in Ruby floating
> around? For now, I need to remove letter decorations (in other words,
> ‘ñ’ becomes ‘n’ and ‘â’ becomes ‘a’) and drop non-alphanumeric
> characters (‘!etter’ becomes ‘etter’).
> Those are pretty simple functions that I could write myself
> (actually I already have), but it would be nice to use some
> better-tested code.
The best approach I’ve seen[*] is to transliterate to ASCII with iconv:
Iconv.iconv('ascii//ignore//translit', 'utf-8', str)
and then sanitize the result.
I think this is better than the technique that passes through Unicode
decomposition because it also handles ß (ss), € (EUR), æ (ae), œ
(oe), etc.
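To make the difference concrete: letters such as ß, æ and œ have no Unicode decomposition, so a pure decompose-and-delete pass drops them. The sketch below does not call iconv (it is no longer bundled with current Ruby versions); instead it assumes Ruby 2.2+ and uses a small hand-written table to stand in for what transliteration would supply. FALLBACKS and to_ascii are hypothetical names, and the table covers only the four examples mentioned above.

```ruby
# Hypothetical fallback table for characters without a Unicode decomposition.
# It lists only the examples from the message above; extend as needed.
FALLBACKS = {
  'ß' => 'ss',
  '€' => 'EUR',
  'æ' => 'ae',
  'œ' => 'oe'
}.freeze

# Matches any single character that appears as a key in the table.
FALLBACK_PATTERN = Regexp.union(FALLBACKS.keys)

# to_ascii is a made-up helper; it assumes Ruby 2.2+ for unicode_normalize.
def to_ascii(text)
  # Apply the table first, because decomposition leaves these untouched.
  replaced = text.gsub(FALLBACK_PATTERN, FALLBACKS)
  # Then decompose and drop everything outside ASCII letters and digits
  # (this is the "sanitize" step in its simplest form).
  replaced.unicode_normalize(:nfd).delete('^0-9A-Za-z')
end

puts to_ascii('straße')  # strasse
puts to_ascii('œuvre')   # oeuvre
puts to_ascii('50€')     # 50EUR
```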
-- fxn
[*] Seen in the source of the Rails plugin acts_as_friendly_param,
which in turn takes the idea from Mephisto.
> Has anyone seen a library for non-English characters in Ruby floating
> around? For now, I need to remove letter decorations (in other words,
> ‘ñ’ becomes ‘n’ and ‘â’ becomes ‘a’) and drop non-alphanumeric
> characters (‘!etter’ becomes ‘etter’).
I thought iconv transliteration might do this, but it doesn’t:
require 'iconv'
i = Iconv.new('ascii//TRANSLIT//IGNORE', 'iso-8859-1')
i.iconv('ñ')
=> "?"
A bit of googling turns up the following:
Both work roughly the same way; they use iconv to convert the source
string to UTF-16BE, followed by a mapping table to map accented
characters to their non-accented equivalents.
The unac link above has a bit more information about how these mapping
tables are generated; basically they have a script that parses a unicode
data file at build time and generates the mapping table.
The mapping table is available here:
http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt
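As a rough illustration of how such a table could be generated, here is a minimal sketch that parses lines in the UnicodeData.txt layout (semicolon-separated; field 0 is the code point in hex, field 5 is the decomposition mapping). It is not the unac build script. The sample lines are written from memory of that layout and should be treated as illustrative, and build_unaccent_table is a hypothetical name.

```ruby
# Sample lines in the UnicodeData.txt layout (illustrative, not downloaded).
SAMPLE_LINES = [
  '00E2;LATIN SMALL LETTER A WITH CIRCUMFLEX;Ll;0;L;0061 0302;;;;N;LATIN SMALL LETTER A CIRCUMFLEX;;00C2;;00C2',
  '00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;LATIN SMALL LETTER N TILDE;;00D1;;00D1',
  '00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;'
].freeze

# Hypothetical helper: maps each decomposable character to its base character.
def build_unaccent_table(lines)
  table = {}
  lines.each do |line|
    fields = line.chomp.split(';', -1)
    decomposition = fields[5]
    # No mapping at all (e.g. sharp s): nothing to record.
    next if decomposition.nil? || decomposition.empty?
    # Compatibility mappings start with a tag such as <compat>; skipped here.
    next if decomposition.start_with?('<')
    # The first code point of the mapping is the base letter. A fuller version
    # would resolve this recursively, since a base can itself decompose.
    base = decomposition.split.first.hex
    table[[fields[0].hex].pack('U')] = [base].pack('U')
  end
  table
end

table = build_unaccent_table(SAMPLE_LINES)
# Characters missing from the table pass through unchanged.
puts 'ñâß'.gsub(/./) { |c| table.fetch(c, c) }  # naß
```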
So anyway, the answer to your question appears to be:
- If you’re just converting a couple of characters one time, just
use a regular expression.
- If you’re looking to convert an arbitrary number of characters one
time and have access to a machine with GNU tools, just use unac.
- If you’re not particular about the language, use the Perl library.
- If you don’t mind installing the unac library, I wrote a quick wrapper
for it. See below for more.
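For the first option in the list, a regular expression with a small replacement hash is enough. The sketch below handles only the two characters from the original question; simple_unaccent and REPLACEMENTS are made-up names, and it assumes a Ruby with encoding-aware strings (1.9 or later) and a UTF-8 source file.

```ruby
# Hypothetical table for the "just a couple of characters" case.
REPLACEMENTS = { 'ñ' => 'n', 'â' => 'a' }.freeze

def simple_unaccent(text)
  # Swap each listed character for its plain equivalent...
  plain = text.gsub(/[ñâ]/, REPLACEMENTS)
  # ...then drop anything that is not an ASCII letter or digit.
  plain.gsub(/[^0-9A-Za-z]/, '')
end

puts simple_unaccent('mañana')  # manana
puts simple_unaccent('!etter')  # etter
```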
This has been discussed on ruby-talk before:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/96626
I wrote a quick binding for the unac library; you can grab it from here:
http://pablotron.org/files/unac-ruby-0.1.0.tar.gz (tarball)
http://pablotron.org/files/unac-ruby-0.1.0.tar.gz.asc (PGP Signature)
http://hg.pablotron.org/unac-ruby (Mercurial Repository)
If you’re interested, I can probably write a pure-Ruby version
relatively quickly too.
On Feb 13, 2007, at 8:21 PM, Paul D. wrote:
> i.iconv('ñ')
> => "?"
Looks like your source code was not iso-8859-1, because it works:
require 'iconv'
puts Iconv.iconv('ascii//ignore//translit', 'iso-8859-1', "ñ")
# => ~n
It works in UTF-8 as well:
$KCODE = 'u'
require 'iconv'
puts Iconv.iconv('ascii//ignore//translit', 'utf-8', "ñ")
# => ~n
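The encoding mismatch suspected above is easy to see at the byte level. This sketch assumes a Ruby with encoding-aware strings (1.9 or later, so newer than the Ruby used in this thread) and a source file saved as UTF-8; it only shows how the same character is stored differently and does not reproduce the iconv call itself.

```ruby
utf8   = 'ñ'                         # assumed: this file is saved as UTF-8
latin1 = utf8.encode('ISO-8859-1')   # the same character as a single byte

p utf8.bytes    # [195, 177]
p latin1.bytes  # [241]

# Reading the two UTF-8 bytes as if they were ISO-8859-1 yields two
# unrelated characters instead of one "ñ", so a converter told that the
# input is iso-8859-1 never sees the letter it is supposed to map.
misread = utf8.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts misread.length  # 2
```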
-- fxn