Dealing with accented characters

Is there any way to write a search function that can search for words
that contain accented characters when the user types in words without
accented characters?

My database has a lot of names in it that have characters with accents
and other non-keyboard characters. When users search the database, I
would like for them to be able to find records with accented characters
even if they don’t type in the accent. For instance, a user might be
searching for a text by the author Chrétien de Troyes. Right now, they
have to type “Chrétien” into the search form to find him: I would like
for a search for “Chretien” to also find “Chrétien.”

This strikes me as a rather common problem: is there a good solution for
it?

On 29 Sep 2008, at 21:26, Morgan K. <rails-mailing-list@andreas-s.net>
wrote:

Is there any way to write a search function that can search for words
that contain accented characters when the user types in words without
accented characters?

Play with your database collation settings

Fred

Quoting Frederick C.:

Play with your database collation settings

Fred

This works if you have only one language. With multiple languages, you
need to keep track of the locale, switching as needed. Generic Latin1
may do what you need.

Jeffrey

I have been reading up on collation settings, and I’m not sure it will
do what I want. I don’t want to get rid of accented characters (which
is what would happen if I changed character sets), I just don’t want
searches to get thrown off by them. In other words, a search for
“Chrétien” or “Chretien” should still find “Chrétien”, and he should
still have the accent in his name. Can collation settings do this for
me, or is there some other solution?

Thanks!

On Oct 1, 2008, at 8:48 PM, Morgan K. wrote:

I have been reading up on collation settings, and I’m not sure it will
do what I want. I don’t want to get rid of accented characters (which
is what would happen if I changed character sets), I just don’t want
searches to get thrown off by them. In other words, a search for
“Chrétien” or “Chretien” should still find “Chrétien”, and he should
still have the accent in his name. Can collation settings do this for
me, or is there some other solution?

One approach is to transliterate your input, e.g.:

http://interglacial.com/~sburke/tpj/as_html/tpj22.html
– Sean M. Burke, Unidecode!, 2001

That way, “Chrétien” becomes “chretien” or some such for the purpose
of your search, but remains “Chrétien” in the text.

For example, both El-Aaiún and El-Aaiun could reference the same
underlying text:

http://svr225.stepx.com:3388/El-Aaiún
http://svr225.stepx.com:3388/El-Aaiun
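
For Latin-script names like these, a rough Ruby sketch of the same idea is below. It is not Unidecode: it leans on Ruby's built-in Unicode normalization (Ruby 2.2 or later) to split each accented letter into a base letter plus a combining mark, then drops the marks. The helper name fold_accents is made up for illustration.

```ruby
# Hypothetical helper: fold a string to an accent-free, lowercase search form.
# This is a stand-in for Unidecode, not a port of it.
def fold_accents(text)
  # NFD turns "é" into "e" followed by a combining acute accent.
  decomposed = text.unicode_normalize(:nfd)
  # \p{Mn} matches non-spacing (combining) marks; remove them, then lowercase.
  decomposed.gsub(/\p{Mn}/, '').downcase
end

puts fold_accents('Chrétien')  # prints "chretien"
puts fold_accents('El-Aaiún')  # prints "el-aaiun"
```

Letters that have no decomposition (ø, ß and the like) pass through unchanged, which is where a lookup table such as Unidecode's earns its keep.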

Cheers,


PA.
http://alt.textdrive.com/nanoki/

One approach is to transliterate your input, e.g.:

http://interglacial.com/~sburke/tpj/as_html/tpj22.html
– Sean M. Burke, Unidecode!, 2001

That way, “Chrétien” becomes “chretien” or some such for the purpose
of your search, but remains “Chrétien” in the text.

For example, both El-Aaiún and El-Aaiun could reference the same
underlying text:

http://svr225.stepx.com:3388/El-Aaiún
http://svr225.stepx.com:3388/El-Aaiun

This looks really promising, but after reading up on this for a while, I
don’t see how to get it to work with Rails… could you give me a few
pointers or direct me to some documentation?

Thank you!!

Petite A. [2008-10-02 19:56]:

At its core, Unidecode is simply a lookup table. Should be rather
straightforward to port to Ruby if it hasn’t been done already.

I wanted to port it myself, but a Ruby version has been there for over
a year now:

http://rubyforge.org/projects/unidecode

cheers
jens

On Oct 2, 2008, at 2:25 AM, Morgan K. wrote:

This looks really promising, but after reading up on this for a while, I
don’t see how to get it to work with Rails… could you give me a few
pointers or direct me to some documentation?

At its core, Unidecode is simply a lookup table. Should be rather
straightforward to port to Ruby if it hasn’t been done already.

Here is the original Perl implementation:

And below is a Lua port of it:

http://dev.alt.textdrive.com/browser/HTTP/Unidecode.lua

As well as the lookup tables themselves:

http://dev.alt.textdrive.com/browser/HTTP/Unidecode

Usage example:

local Unidecode = require( 'Unidecode' )

print( 1, 'Москва́', Unidecode( 'Москва́' ) )
print( 2, '北京', Unidecode( '北京' ) )
print( 3, 'Ἀθηνᾶ', Unidecode( 'Ἀθηνᾶ' ) )
print( 4, '서울', Unidecode( '서울' ) )
print( 5, '東京', Unidecode( '東京' ) )
print( 6, '京都市', Unidecode( '京都市' ) )
print( 7, 'नेपाल', Unidecode( 'नेपाल' ) )

1 Москва́ Moskva
2 北京 beijing
3 Ἀθηνᾶ Athena
4 서울 seoul
5 東京 dongjing
6 京都市 jingdushi
7 नेपाल nepaal
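
To make the "simply a lookup table" point concrete for Ruby, here is a toy sketch of the mechanism. The five-entry table and the names SAMPLE_TABLE and transliterate are invented for illustration; the real Unidecode tables cover vastly more of Unicode, and replacing unknown characters with "?" is just one possible choice.

```ruby
# Toy table for illustration only; NOT the real Unidecode data.
SAMPLE_TABLE = {
  "\u00E9" => 'e',   # é
  "\u00FA" => 'u',   # ú
  "\u00DF" => 'ss',  # ß
  "\u00F8" => 'o',   # ø
  "\u00C6" => 'AE'   # Æ
}.freeze

# Hypothetical helper: map each non-ASCII character through the table.
def transliterate(text, table = SAMPLE_TABLE)
  # Compose first (NFC) so "e" + combining accent matches the "é" key.
  text.unicode_normalize(:nfc).each_char.map do |ch|
    ch.ascii_only? ? ch : table.fetch(ch, '?')
  end.join
end

puts transliterate('Chrétien')  # prints "Chretien"
puts transliterate('Straße')    # prints "Strasse"
```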

If Unidecode is too much of a good thing, one could use iconv's
transliteration mode or such, e.g. iconv( 'utf-8', 'us-ascii//TRANSLIT' )…

One way or another, the crux of it is to transliterate your data as
well as your query, and then use the latter to search the former.
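
In Ruby terms, that could look roughly like the sketch below. Everything here is an assumption made for illustration: the Record struct, the search_key field and the fold helper (which uses Unicode normalization rather than Unidecode) are invented names, and the records live in memory rather than in a database.

```ruby
# Each record keeps the original name plus a folded key used only for searching.
Record = Struct.new(:name, :search_key)

# Hypothetical folding step: strip combining accents and lowercase.
def fold(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# Transliterate the data once, when the records are built.
def build_records(names)
  names.map { |name| Record.new(name, fold(name)) }
end

# Transliterate the query the same way, then match key against key.
def search(records, query)
  key = fold(query)
  records.select { |record| record.search_key.include?(key) }
end

records = build_records(['Chrétien de Troyes', 'Marie de France'])
search(records, 'Chretien').each { |record| puts record.name }
search(records, 'Chrétien').each { |record| puts record.name }
```

Both searches print "Chrétien de Troyes", accent intact. In a Rails app the folded key would presumably live in an extra column filled in whenever the name is saved, with the search form querying that column instead of the name itself.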

Cheers,


PA.
http://alt.textdrive.com/nanoki/