Dealing with accented characters

gwendydd · September 29, 2008, 10:26pm

Is there any way to write a search function that can search for words
that contain accented characters when the user types in words without
accented characters?

My database has a lot of names in it that have characters with accents
and other non-keyboard characters. When users search the database, I
would like for them to be able to find records with accented characters
even if they don’t type in the accent. For instance, a user might be
searching for a text by the author Chrétien de Troyes. Right now, they
have to type “Chrétien” into the search form to find him: I would like
for a search for “Chretien” to also find “Chrétien.”

This strikes me as a rather common problem: is there a good solution for
it?

gwendydd · September 29, 2008, 10:33pm

On 29 Sep 2008, at 21:26, Morgan K. <rails-mailing-list@andreas-s.net> wrote:

Is there any way to write a search function that can search for words
that contain accented characters when the user types in words without
accented characters?

Play with your database collation settings

Fred

gwendydd · October 1, 2008, 6:39pm

Quoting Frederick C.:

Play with your database collation settings

Fred

This works if you have only one language. With multiple languages, you
need to keep the locale, switching as needed. Generic Latin1 may do what
you need.

Jeffrey

gwendydd · October 1, 2008, 8:48pm

I have been reading up on collation settings, and I’m not sure it will
do what I want. I don’t want to get rid of accented characters (which
is what would happen if I changed character sets), I just don’t want
searches to get thrown off by them. In other words, a search for
“Chrétien” or “Chretien” should still find “Chrétien”, and he should
still have the accent in his name. Can collation settings do this for
me, or is there some other solution?

Thanks!

gwendydd · October 1, 2008, 9:02pm

On Oct 1, 2008, at 8:48 PM, Morgan K. wrote:

I have been reading up on collation settings, and I’m not sure it will
do what I want. I don’t want to get rid of accented characters (which
is what would happen if I changed character sets), I just don’t want
searches to get thrown off by them. In other words, a search for
“Chrétien” or “Chretien” should still find “Chrétien”, and he should
still have the accent in his name. Can collation settings do this for
me, or is there some other solution?

One approach is to transliterate your input, e.g.:

http://interglacial.com/~sburke/tpj/as_html/tpj22.html
– Sean M. Burke, Unidecode!, 2001

That way, “Chrétien” becomes “chretien” or some such for the purpose
of your search, but remains “Chrétien” in the text.

For example, both El-Aaiún and El-Aaiun could reference the same
underlying text:

http://svr225.stepx.com:3388/El-Aaiún
http://svr225.stepx.com:3388/El-Aaiun

Cheers,

–
PA.
http://alt.textdrive.com/nanoki/

gwendydd · October 2, 2008, 2:25am

One approach is to transliterate your input, e.g.:

Unidecode!
– Sean M. Burke, Unidecode!, 2001

That way, “Chrétien” becomes “chretien” or some such for the purpose
of your search, but remains “Chrétien” in the text.

For example, both El-Aaiún and El-Aaiun could reference the same
underlying text:

http://svr225.stepx.com:3388/El-Aaiún
http://svr225.stepx.com:3388/El-Aaiun

This looks really promising, but after reading up on this for a while, I
don’t see how to get it to work with Rails… could you give me a few
pointers or direct me to some documentation?

Thank you!!

gwendydd · October 2, 2008, 8:00pm

Petite A. [2008-10-02 19:56]:

At its core, Unidecode is simply a lookup table. Should be rather
straightforward to port to Ruby if it hasn’t been done already.

I wanted to do it, but a Ruby port has already been there for over a
year now:

http://rubyforge.org/projects/unidecode

cheers
jens

gwendydd · October 2, 2008, 7:56pm

On Oct 2, 2008, at 2:25 AM, Morgan K. wrote:

underlying text:

http://svr225.stepx.com:3388/El-Aaiún
http://svr225.stepx.com:3388/El-Aaiun

This looks really promising, but after reading up on this for a while, I
don’t see how to get it to work with Rails… could you give me a few
pointers or direct me to some documentation?

At its core, Unidecode is simply a lookup table. Should be rather
straightforward to port to Ruby if it hasn’t been done already.

Here is the original Perl implementation:

And below is a Lua port of it:

http://dev.alt.textdrive.com/browser/HTTP/Unidecode.lua

As well as the lookup tables themselves:

http://dev.alt.textdrive.com/browser/HTTP/Unidecode

Usage example:

local Unidecode = require( 'Unidecode' )

print( 1, 'Москва́', Unidecode( 'Москва́' ) )
print( 2, '北京', Unidecode( '北京' ) )
print( 3, 'Ἀθηνᾶ', Unidecode( 'Ἀθηνᾶ' ) )
print( 4, '서울', Unidecode( '서울' ) )
print( 5, '東京', Unidecode( '東京' ) )
print( 6, '京都市', Unidecode( '京都市' ) )
print( 7, 'नेपाल', Unidecode( 'नेपाल' ) )

1 Москва́ Moskva
2 北京 beijing
3 Ἀθηνᾶ Athena
4 서울 seoul
5 東京 dongjing
6 京都市 jingdushi
7 नेपाल nepaal
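
Since the original question was about Rails, here is a rough Ruby sketch of the lookup-table idea. It is not the Unidecode port itself: the table below is a tiny hand-written, hypothetical sample, and `TRANSLIT_TABLE` and `transliterate` are names made up for illustration. It assumes Ruby 2.2 or later for `String#unicode_normalize`.

```ruby
# Hypothetical, hand-written sample table. The real Unidecode tables
# cover far more of Unicode than these few Latin letters.
TRANSLIT_TABLE = {
  "\u00E9" => 'e',  # é
  "\u00E8" => 'e',  # è
  "\u00FA" => 'u',  # ú
  "\u00FC" => 'u',  # ü
  "\u00E7" => 'c',  # ç
  "\u00F1" => 'n',  # ñ
  "\u00DF" => 'ss', # ß
  "\u00E6" => 'ae'  # æ
}.freeze

# Compose the input first (NFC) so that "e" + combining accent becomes a
# single character, then replace every character found in the table.
# Characters that are not in the table pass through unchanged.
def transliterate(text)
  text.unicode_normalize(:nfc).each_char.map { |ch| TRANSLIT_TABLE.fetch(ch, ch) }.join
end

puts transliterate('Chrétien de Troyes')
puts transliterate('El-Aaiún')
```

With this sample table the two calls print the unaccented spellings, Chretien de Troyes and El-Aaiun. Anything outside the table (Cyrillic, Chinese and so on) is left alone, which is where the full tables earn their keep.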

If Unidecode is more than you need, one could use iconv's
transliteration mode or such, e.g. iconv( 'utf-8', 'us-ascii//TRANSLIT' )…

One way or another, the crux of it is to transliterate your data as
well as your query, and then use the latter to search the former.
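
A minimal Ruby sketch of that idea, under some assumptions: instead of Unidecode or iconv it strips accents with Ruby's built-in `String#unicode_normalize` (Ruby 2.2+), which only handles characters that decompose into a base letter plus combining marks. The names `search_key`, `AUTHORS`, `INDEX` and `find_authors` are hypothetical, and the in-memory array stands in for a database table.

```ruby
# Fold a string to a plain, lowercase search key.
# NFD splits "é" into "e" plus a combining accent; \p{Mn} matches the
# combining marks, which are then removed.
def search_key(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# Hypothetical records; the stored names keep their accents.
AUTHORS = ['Chrétien de Troyes', 'Marie de France', 'François Villon'].freeze

# Precompute one key per record, the way an extra database column would.
INDEX = AUTHORS.map { |name| [search_key(name), name] }.freeze

# Fold the query the same way, then match key against key.
def find_authors(query)
  key = search_key(query)
  INDEX.select { |k, _| k.include?(key) }.map { |_, name| name }
end

puts find_authors('Chretien')
puts find_authors('chrétien')
```

Both searches print Chrétien de Troyes, accent intact. In a Rails model the key could presumably live in an extra column (say, a hypothetical `name_key`) filled in a `before_save` callback, with the search form's input folded by the same function before querying that column.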

Cheers,

–
PA.
http://alt.textdrive.com/nanoki/