International character search

soulhunter · October 12, 2006, 4:24am

Hi!

I’m working on a program that needs case-insensitive search over text
containing international characters, like:

ñáéíóúàèìòùäëïöü and so on.

Anyway, I found a way of implementing it, but I don’t quite like it
because it would mean rewriting the autocomplete function for each
autocomplete I have in my project.

The way I found is to change the condition from:

LOWER(column) like '%thing_downcased%'

to

column ~* 'thing_downcased'

and to replace each international character with a [ñÑ]-style bracket
expression, like this:

name ~* 'la [ñÑ]apa'

It actually works (at least with PostgreSQL), but then I would need to
do the substitution every time I run a search, and I would need to
reimplement the autocomplete function for each autocompletion with
the new scheme.
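The substitution itself could live in one shared helper instead of being repeated per autocomplete. A minimal sketch, assuming Ruby 2.4 or later (for Unicode-aware upcase/downcase); the name accent_aware_pattern is hypothetical and not part of Rails or of the original post:

```ruby
# Characters with special meaning in a POSIX regular expression.
REGEX_SPECIAL = /[\\.*+?^$(){}\[\]|]/

# Hypothetical helper: builds the pattern for a `column ~* ?` condition.
# Each non-ASCII character becomes a bracket expression holding its
# lowercase and uppercase forms; ASCII characters pass through, with
# regex metacharacters backslash-escaped.
def accent_aware_pattern(term)
  term.each_char.map do |ch|
    if ch.ascii_only?
      ch.match?(REGEX_SPECIAL) ? "\\#{ch}" : ch
    else
      "[#{ch.downcase}#{ch.upcase}]"
    end
  end.join
end

puts accent_aware_pattern('la ñapa')   # prints: la [ñÑ]apa
```

The result would then be passed as a bound parameter rather than interpolated into the SQL string.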

Any better ideas?

Sincerely,

Ildefonso Camargo

soulhunter · October 18, 2006, 10:39am

On Oct 12, 2006, at 4:19 AM, soulhunter wrote:

Anyway, I found a way of implementing it, but I don’t quite like it
because it would mean rewriting the autocomplete function for each
autocomplete I have in my project.

The way I found is to change the condition from:

LOWER(column) like '%thing_downcased%'

Just to share a different approach: since you can’t expect users to
type accented words correctly, I usually store an extra normalized
column (say name_normalized) for searches, maintained in some Rails
way such as filters, or store just the normalized text in Ferret.
Then every query has to be normalized as well.
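To illustrate the idea, here is a minimal in-memory sketch, assuming Ruby 2.2 or later. The Place class and the search method are hypothetical stand-ins for a Rails model with a save filter and a finder, and the small normalize shown here uses Unicode decomposition rather than the character table posted in this thread:

```ruby
# Stand-in normalizer: downcase, decompose accented letters (NFD), then
# drop the combining marks. This is NOT the table-based method from the post.
def normalize(str)
  str.downcase.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').strip
end

# Hypothetical record that keeps name_normalized in sync with name,
# the way a model filter would before saving.
class Place
  attr_reader :name, :name_normalized

  def initialize(name)
    self.name = name
  end

  def name=(value)
    @name = value
    @name_normalized = normalize(value)
  end
end

# The query is normalized the same way before it is compared.
def search(places, query)
  q = normalize(query)
  places.select { |place| place.name_normalized.include?(q) }
end

places = [Place.new('La Ñapa'), Place.new('Málaga'), Place.new('Oslo')]
p search(places, 'NAPA').map(&:name)   # prints ["La Ñapa"]
```

Because both the stored value and the query go through the same function, the SQL condition can stay a plain LIKE on the normalized column.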

– fxn

# Utility method that returns an ASCIIfied, downcased, and sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars. We assume
# $KCODE is 'u' in environment.rb. So far we support a wide range of Latin
# accented letters, based on the Unicode Character Palette bundled in Macs.
def self.normalize(str)
  n = str.chars.downcase.strip.to_s
  n.gsub!(/[àáâãäåāăą]/u, 'a')
  n.gsub!(/\s+/, ' ')
  n.gsub!(/æ/u, 'ae')
  n.gsub!(/[ďđ]/u, 'd')
  n.gsub!(/[çćĉċč]/u, 'c')
  n.gsub!(/[èéêëēĕėęě]/u, 'e')
  n.gsub!(/ƒ/u, 'f')
  n.gsub!(/[ĝğġģ]/u, 'g')
  n.gsub!(/[ĥħ]/u, 'h')
  n.gsub!(/[ìíîïīĩĭ]/u, 'i')
  n.gsub!(/[įıĳĵ]/u, 'j')
  n.gsub!(/[ķĸ]/u, 'k')
  n.gsub!(/[łľĺļŀ]/u, 'l')
  n.gsub!(/[ñńňņŉŋ]/u, 'n')
  n.gsub!(/[òóôõöøōőŏ]/u, 'o')
  n.gsub!(/œ/u, 'oe')
  n.gsub!(/[ŕřŗ]/u, 'r')
  n.gsub!(/[śšşŝș]/u, 's')
  n.gsub!(/[ťţŧț]/u, 't')
  n.gsub!(/[ùúûüūůűŭũų]/u, 'u')
  n.gsub!(/ŵ/u, 'w')
  n.gsub!(/[ýÿŷ]/u, 'y')
  n.gsub!(/[žżź]/u, 'z')
  n.gsub!(/[^\sa-z0-9_-]/, '')
  n
end

soulhunter · October 18, 2006, 10:39am

On 10/13/06, Xavier N. wrote:

It relies on the Unicode Hacks plugin by means of String#chars.
 n.gsub!(/[ďđ]/u,          'd')
 n.gsub!(/[òóôõöøōőŏ]/u,  'o')
end

Sweet! I’ve just been looking for a character conversion chart like
this to add a filter to Ferret. In a future version of Ferret (coming
very soon) this will be a lot easier and faster. I’ll probably put an
option on the StandardAnalyzer called :normalize_unicode or something.

Thanks Xavier,
Dave

soulhunter · October 18, 2006, 10:39am

On Oct 13, 2006, at 2:47 AM, David B. wrote:

Sweet! I’ve just been looking for a character conversion chart like
this to add a filter to Ferret. In a future version of Ferret (coming
very soon) this will be a lot easier and faster. I’ll probably put an
option on the StandardAnalyzer called :normalize_unicode or something.

Excellent!

I noticed in the mail that a q-like character was among the a-like
character class. I moved it out and am sending the normalizer again for
the archives:

# Utility method that returns an ASCIIfied, downcased, and sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars. We assume
# $KCODE is 'u' in environment.rb. So far we support a wide range of Latin
# accented letters, based on the Unicode Character Palette bundled in Macs.
def self.normalize(str)
  n = str.chars.downcase.strip.to_s
  n.gsub!(/[àáâãäåāă]/u, 'a')
  n.gsub!(/æ/u, 'ae')
  n.gsub!(/[ďđ]/u, 'd')
  n.gsub!(/[çćĉċč]/u, 'c')
  n.gsub!(/[èéêëēĕėęě]/u, 'e')
  n.gsub!(/ƒ/u, 'f')
  n.gsub!(/[ĝğġģ]/u, 'g')
  n.gsub!(/[ĥħ]/u, 'h')
  n.gsub!(/[ìíîïīĩĭ]/u, 'i')
  n.gsub!(/[įıĳĵ]/u, 'j')
  n.gsub!(/[ķĸ]/u, 'k')
  n.gsub!(/[łľĺļŀ]/u, 'l')
  n.gsub!(/[ñńňņŉŋ]/u, 'n')
  n.gsub!(/[òóôõöøōőŏ]/u, 'o')
  n.gsub!(/œ/u, 'oe')
  n.gsub!(/ą/u, 'q')
  n.gsub!(/[ŕřŗ]/u, 'r')
  n.gsub!(/[śšşŝș]/u, 's')
  n.gsub!(/[ťţŧț]/u, 't')
  n.gsub!(/[ùúûüūůűŭũų]/u, 'u')
  n.gsub!(/ŵ/u, 'w')
  n.gsub!(/[ýÿŷ]/u, 'y')
  n.gsub!(/[žżź]/u, 'z')
  n.gsub!(/\s+/, ' ')
  n.gsub!(/[^\sa-z0-9_-]/, '')
  n
end

– fxn
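For readers on a current Ruby, where String#chars returns an array and the Unicode Hacks plugin is not needed, a roughly equivalent normalizer can be sketched with the built-in String#unicode_normalize. This is an assumption-laden variant, not the method posted above: it swaps the hand-written character table for NFD decomposition plus a small hand-picked SPECIAL map (an illustrative subset) for letters that do not decompose, and it assumes Ruby 2.4 or later.

```ruby
# Letters that NFD does not split into base letter + combining mark,
# mapped by hand. Illustrative subset only.
SPECIAL = {
  'æ' => 'ae', 'œ' => 'oe', 'ø' => 'o',
  'đ' => 'd',  'ł' => 'l',  'ħ' => 'h', 'ŧ' => 't'
}.freeze

def normalize(str)
  n = str.downcase.strip
  # Decompose accented letters, then drop the combining marks (category Mn).
  n = n.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
  # Replace the non-decomposable letters using the hash as the lookup.
  n = n.gsub(Regexp.union(SPECIAL.keys), SPECIAL)
  # Same final sanitizing steps as the posted method.
  n = n.gsub(/\s+/, ' ')
  n.gsub(/[^\sa-z0-9_-]/, '')
end

puts normalize('  La Ñapa  ')    # prints: la napa
puts normalize('Œuvre Ærø!')     # prints: oeuvre aero
```

Decomposition covers most accented Latin letters without listing them one by one, at the cost of less control over individual mappings than the explicit table gives.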