How to have 'o' == 'ö'


#1

Greetings,

(using acts_as_ferret)

So I have a book title “Möngrel „Horsemen“” in my index.

Searching for “Möngrel” retrieves the document.

But I would like searching for “Mongrel” to also retrieve the document,
which it currently does not.

Anyone have any good solutions to this problem?

I suppose I could filter the documents and queries first with something
like:

Iconv.new('US-ASCII//TRANSLIT', 'utf-8').iconv('Möngrel „Horsemen“').gsub(/[^a-zA-Z0-9]/im, '')

But perhaps there is a better, or built in solution.
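
For anyone trying this on a current Ruby: Iconv was dropped from the standard library in Ruby 2.0, so the one-liner above needs the separate iconv gem there. A minimal sketch of the same filtering idea using only core String methods follows. It assumes Ruby 2.2 or later (for String#unicode_normalize); fold_accents is a hypothetical helper name, not something Ferret or acts_as_ferret provides.

```ruby
# Hypothetical helper, not part of Ferret or acts_as_ferret.
# Assumes Ruby 2.2+ (String#unicode_normalize) and UTF-8 input.
def fold_accents(text)
  # NFD decomposition splits "ö" into "o" followed by a combining diaeresis.
  decomposed = text.unicode_normalize(:nfd)
  # \p{Mn} matches nonspacing (combining) marks; dropping them leaves the base letter.
  decomposed.gsub(/\p{Mn}/, '')
end

puts fold_accents('Möngrel „Horsemen“')
```

Letters that have no canonical decomposition (ø, ł, æ and similar) pass through unchanged, so they would still need an explicit mapping. The same function has to be applied to both the indexed text and the query string for the two to match.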

Thanks


#2

On Fri, Jan 19, 2007 at 06:12:12PM +0100, John Private wrote:

But perhaps there is a better, or built in solution.

I don’t think so - a custom Analyzer would be the right place for
this.

Jens


webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer removed_email_address@domain.invalid
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66


#3

On Jan 22, 2007, at 2:49 PM, Jens K. wrote:

But perhaps there is a better, or built in solution.

I don’t think so - a custom Analyzer would be the right place for
this.

We use a normalizer to store/query (to be revised for Rails 1.2):

# Utility method that returns an ASCIIfied, downcased, and
# sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars.
# We assume $KCODE is 'u' in environment.rb. By now we support a wide
# range of latin accented letters, based on the Unicode Character
# Palette bundled in Macs.
def self.normalize(str)
  n = str.chars.downcase.strip.to_s
  n.gsub!(/[àáâãäåāă]/, 'a')
  n.gsub!(/æ/, 'ae')
  n.gsub!(/[ďđ]/, 'd')
  n.gsub!(/[çćčĉċ]/, 'c')
  n.gsub!(/[èéêëēęěĕė]/, 'e')
  n.gsub!(/ƒ/, 'f')
  n.gsub!(/[ĝğġģ]/, 'g')
  n.gsub!(/[ĥħ]/, 'h')
  n.gsub!(/[ìíîïīĩĭįı]/, 'i')   # includes i with ogonek and dotless i
  n.gsub!(/ĳ/, 'ij')            # ij ligature
  n.gsub!(/ĵ/, 'j')
  n.gsub!(/[ķĸ]/, 'k')
  n.gsub!(/[łľĺļŀ]/, 'l')
  n.gsub!(/[ñńňņŉŋ]/, 'n')
  n.gsub!(/[òóôõöøōőŏ]/, 'o')
  n.gsub!(/œ/, 'oe')
  n.gsub!(/ą/, 'a')             # a with ogonek
  n.gsub!(/[ŕřŗ]/, 'r')
  n.gsub!(/[śšşŝș]/, 's')
  n.gsub!(/[ťţŧț]/, 't')
  n.gsub!(/[ùúûüūůűŭũų]/, 'u')
  n.gsub!(/ŵ/, 'w')
  n.gsub!(/[ýÿŷ]/, 'y')
  n.gsub!(/[žżź]/, 'z')
  n.gsub!(/\s+/, ' ')            # collapse runs of whitespace
  n.gsub!(/[^\sa-z0-9_-]/, '')   # drop anything still outside the safe set
  n
end

And this convenience class method to use in Rails models with
acts_as_ferret (slightly edited):

# Wrapper function to normalize fields before calling acts_as_ferret.
# Usage: index_fields [:field1, :field2], :option1 => …, :option2 => …
# Please note that your queries should use a "_normalized" suffix on
# each field, i.e: +field1_normalized:foo
class ActiveRecord::Base
  def self.index_fields(fields, *options)
    aaf_fields = []
    fields.each do |f|
      class_eval <<-EOS
        def #{f}_normalized
          MyAppUtils.normalize(#{f})
        end
      EOS
      aaf_fields.push ":#{f}_normalized"
    end
    aaf_call = 'acts_as_ferret :fields => [' + aaf_fields.join(',') + ']'
    options.each do |option_pair|
      option_pair.each do |key, value|
        aaf_call << ", :#{key} => #{value}"
      end
    end
    logger.info aaf_call
    class_eval(aaf_call)
  end
end
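
The wrapper above builds Ruby source as strings and evaluates it. As a rough sketch of the accessor-generating half only, the same "_normalized" readers can be defined with define_method instead of string eval. Everything here is an assumption for illustration: NormalizedFields and normalized_readers are hypothetical names, a plain Struct stands in for an ActiveRecord model, the normalizer is a placeholder for MyAppUtils.normalize, and acts_as_ferret is not called at all.

```ruby
# Hypothetical module, for illustration only; not part of acts_as_ferret.
module NormalizedFields
  # Placeholder normalizer standing in for MyAppUtils.normalize from the post.
  def self.normalize(value)
    value.to_s.downcase.strip.gsub(/\s+/, ' ')
  end

  # Defines a <field>_normalized reader for each field and returns the
  # list of generated reader names (the names one would hand to the indexer).
  def normalized_readers(*fields)
    fields.map do |field|
      reader = :"#{field}_normalized"
      define_method(reader) { NormalizedFields.normalize(public_send(field)) }
      reader
    end
  end
end

# A plain Struct stands in for a Rails model here.
Book = Struct.new(:title)
Book.extend(NormalizedFields)

indexed = Book.normalized_readers(:title)
book = Book.new('  Mongrel   Horsemen ')
puts indexed.inspect
puts book.title_normalized
```

Returning the generated names keeps the two steps separate: the caller decides how to pass them on as the :fields option, rather than having the call assembled as a string.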

– fxn


#4

On 1/23/07, Xavier N. removed_email_address@domain.invalid wrote:

Utility method that returns an ASCIIfied, downcased, and
sanitized string.
[...]

– fxn

Sorry to bring this one back from the archives (I’m going through all
the email I’ve missed in my long absence). Anyway, I thought that
since not even Jens knew about this I should point out the existence
of MappingFilter:

http://ferret.davebalmain.com/api/classes/Ferret/Analysis/MappingFilter.html

It essentially does the same thing as Xavier’s code above but it is
much faster. It compiles the mappings to a single deterministic finite
automaton (DFA):

http://en.wikipedia.org/wiki/Deterministic_finite_state_machine

Basically, this means the filter does a single pass through the string
to do all the mappings rather than a pass for each mapping.
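
To make the single-pass point concrete without depending on Ferret, here is a plain-Ruby sketch. It is not MappingFilter and makes no claim about how Ferret implements it; it only contrasts "one gsub per mapping" with one combined pattern plus a lookup table. MAPPING, LOOKUP, PATTERN, and map_chars are hypothetical names, and the table is a small subset of the mappings posted earlier in the thread.

```ruby
# Illustrative subset of the mappings from earlier in the thread.
# Keys are arrays of source characters, values are their replacements.
MAPPING = {
  %w[à á â ã ä å] => 'a',
  %w[ò ó ô õ ö ø] => 'o',
  %w[ù ú û ü]     => 'u',
  %w[æ]           => 'ae',
  %w[œ]           => 'oe'
}.freeze

# Flatten into a single character => replacement table.
LOOKUP = MAPPING.each_with_object({}) do |(chars, replacement), table|
  chars.each { |char| table[char] = replacement }
end.freeze

# One alternation that matches any mapped character.
PATTERN = Regexp.union(LOOKUP.keys)

# Hypothetical helper: a single gsub call walks the string once and
# looks each match up in the table (String#gsub accepts a Hash).
def map_chars(text)
  text.gsub(PATTERN, LOOKUP)
end

puts map_chars('Möngrel')
```

Adding a mapping then means adding a table entry rather than another pass over every string.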

Hope that helps somebody,
Dave