How to have 'o' == 'ö'

johnpetterson · January 19, 2007, 6:12pm

Greetings,

(using acts_as_ferret)

So I have a book title “Möngrel „Horsemen“” in my index.

Searching for “Möngrel” retrieves the document.

But I would like a search for “Mongrel” to retrieve the document as well,
which it currently does not.

Anyone have any good solutions to this problem?

I suppose I could filter the documents and queries first with something
like:

(Iconv.new('US-ASCII//TRANSLIT', 'utf-8').iconv "Möngrel „Horsemen“").gsub(/[^a-zA-Z0-9]/im, "")

But perhaps there is a better, or built in solution.

Thanks
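
For readers on a current Ruby, where Iconv is no longer part of the standard library, the filtering idea above can be sketched roughly as follows. This is only an illustration, assuming Ruby 2.2+ for String#unicode_normalize; the name ascii_fold is made up here, and unlike the one-liner above it keeps spaces so that words stay separate.

```ruby
# Sketch only: ascii_fold is a hypothetical helper, not part of Ferret
# or acts_as_ferret. Assumes Ruby 2.2+ and UTF-8 encoded input.
def ascii_fold(str)
  str.unicode_normalize(:nfd)     # decompose "ö" into "o" + combining diaeresis
     .gsub(/\p{Mn}/, "")          # drop the combining marks
     .gsub(/[^a-zA-Z0-9 ]/, "")   # drop whatever is still not ASCII alphanumeric or space
end

puts ascii_fold("Möngrel „Horsemen“")  # prints "Mongrel Horsemen"
```

Letters that have no canonical decomposition (ø, æ, ß and similar) are simply removed by the last step, so a real setup would probably want an explicit mapping for those.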

johnpetterson · January 22, 2007, 2:55pm

On Fri, Jan 19, 2007 at 06:12:12PM +0100, John Private wrote:

> Anyone have any good solutions to this problem?
>
> I suppose I could filter the documents and queries first with something
> like:
>
> (Iconv.new('US-ASCII//TRANSLIT', 'utf-8').iconv "Möngrel „Horsemen“").gsub(/[^a-zA-Z0-9]/im, "")
>
> But perhaps there is a better, or built in solution.

I don’t think so - a custom Analyzer would be the right place for
this.

Jens

--
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

johnpetterson · January 22, 2007, 4:27pm

On Jan 22, 2007, at 2:49 PM, Jens K. wrote:

> > But perhaps there is a better, or built in solution.
>
> I don’t think so - a custom Analyzer would be the right place for
> this.

We use a normalizer to store/query (to be revised for Rails 1.2):

# Utility method that returns an ASCIIfied, downcased, and sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars. We assume
# $KCODE is 'u' in environment.rb. By now we support a wide range of Latin
# accented letters, based on the Unicode Character Palette bundled in Macs.
def self.normalize(str)
  n = str.chars.downcase.strip.to_s
  n.gsub!(/[àáâãäåāăą]/, 'a')
  n.gsub!(/æ/, 'ae')
  n.gsub!(/[ďđ]/, 'd')
  n.gsub!(/[çćčĉċ]/, 'c')
  n.gsub!(/[èéêëēęěĕė]/, 'e')
  n.gsub!(/ƒ/, 'f')
  n.gsub!(/[ĝğġģ]/, 'g')
  n.gsub!(/[ĥħ]/, 'h')
  n.gsub!(/[ìíîïīĩĭįı]/, 'i')
  n.gsub!(/ĳ/, 'ij')
  n.gsub!(/ĵ/, 'j')
  n.gsub!(/[ķĸ]/, 'k')
  n.gsub!(/[łľĺļŀ]/, 'l')
  n.gsub!(/[ñńňņŉŋ]/, 'n')
  n.gsub!(/[òóôõöøōőŏ]/, 'o')
  n.gsub!(/œ/, 'oe')
  n.gsub!(/[ŕřŗ]/, 'r')
  n.gsub!(/[śšşŝș]/, 's')
  n.gsub!(/[ťţŧț]/, 't')
  n.gsub!(/[ùúûüūůűŭũų]/, 'u')
  n.gsub!(/ŵ/, 'w')
  n.gsub!(/[ýÿŷ]/, 'y')
  n.gsub!(/[žżź]/, 'z')
  n.gsub!(/\s+/, ' ')            # collapse runs of whitespace
  n.gsub!(/[^\sa-z0-9_-]/, '')   # drop anything still outside the safe set
  n
end

And this convenience class method to use in Rails models with
acts_as_ferret (slightly edited):

# Wrapper function to normalize fields before calling acts_as_ferret
#
# Usage: index_fields [:field1, :field2], :option1 => …, :option2 => …
#
# Please note that your queries should use a "_normalized" suffix on
# each field, e.g.: +field1_normalized:foo
class ActiveRecord::Base
  def self.index_fields(fields, *options)
    aaf_fields = []
    fields.each do |f|
      # Define <field>_normalized, returning the normalized field value.
      class_eval <<-EOS
        def #{f}_normalized
          MyAppUtils.normalize(#{f})
        end
      EOS
      aaf_fields.push ":#{f}_normalized"
    end
    # Build the acts_as_ferret call as source text, then evaluate it.
    aaf_call = 'acts_as_ferret :fields => [' + aaf_fields.join(',') + ']'
    options.each do |option_pair|
      option_pair.each do |key, value|
        # inspect keeps symbols, strings and hashes valid as Ruby source
        aaf_call << ", :#{key} => #{value.inspect}"
      end
    end
    logger.info aaf_call
    class_eval(aaf_call)
  end
end

-- fxn
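
The generated-accessor idea in index_fields can be tried without Rails or acts_as_ferret. The following is a minimal plain-Ruby sketch of the pattern only: StubNormalizer, NormalizedFields, normalized_fields and Book are hypothetical names invented for the illustration, and the stub handles just three umlauts instead of the full table above.

```ruby
# Stand-in for MyAppUtils.normalize (hypothetical, deliberately tiny).
module StubNormalizer
  def self.normalize(str)
    str.downcase.tr("äöü", "aou")
  end
end

# Hypothetical mixin: defines a "<field>_normalized" reader per field
# and returns the generated names (what one would hand to the indexer).
module NormalizedFields
  def normalized_fields(*fields)
    fields.map do |f|
      name = :"#{f}_normalized"
      define_method(name) { StubNormalizer.normalize(public_send(f).to_s) }
      name
    end
  end
end

class Book
  extend NormalizedFields
  attr_reader :title

  def initialize(title)
    @title = title
  end

  INDEXED = normalized_fields(:title)
end

book  = Book.new("Möngrel")
query = StubNormalizer.normalize("Mongrel")  # normalize the query side too
puts book.title_normalized == query          # prints "true"
```

The point of the last two lines is that the same normalizer has to run on both the indexed value and the query term, otherwise the two sides never meet.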

johnpetterson · February 24, 2007, 1:58pm

On 1/23/07, Xavier N. [email protected] wrote:

> Utility method that returns an ASCIIfied, downcased, and
> sanitized string.
> [...]

Sorry to bring this one back from the archives (I’m going through all
the email I’ve missed in my long absence). Anyway, I thought that
since not even Jens knew about this I should point out the existence
of MappingFilter:

http://ferret.davebalmain.com/api/classes/Ferret/Analysis/MappingFilter.html

It essentially does the same thing as Xavier’s code above but it is
much faster. It compiles the mappings to a single deterministic finite
automaton (DFA):

http://en.wikipedia.org/wiki/Deterministic_finite_state_machine

Basically, this means the filter does a single pass through the string
to do all the mappings rather than a pass for each mapping.

Hope that helps somebody,
Dave
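
To illustrate the difference between one pass per mapping and one pass overall, here is a plain-Ruby sketch that does not use Ferret at all. It folds a mapping table into a single alternation and hands it to one gsub call, so the string is scanned once instead of once per rule. MAPPING, FLAT, PATTERN and map_chars are names made up for this example, the table is a small subset, and Ruby's regexp engine is not necessarily a DFA, so this shows the single-scan idea rather than how MappingFilter is implemented.

```ruby
# Hypothetical mapping table (small subset, for illustration only).
MAPPING = {
  %w[à á â ã ä å] => "a",
  "æ"             => "ae",
  %w[ò ó ô õ ö ø] => "o",
  "œ"             => "oe",
  %w[ù ú û ü]     => "u"
}

# Flatten { ["à", "á"] => "a" } into { "à" => "a", "á" => "a" }.
FLAT = MAPPING.each_with_object({}) do |(from, to), flat|
  Array(from).each { |char| flat[char] = to }
end

# One alternation that matches any source string in the table.
PATTERN = Regexp.union(FLAT.keys)

# A single gsub call; each match is looked up in FLAT for its replacement.
def map_chars(str)
  str.gsub(PATTERN, FLAT)
end

puts map_chars("möngrel œuvre")  # prints "mongrel oeuvre"
```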