Greetings,
(using acts_as_ferret)
So I have a book title "Möngrel „Horsemen“" in my index.
Searching for "Möngrel" retrieves the document.
But I would also like a search for "Mongrel" to retrieve the document,
which it currently does not.
Anyone have any good solutions to this problem?
I suppose I could filter the documents and queries first with something
like:
Iconv.new('US-ASCII//TRANSLIT', 'utf-8').iconv("Möngrel „Horsemen“").gsub(/[^a-zA-Z0-9]/, "")
But perhaps there is a better, or built in solution.
Thanks
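
For readers on a current Ruby, where Iconv is no longer part of the standard library, here is a minimal sketch of the same pre-filtering idea. It swaps in a different technique: Unicode NFD decomposition via String#unicode_normalize followed by stripping combining marks, rather than Iconv transliteration. The helper name ascii_fold is hypothetical. Letters with no canonical decomposition (ø, ł, æ and similar) are not folded by this approach and are simply dropped by the final filter, so treat it as a starting point only.

```ruby
# Hypothetical helper, not part of Ferret or acts_as_ferret.
# Folds accented Latin letters to ASCII and drops everything
# that is not a letter or digit, like the one-liner above.
def ascii_fold(str)
  # Split each accented letter into base letter + combining mark(s).
  decomposed = str.unicode_normalize(:nfd)
  # Remove the combining marks (Unicode category Mn).
  without_marks = decomposed.gsub(/\p{Mn}/, '')
  # Keep ASCII letters and digits only (this also removes spaces).
  without_marks.gsub(/[^a-zA-Z0-9]/, '')
end

puts ascii_fold("Möngrel „Horsemen“")  # prints "MongrelHorsemen"
```

The same function would have to be applied to both the indexed text and the query string, otherwise the two sides will not match.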
on 19.01.2007 18:12
on 22.01.2007 14:55
On Fri, Jan 19, 2007 at 06:12:12PM +0100, John Private wrote:
> Anyone have any good solutions to this problem?
>
> I suppose I could filter the documents and queries first with something
> like:
>
> Iconv.new('US-ASCII//TRANSLIT', 'utf-8').iconv("Möngrel „Horsemen“").gsub(/[^a-zA-Z0-9]/, "")
>
> But perhaps there is a better, or built in solution.

I don't think so - a custom Analyzer would be the right place for this.

Jens

--
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer@webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66
on 22.01.2007 16:27
On Jan 22, 2007, at 2:49 PM, Jens Kraemer wrote:
>> But perhaps there is a better, or built in solution.
>
> I don't think so - a custom Analyzer would be the right place for
> this.

We use a normalizer to store/query (to be revised for Rails 1.2):

  # Utility method that returns an ASCIIfied, downcased, and sanitized string.
  # It relies on the Unicode Hacks plugin by means of String#chars. We assume
  # $KCODE is 'u' in environment.rb. By now we support a wide range of latin
  # accented letters, based on the Unicode Character Palette bundled in Macs.
  def self.normalize(str)
    n = str.chars.downcase.strip.to_s
    n.gsub!(/[àáâãäåāăą]/, 'a')
    n.gsub!(/æ/, 'ae')
    n.gsub!(/[ďđ]/, 'd')
    n.gsub!(/[çćčĉċ]/, 'c')
    n.gsub!(/[èéêëēęěĕė]/, 'e')
    n.gsub!(/ƒ/, 'f')
    n.gsub!(/[ĝğġģ]/, 'g')
    n.gsub!(/[ĥħ]/, 'h')
    n.gsub!(/[ìíîïīĩĭįı]/, 'i')
    n.gsub!(/ĳ/, 'ij')
    n.gsub!(/ĵ/, 'j')
    n.gsub!(/[ķĸ]/, 'k')
    n.gsub!(/[łľĺļŀ]/, 'l')
    n.gsub!(/[ñńňņŉŋ]/, 'n')
    n.gsub!(/[òóôõöøōőŏ]/, 'o')
    n.gsub!(/œ/, 'oe')
    n.gsub!(/[ŕřŗ]/, 'r')
    n.gsub!(/[śšşŝș]/, 's')
    n.gsub!(/[ťţŧț]/, 't')
    n.gsub!(/[ùúûüūůűŭũų]/, 'u')
    n.gsub!(/ŵ/, 'w')
    n.gsub!(/[ýÿŷ]/, 'y')
    n.gsub!(/[žżź]/, 'z')
    n.gsub!(/\s+/, ' ')
    n.gsub!(/[^\sa-z0-9_-]/, '')
    n
  end

And this convenience class method to use in Rails models with acts_as_ferret (slightly edited):

  # Wrapper function to normalize fields before calling acts_as_ferret
  #
  # Usage: index_fields [:field1, :field2], :option1 => ..., :option2 => ...
  #
  # Please note that your queries should use a "_normalized" suffix on
  # each field, i.e: +field1_normalized:foo
  class ActiveRecord::Base
    def self.index_fields(fields, *options)
      aaf_fields = []
      fields.each do |f|
        # Define a <field>_normalized reader on the model.
        class_eval <<-EOS
          def #{f}_normalized
            MyAppUtils.normalize(#{f})
          end
        EOS
        aaf_fields.push ":#{f}_normalized"
      end
      aaf_call = 'acts_as_ferret :fields => [' + aaf_fields.join(',') + ']'
      options.each do |option_pair|
        option_pair.each do |key, value|
          # inspect keeps strings and symbols valid in the generated code
          aaf_call << ", :#{key} => #{value.inspect}"
        end
      end
      logger.info aaf_call
      class_eval(aaf_call)
    end
  end

-- fxn
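
To show what the wrapper above boils down to, here is a minimal sketch outside Rails: for each field it defines a `<field>_normalized` reader and returns the list of generated names, which is what would then be handed to the indexer. It uses define_method instead of evaluating a string, which is a design variation and not the code from the post. Normalizer, NormalizedFields, normalized_fields, Book and INDEXED are all hypothetical names, and the normalizer here is a shortened stand-in based on Unicode decomposition, not the gsub table above.

```ruby
# Shortened stand-in for the normalizer (assumption: NFD folding is enough here).
module Normalizer
  def self.normalize(str)
    str.unicode_normalize(:nfd)   # split accented letters into base + mark
       .gsub(/\p{Mn}/, '')        # drop the combining marks
       .downcase.strip
       .gsub(/\s+/, ' ')          # collapse runs of whitespace
  end
end

# Class-level macro: defines one "<field>_normalized" reader per field
# and returns the generated method names.
module NormalizedFields
  def normalized_fields(*fields)
    fields.map do |field|
      name = :"#{field}_normalized"
      define_method(name) { Normalizer.normalize(public_send(field).to_s) }
      name
    end
  end
end

class Book
  extend NormalizedFields
  attr_reader :title

  def initialize(title)
    @title = title
  end

  # These are the names that would be passed on as the indexed fields.
  INDEXED = normalized_fields(:title)
end

puts Book.new("  Möngrel   Horsemen ").title_normalized  # prints "mongrel horsemen"
```

As in the post, queries then have to target the `_normalized` field names and be normalized with the same function.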
on 24.02.2007 13:58
On 1/23/07, Xavier Noria <fxn@hashref.com> wrote:
> # Utility method that returns an ASCIIfied, downcased, and
> n.gsub!(/æ/, 'ae')
> n.gsub!(/[ñńňņŉŋ]/, 'n')
> n.gsub!(/\s+/, ' ')
> => ..., :option2 => ...
> end
> logger.info aaf_call
> class_eval(aaf_call)
> end
> end
>
> -- fxn

Sorry to bring this one back from the archives (I'm going through all the email I've missed in my long absence). Anyway, I thought that since not even Jens knew about this I should point out the existence of MappingFilter:

http://ferret.davebalmain.com/api/classes/Ferret/Analysis/MappingFilter.html

It essentially does the same thing as Xavier's code above but it is much faster. It compiles the mappings to a single deterministic finite automaton (DFA):

http://en.wikipedia.org/wiki/Deterministic_finite_state_machine

Basically, this means the filter does a single pass through the string to do all the mappings rather than a pass for each mapping.

Hope that helps somebody,
Dave
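
The "one pass instead of one pass per mapping" idea can be illustrated in plain Ruby without Ferret. The sketch below is not MappingFilter and does not build a DFA; it only merges all mappings into one pattern so that a single gsub call handles every replacement. The mapping is abbreviated, and ACCENT_MAPPING, FLAT_MAPPING, ACCENT_PATTERN and fold_accents are hypothetical names.

```ruby
# Abbreviated mapping: a key is one string or a list of strings,
# the value is the replacement.
ACCENT_MAPPING = {
  %w[à á â ã ä å ā ă ą]   => 'a',
  'æ'                     => 'ae',
  %w[ç ć č ĉ ċ]           => 'c',
  %w[è é ê ë ē ę ě ĕ ė]   => 'e',
  %w[ò ó ô õ ö ø ō ő ŏ]   => 'o',
  'œ'                     => 'oe',
  %w[ù ú û ü ū ů ű ŭ ũ ų] => 'u'
}

# Flatten to { "à" => "a", "á" => "a", ... } for direct lookup.
FLAT_MAPPING = ACCENT_MAPPING.each_with_object({}) do |(sources, target), flat|
  Array(sources).each { |source| flat[source] = target }
end

# One alternation that matches any of the source strings.
ACCENT_PATTERN = Regexp.union(FLAT_MAPPING.keys)

# A single gsub call; each match is replaced by its entry in FLAT_MAPPING.
def fold_accents(str)
  str.gsub(ACCENT_PATTERN, FLAT_MAPPING)
end

puts fold_accents('möngrel')  # prints "mongrel"

# Assumption, based on my reading of the linked API page: with Ferret
# installed, a hash shaped like ACCENT_MAPPING can be given to the filter
# inside a custom analyzer's token_stream method, roughly as
#   MappingFilter.new(token_stream, ACCENT_MAPPING)
```

Xavier's version walks the string once per gsub! call, so the cost grows with the number of mappings; here the string is scanned by one call regardless of how many entries the table has.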