Hi,
I’m working on a Ferret-based application which indexes content in all
European languages. Thus, I have to deal with those funny European
characters.
After googling a bit, I decided to go with a custom European
analyzer based on MappingFilter, as suggested in the Ferret rdoc.
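In a nutshell, that approach boils down to something like this (with a toy
mapping rather than the full one shown at the bottom of this post):

require 'ferret'

# A much smaller analyzer than the real one, just to show the shape of
# the approach: wrap the standard token stream in a MappingFilter.
class TinyAnalyzer < Ferret::Analysis::StandardAnalyzer
  def token_stream(field, text)
    Ferret::Analysis::MappingFilter.new(super, ['é','è','ê','ë'] => 'e')
  end
end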
Everything works fine with Ferret 0.11.3 on Mac OS X.
But this application needs to run on both Windows and Mac OS X. Since
there’s no mswin32 gem for 0.11.3, I decided to downgrade to 0.10.9 and
replace MappingFilter with a custom-made filter as suggested by David in
the following post:
http://www.ruby-forum.com/topic/85299#156036
See the code I wrote at the bottom of this post. The token streams
produced by this analyzer work fine in unit tests, but the indexer fails
to use them when a document is added. Here's the stack trace I get (on
Mac OS X):
wrong argument type Ferret::Analysis::ToASCIIFilter (expected Data)
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:277:in `text='
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:277:in `add_document'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:277:in `<<'
/usr/local/lib/ruby/1.8/monitor.rb:238:in `synchronize'
/usr/local/lib/ruby/gems/1.8/gems/ferret-0.10.9/lib/ferret/index.rb:252:in `<<'
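For context, the failure is triggered by code along these lines (the path
and field names here are just placeholders):

require 'ferret'

# Index built with the custom analyzer; adding a document with accented
# text raises the error above.
index = Ferret::Index::Index.new(
  :path     => '/tmp/test_index',
  :analyzer => Ferret::Analysis::EuropeanAnalyzer.new
)
index << {:title => 'Élégie', :content => 'Un été à Orléans'}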
I tried several variants of the code (for example, avoiding super and
inheritance), but without success.
Therefore, I'm wondering whether 0.11.3 will be available soon on
Windows, whether I can build the gem myself (I guess I'll need a
Microsoft C compiler), or whether I can do things differently to get a
European analyzer working with 0.10.9.
Thanks for your help.
Laurent
require 'ferret'
require 'jcode'

# jcode's multibyte-aware String#tr relies on $KCODE; make sure it is UTF-8
$KCODE = 'u'

module Ferret::Analysis
  ACCENTUATED_CHARS =
    'àáâãäåāăçćčĉċďđèéêëēęěĕėĝğġģĥħìíîïīĩĭįıĳĵķĸłľĺļŀñńňņŉŋòóôõöøōőŏœąŕřŗśšşŝșťţŧțùúûüūůűŭũųŵýÿŷžżź'
  REPLACEMENT_CHARS =
    'aaaaaaaacccccddeeeeeeeeegggghhiiiiiiijjjjkklllllnnnnnnooooooooooqrrrsssssttttuuuuuuuuuuwyyyzzz'

  MAPPING = {
    ['à','á','â','ã','ä','å','ā','ă'] => 'a',
    'æ' => 'ae',
    ['ď','đ'] => 'd',
    ['ç','ć','č','ĉ','ċ'] => 'c',
    ['è','é','ê','ë','ē','ę','ě','ĕ','ė'] => 'e',
    ['ƒ'] => 'f',
    ['ĝ','ğ','ġ','ģ'] => 'g',
    ['ĥ','ħ'] => 'h',
    ['ì','í','î','ï','ī','ĩ','ĭ'] => 'i',
    ['į','ı','ĳ','ĵ'] => 'j',
    ['ķ','ĸ'] => 'k',
    ['ł','ľ','ĺ','ļ','ŀ'] => 'l',
    ['ñ','ń','ň','ņ','ŉ','ŋ'] => 'n',
    ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ'] => 'o',
    ['œ'] => 'oe',
    ['ą'] => 'q',
    ['ŕ','ř','ŗ'] => 'r',
    ['ś','š','ş','ŝ','ș'] => 's',
    ['ť','ţ','ŧ','ț'] => 't',
    ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų'] => 'u',
    ['ŵ'] => 'w',
    ['ý','ÿ','ŷ'] => 'y',
    ['ž','ż','ź'] => 'z'
  }

  class TokenFilter < TokenStream
    # Construct a token stream filtering the given input.
    def initialize(input)
      @input = input
    end
  end

  # Replaces accented characters with their ASCII equivalents.
  class ToASCIIFilter < TokenFilter
    def next()
      token = @input.next()
      unless token.nil?
        token.text = token.text.tr(ACCENTUATED_CHARS, REPLACEMENT_CHARS)
      end
      token
    end
  end

  class EuropeanAnalyzer < StandardAnalyzer
    def token_stream(field, string)
      if defined?(MappingFilter)
        return MappingFilter.new(super, MAPPING) # 0.11.x
      else
        return ToASCIIFilter.new(super)          # 0.10.x
      end
    end
  end
end
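And this is the kind of unit test that passes for me (simplified; the
sample strings and expected tokens are only illustrative):

require 'test/unit'
require 'ferret'

class EuropeanAnalyzerTest < Test::Unit::TestCase
  def test_accented_characters_are_folded_to_ascii
    analyzer = Ferret::Analysis::EuropeanAnalyzer.new
    stream   = analyzer.token_stream(:content, 'été élégant')
    tokens   = []
    while token = stream.next
      tokens << token.text
    end
    assert_equal %w(ete elegant), tokens
  end
end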