How to deal with accentuated chars in 0.10.8?

Edgar · October 19, 2006, 9:57pm

I’m starting to use Ferret and acts_as_ferret.

I need to use something like EuropeanAnalyzer.

For example, if the user searches for “gonzalez”, the search should find
documents that contain the term “gonzález”.
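To make that requirement concrete, here is a minimal sketch that is independent of Ferret. It assumes a recent Ruby (2.2 or later, for `String#unicode_normalize`), and `fold_accents` is a hypothetical helper name, not something Ferret provides. The idea is that the indexed term and the query term match once both are folded to unaccented lowercase.

```ruby
# Hypothetical helper, not part of Ferret: strip accents by Unicode decomposition.
def fold_accents(text)
  # NFD splits "á" into "a" plus a combining acute accent (U+0301);
  # \p{Mn} matches those combining marks so they can be removed.
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

indexed_term = fold_accents('González')
query_term   = fold_accents('gonzalez')

puts indexed_term              # gonzalez
puts indexed_term == query_term # true
```

For the search to behave this way, the same folding would have to be applied both at index time and at query time.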

The EuropeanAnalyzer is based on Ferret::Analysis::TokenFilter, but it
seems that this is not available in 0.10.x.

What is the way to do this?

Edgar · October 20, 2006, 8:03am

On 10/20/06, Edgar wrote:

> What is the way to do this?

Try this. Make sure you run Ruby with the -KU flag so that strings are treated as UTF-8.

    require 'rubygems'
    require 'ferret'
    require 'jcode'

    ACCENTUATED_CHARS = 'ÅÄÀAÂåäàâaÖÔôöÉÈÊËéèêëÜüùç'
    REPLACEMENT_CHARS = 'aaaaaaaaaaooooeeeeeeeeuuuc'

    module Ferret::Analysis
      class TokenFilter < TokenStream
        # Construct a token stream filtering the given input.
        def initialize(input)
          @input = input
        end
      end

      # replace accentuated chars with ASCII one
      class ToASCIIFilter < TokenFilter
        def next()
          token = @input.next()
          unless token.nil?
            token.text = token.text.downcase.tr(ACCENTUATED_CHARS,
                                                REPLACEMENT_CHARS)
          end
          token
        end
      end

      class EuropeanAnalyzer
        def token_stream(field, string)
          return ToASCIIFilter.new(StandardTokenizer.new(string))
        end
      end
    end

    analyzer = Ferret::Analysis::EuropeanAnalyzer.new
    ts = analyzer.token_stream('xxx', "Let's see what " +
                               "happens to ÅÄÀAÂåäàâaÖÔôöÉÈÊËéèêëÜüùç")

    while t = ts.next
      puts t
    end
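The filter above wraps another token stream and rewrites each token's text as it passes through. The sketch below shows the same pattern without Ferret so it can be run on its own. `Token`, `WhitespaceStream`, `FoldingFilter` and `folded_terms` are stand-ins invented for illustration, not Ferret classes, and it assumes a modern Ruby where `String#tr` handles UTF-8 characters without `jcode` or `-KU`. The character table is the one from the post. Note that it has no entry for “á”, so “gonzález” would pass through unchanged unless the table is extended.

```ruby
# Stand-ins for illustration only; these are NOT Ferret classes.
Token = Struct.new(:text)

# Returns one Token per whitespace-separated word, then nil.
class WhitespaceStream
  def initialize(string)
    @words = string.split
  end

  def next
    word = @words.shift
    word && Token.new(word)
  end
end

# Same character table as in the post above.
ACCENTUATED_CHARS = 'ÅÄÀAÂåäàâaÖÔôöÉÈÊËéèêëÜüùç'
REPLACEMENT_CHARS = 'aaaaaaaaaaooooeeeeeeeeuuuc'

# Wraps another stream and folds each token's text on the way through.
class FoldingFilter
  def initialize(input)
    @input = input
  end

  def next
    token = @input.next
    token.text = token.text.downcase.tr(ACCENTUATED_CHARS, REPLACEMENT_CHARS) if token
    token
  end
end

# Drains the filtered stream and collects the folded terms.
def folded_terms(string)
  stream = FoldingFilter.new(WhitespaceStream.new(string))
  terms = []
  while (token = stream.next)
    terms << token.text
  end
  terms
end

p folded_terms('Élève Über')  # ["eleve", "uber"]
p folded_terms('gonzález')    # ["gonzález"], since "á" is not in the table
```

Adding the missing characters (for example “á”, “í”, “ó”, “ú”, “ñ”) to both strings, in matching positions, would be one way to cover Spanish terms.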

Edgar · October 20, 2006, 4:26pm

David,

Thanks for the tip, but I’ll try your latest release (0.10.13).