Accented characters


#1

Hello,

I want to clean up accented characters in my index, using acts_as_ferret
in a Rails project. I searched this forum, and found the best solution
is to use an analyzer.
I created something like this:

class PortugueseAnalyzer
  include Ferret::Analysis

  # Accented characters (keys) and the plain replacement for each group.
  MAPPING = {
    ['à','á','â','ã','ä','å','ā','ă']         => 'a',
    'æ'                                       => 'ae',
    ['ď','đ']                                 => 'd',
    ['ç','ć','č','ĉ','ċ']                     => 'c',
    ['è','é','ê','ë','ē','ę','ě','ĕ','ė']     => 'e',
    ['ƒ']                                     => 'f',
    ['ĝ','ğ','ġ','ģ']                         => 'g',
    ['ĥ','ħ']                                 => 'h',
    ['ì','ì','í','î','ï','ī','ĩ','ĭ']         => 'i',
    ['į','ı','ĳ','ĵ']                         => 'j',
    ['ķ','ĸ']                                 => 'k',
    ['ł','ľ','ĺ','ļ','ŀ']                     => 'l',
    ['ñ','ń','ň','ņ','ŉ','ŋ']                 => 'n',
    ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ','ŏ'] => 'o',
    ['œ']                                     => 'oek',
    ['ą']                                     => 'q',
    ['ŕ','ř','ŗ']                             => 'r',
    ['ś','š','ş','ŝ','ș']                     => 's',
    ['ť','ţ','ŧ','ț']                         => 't',
    ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų'] => 'u',
    ['ŵ']                                     => 'w',
    ['ý','ÿ','ŷ']                             => 'y',
    ['ž','ż','ź']                             => 'z'
  }

  # Tokenize the field value, then apply MAPPING to each token.
  def token_stream(field, string)
    return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
  end
end

And inserted this code at the end of environment.rb.
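As an aside, the intent of an array-keyed mapping like the one above can be sketched in plain Ruby, without Ferret. This is only an illustration of the expected result ("José" becoming "Jose"), not how MappingFilter is implemented. SAMPLE_MAPPING is a shortened stand-in for the table above, and flatten_mapping and fold_accents are hypothetical helper names introduced for this sketch.

```ruby
# -*- coding: utf-8 -*-
# Plain-Ruby illustration of what an array-keyed mapping is meant to do.
# This is NOT Ferret's MappingFilter; the helper names are hypothetical.

# Shortened stand-in for the MAPPING table in the post above.
SAMPLE_MAPPING = {
  ['à', 'á', 'â', 'ã', 'ä'] => 'a',
  ['ç']                     => 'c',
  ['è', 'é', 'ê', 'ë']      => 'e',
  ['ì', 'í', 'î', 'ï']      => 'i',
  ['ò', 'ó', 'ô', 'õ', 'ö'] => 'o',
  ['ù', 'ú', 'û', 'ü']      => 'u'
}.freeze

# Expand { ['à', 'á'] => 'a' } into { 'à' => 'a', 'á' => 'a' }.
# A key may be a single string or an array of strings.
def flatten_mapping(mapping)
  mapping.each_with_object({}) do |(keys, replacement), flat|
    Array(keys).each { |key| flat[key] = replacement }
  end
end

# Replace every mapped character in text with its replacement.
def fold_accents(text, mapping = SAMPLE_MAPPING)
  flat = flatten_mapping(mapping)
  pattern = Regexp.union(flat.keys)
  text.gsub(pattern) { |match| flat[match] }
end

puts fold_accents('José Antonio')
puts fold_accents('prejuízo')
```

If the analyzer were applied as intended, the indexed terms would be expected to look like the output of fold_accents (after whatever tokenizing and case handling the analyzer adds).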

In my model:

acts_as_ferret({ :fields => [ 'name' ] }, :analyzer =>
PortugueseAnalyzer.new)

But this did not work…

Can someone tell me what I did wrong ???

Thanks

Marcello


#2

On Wed, May 23, 2007 at 04:25:12AM +0200, Marcello parra wrote:

Hello,

I want to clean up accented characters in my index, using acts_as_ferret
in a Rails project. I searched this forum, and found the best solution
is to use an analyser.
I created something like this:

class PortugueseAnalyzer

Try inheriting your analyzer from Ferret::Analysis::Analyzer. That does not
seem to be necessary API-wise, but IMHO it should help.

Jens


Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
removed_email_address@domain.invalid | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa


#3

Try inheriting your analyzer from Ferret::Analysis::Analyzer. That does not
seem to be necessary API-wise, but IMHO it should help.

Jens

Thanks Jens.
I changed “class PortugueseAnalyzer” to
“class PortugueseAnalyzer < Ferret::Analysis::Analyzer”,
but that did not work either…

Did I put this in the right place ??

Thanks

Marcello


#4

On Wed, May 23, 2007 at 11:42:04AM +0200, Marcello parra wrote:

but that did not work either…

Did I put this in the right place ??

I think so. To help debug this, a small Ruby script reproducing the
exact problem would be cool.

Jens




#5

I think so. To help debug this, a small Ruby script reproducing the
exact problem would be cool.

Jens,

In the log, I get:

creating doc for class: Conta, id: 164
Adding field name with value 'José Antonio' to index

So the name is not being translated from UTF-8 to ASCII…
It’s the same output I get when I do not use the analyzer.

Thanks


#6

In the log, I get:

creating doc for class: Conta, id: 164
Adding field name with value 'José Antonio' to index

I included the word prejuízo, which should be translated to prejuizo.
I added some code to print information while the index is built. This is
what I get:

Analyzing: field:nome str:prejuízo
token["preju":0:5:1]
token["zo":7:9:1]

So the problem is that it breaks the word in two, right at the accented
character…

I guess the problem is in:

  def token_stream(field, string)
    return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
  end

But I can’t figure out how…


#7

On Wed, May 23, 2007 at 12:43:21PM +0200, Marcello parra wrote:

Analyzing: field:nome str:prejuízo
token["preju":0:5:1]
token["zo":7:9:1]

With the script at http://pastie.caboo.se/63808 I get:

token["prejuizo":0:9:1]

It seems that Ferret doesn’t recognize the í as a character and
therefore splits the word at this position.
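The offsets in the failing output (0:5 and 7:9) are consistent with byte positions in a UTF-8 string, where í takes two bytes. The following plain-Ruby sketch illustrates that reading. It does not use Ferret and is not the pastie script; ascii_only_tokens is a hypothetical stand-in for a tokenizer that treats only ASCII letters as word characters. It assumes a Ruby version with per-string encodings (1.9 or later).

```ruby
# -*- coding: utf-8 -*-
# Illustration only: a byte-oriented, ASCII-only "tokenizer" splits a
# UTF-8 word at the accented character. This is NOT Ferret's tokenizer.

WORD = "preju\u00EDzo" # "prejuízo", with a precomposed í (U+00ED)

# Return [text, start_byte, end_byte] for each run of ASCII letters,
# scanning the raw bytes of the string.
def ascii_only_tokens(text)
  bytes = text.b # binary copy, so offsets below are byte offsets
  tokens = []
  bytes.scan(/[A-Za-z]+/) do
    match = Regexp.last_match
    tokens << [match[0], match.begin(0), match.end(0)]
  end
  tokens
end

puts WORD.length   # => 8 (characters)
puts WORD.bytesize # => 9 (bytes; í is encoded as 0xC3 0xAD)
p ascii_only_tokens(WORD) # => [["preju", 0, 5], ["zo", 7, 9]]
```

The two bytes at positions 5 and 6 are not ASCII letters, so a tokenizer that is not working in UTF-8 sees them as a word boundary. The working output, with one token spanning bytes 0 to 9, fits the same byte-offset reading.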

You have to make sure that everything in your environment uses UTF-8
as its character encoding for this to work (locale settings are
especially relevant to Ferret).
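A quick way to inspect the relevant settings from Ruby might look like the sketch below. It assumes a Ruby version with the Encoding API (1.9 or later, which postdates this thread), and the helper names utf8_locale? and utf8_report are hypothetical. Which locale variables actually matter depends on the platform, so treat the list as a starting point rather than a complete check.

```ruby
# -*- coding: utf-8 -*-
# Sketch: report the locale and encoding settings visible to this Ruby
# process. Helper names are hypothetical; requires Ruby 1.9 or later.

LOCALE_VARS = %w[LC_ALL LC_CTYPE LANG].freeze

# True if a locale value such as "pt_BR.UTF-8" names a UTF-8 locale.
def utf8_locale?(value)
  value.to_s.match?(/utf-?8/i)
end

# Collect locale variables plus the encoding of a sample string.
def utf8_report(sample)
  report = {}
  LOCALE_VARS.each { |name| report[name] = ENV[name] }
  report['default_external'] = Encoding.default_external.name
  report['sample_encoding']  = sample.encoding.name
  report['sample_valid']     = sample.valid_encoding?
  report
end

utf8_report("preju\u00EDzo").each do |key, value|
  puts "#{key}: #{value.inspect}"
end
```

If none of the locale variables names a UTF-8 locale in the process that builds the index (for example the Rails server or a rake task), that would be the first thing to check.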

Jens

