Accented characters

Marcello_parra · May 23, 2007, 4:25am

Hello,

I want to strip accented characters from my index, using acts_as_ferret
in a Rails project. I searched this forum and found that the best
solution is to use an analyzer.
I created something like this:

class PortugueseAnalyzer
  include Ferret::Analysis

  MAPPING = {
    ['à','á','â','ã','ä','å','ā','ă']         => 'a',
    'æ'                                       => 'ae',
    ['ď','đ']                                 => 'd',
    ['ç','ć','č','ĉ','ċ']                     => 'c',
    ['è','é','ê','ë','ē','ę','ě','ĕ','ė']     => 'e',
    ['ƒ']                                     => 'f',
    ['ĝ','ğ','ġ','ģ']                         => 'g',
    ['ĥ','ħ']                                 => 'h',
    ['ì','ì','í','î','ï','ī','ĩ','ĭ']         => 'i',
    ['į','ı','ĳ','ĵ']                         => 'j',
    ['ķ','ĸ']                                 => 'k',
    ['ł','ľ','ĺ','ļ','ŀ']                     => 'l',
    ['ñ','ń','ň','ņ','ŉ','ŋ']                 => 'n',
    ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ','ŏ'] => 'o',
    ['œ']                                     => 'oek',
    ['ą']                                     => 'q',
    ['ŕ','ř','ŗ']                             => 'r',
    ['ś','š','ş','ŝ','ș']                     => 's',
    ['ť','ţ','ŧ','ț']                         => 't',
    ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų'] => 'u',
    ['ŵ']                                     => 'w',
    ['ý','ÿ','ŷ']                             => 'y',
    ['ž','ż','ź']                             => 'z'
  }

  def token_stream(field, string)
    return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
  end
end

I inserted this code at the end of environment.rb.

In my model:

acts_as_ferret({ :fields => ['name'] }, :analyzer => PortugueseAnalyzer.new)

But this did not work…

Can someone tell me what I did wrong?

Thanks

Marcello
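
A side note for anyone who wants to check the mapping on its own: the folding can be tried in plain Ruby without Ferret. This is only a sketch, not how Ferret implements MappingFilter. The names FOLD, flatten_mapping and fold_accents are made up for the illustration, the table is a small subset of the one above, and it assumes Ruby 1.9 or later with a UTF-8 source file.

```ruby
# encoding: utf-8
# Plain-Ruby illustration of accent folding; this does not use Ferret.

# A small subset of the mapping, in the same shape as MAPPING above:
# a key is either a single string or an array of strings.
FOLD = {
  ['à', 'á', 'â', 'ã'] => 'a',
  ['ç']                => 'c',
  ['é', 'ê']           => 'e',
  ['í']                => 'i',
  ['ó', 'ô', 'õ']      => 'o',
  ['ú', 'ü']           => 'u',
  'æ'                  => 'ae'
}

# Expand array keys so every source string maps directly to its replacement.
def flatten_mapping(mapping)
  flat = {}
  mapping.each do |from, to|
    Array(from).each { |s| flat[s] = to }
  end
  flat
end

# Replace every mapped character in +text+ with its ASCII replacement.
def fold_accents(text, mapping = FOLD)
  flat = flatten_mapping(mapping)
  pattern = Regexp.union(flat.keys)
  text.gsub(pattern) { |match| flat[match] }
end

puts fold_accents('prejuízo')      # prints "prejuizo"
puts fold_accents('José Antonio')  # prints "Jose Antonio"
```

If this works in a console but the index still contains accented or split terms, the mapping table is probably fine and the problem lies earlier, in how the text reaches the tokenizer.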

Marcello_parra · May 23, 2007, 9:53am

On Wed, May 23, 2007 at 04:25:12AM +0200, Marcello parra wrote:

class PortugueseAnalyzer

Try inheriting your analyzer from Ferret::Analysis::Analyzer. It does not
seem to be necessary API-wise, but IMHO it should help.

Jens

–
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
[email protected] | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

Marcello_parra · May 23, 2007, 11:42am

Thanks Jens.
I changed “class PortugueseAnalyzer” to
“class PortugueseAnalyzer < Ferret::Analysis::Analyzer”,
but that did not work either…

Did I put this in the right place?

Thanks

Marcello

Marcello_parra · May 23, 2007, 11:46am

On Wed, May 23, 2007 at 11:42:04AM +0200, Marcello parra wrote:

but that did not work either…

Did I put this in the right place?

I think so. To help debug this, a small Ruby script reproducing the
exact problem would be useful.

Jens


Marcello_parra · May 23, 2007, 12:21pm

Jens,

In the log, I get:

creating doc for class: Conta, id: 164
Adding field name with value 'José Antonio' to index

So the name is not being translated from UTF-8 to ASCII…
It's the same output I get when I don't use the analyzer.

Thanks

Marcello_parra · May 23, 2007, 12:43pm

I included the word prejuízo, which should be translated to prejuizo.
I added some code to print information while the index is being built.
This is what I get:

Analyzing: field:nome str:prejuízo
token["preju":0:5:1]
token["zo":7:9:1]

So the problem is that it breaks the word in two, right at the accented
character…

I guess the problem is in:

def token_stream(field, string)
  return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
end

But I can't figure out how…
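
The offsets in that output are worth a second look. If they are byte offsets into a UTF-8 string (which is what they appear to be), the gap between 5 and 7 is exactly the two bytes of the í. Here is a plain-Ruby check, assuming Ruby 1.9 or later and a UTF-8 source file; byte_range_of is a made-up helper, not part of Ferret:

```ruby
# encoding: utf-8
# Where does "í" sit in the UTF-8 bytes of "prejuízo"? (No Ferret involved.)

# Return the [start, stop) byte offsets of the first occurrence of +char+
# in +text+, or nil if +char+ does not occur.
def byte_range_of(text, char)
  char_index = text.index(char)
  return nil if char_index.nil?
  start = text[0, char_index].bytesize
  [start, start + char.bytesize]
end

word = 'prejuízo'
puts word.length               # 8 characters
puts word.bytesize             # 9 bytes
p byte_range_of(word, 'í')     # [5, 7]
p word.bytes.to_a[5, 2]        # [195, 173], i.e. 0xC3 0xAD
```

So "preju" covers bytes 0 to 5 and "zo" covers bytes 7 to 9, which matches the two tokens above: the two bytes of í were skipped rather than treated as a single letter.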

Marcello_parra · May 23, 2007, 1:38pm

On Wed, May 23, 2007 at 12:43:21PM +0200, Marcello parra wrote:

Analyzing: field:nome str:prejuízo
token["preju":0:5:1]
token["zo":7:9:1]

With the script at [link] I get:

token["prejuizo":0:9:1]

It seems that Ferret doesn't recognize the í as a letter in your setup
and therefore splits the word at that position.

You have to make sure that everything in your environment uses UTF-8
as its character encoding for this to work (the locale settings are
especially relevant to Ferret).

Jens
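
One quick way to see which locale a process would pick up is to look at the usual POSIX variables in order of precedence (LC_ALL, then LC_CTYPE, then LANG). The sketch below does only that. The names effective_ctype and utf8_locale? are made up for the illustration, and it does not ask Ferret itself which locale it ended up with.

```ruby
# Inspect the locale-related environment variables (plain Ruby, no Ferret).

# POSIX precedence for character classification: LC_ALL, then LC_CTYPE,
# then LANG. Empty values are treated as unset.
def effective_ctype(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return value unless value.nil? || value.empty?
  end
  nil
end

# True if the locale name carries a UTF-8 codeset,
# e.g. "pt_BR.UTF-8" or "en_US.utf8".
def utf8_locale?(locale)
  !!(locale && locale =~ /\.utf-?8(@.*)?\z/i)
end

locale = effective_ctype
puts "effective locale: #{locale.inspect}"
puts "UTF-8 locale?     #{utf8_locale?(locale)}"
```

Run it from the same process that builds the index (for example a Rails console or the server start script), since that process may not inherit the settings of your shell. A result such as "C" or "pt_BR.ISO-8859-1" would fit the split-word symptom described above.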
