Forum: Ferret Accented characters

Marcello parra (Guest)
on 2007-05-23 06:25
Hello,

I want to clean up accented characters in my index, using acts_as_ferret
in a Rails project. I searched this forum, and found the best solution
is to use an analyser.
I created something like this:

class PortugueseAnalyzer
  include Ferret::Analysis
  MAPPING = {
        ['à','á','â','ã','ä','å','ā','ă']         => 'a',
             'æ'                                       => 'ae',
             ['ď','đ']                                 => 'd',
             ['ç','ć','č','ĉ','ċ']                     => 'c',
             ['è','é','ê','ë','ē','ę','ě','ĕ','ė',]    => 'e',
             ['ƒ']                                     => 'f',
             ['ĝ','ğ','ġ','ģ']                         => 'g',
             ['ĥ','ħ']                                 => 'h',
             ['ì','ì','í','î','ï','ī','ĩ','ĭ']         => 'i',
             ['į','ı','ĳ','ĵ']                         => 'j',
             ['ķ','ĸ']                                 => 'k',
             ['ł','ľ','ĺ','ļ','ŀ']                     => 'l',
             ['ñ','ń','ň','ņ','ŉ','ŋ']                 => 'n',
             ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ','ŏ'] => 'o',
             ['œ']                                     => 'oek',
             ['ą']                                     => 'q',
             ['ŕ','ř','ŗ']                             => 'r',
             ['ś','š','ş','ŝ','ș']                     => 's',
             ['ť','ţ','ŧ','ț']                         => 't',
             ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų'] => 'u',
             ['ŵ']                                     => 'w',
             ['ý','ÿ','ŷ']                             => 'y',
             ['ž','ż','ź']                             => 'z'
      }
  def token_stream(field, string)
    return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
  end
end


And inserted this code at the end of environment.rb.



In my model:

acts_as_ferret({ :fields => [ 'name' ] }, :analyzer =>
PortugueseAnalyzer.new)



But this did not work.

Can someone tell me what I did wrong?

Thanks


Marcello
Jens K. (Guest)
on 2007-05-23 11:53
(Received via mailing list)
On Wed, May 23, 2007 at 04:25:12AM +0200, Marcello parra wrote:
> Hello,
>
> I want to clean up accented characters in my index, using acts_as_ferret
> in a Rails project. I searched this forum, and found the best solution
> is to use an analyser.
> I created somthing like this:
>
> class PortugueseAnalyzer

Try inheriting your analyzer from Ferret::Analysis::Analyzer. Does not
seem to be necessary API-wise, but imho this should help.

Jens


--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
removed_email_address@domain.invalid | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
Marcello parra (Guest)
on 2007-05-23 13:42
> Try inheriting your analyzer from Ferret::Analysis::Analyzer. Does not
> seem to be necessary API-wise, but imho this should help.
>
> Jens
>


Thanks Jens.
I changed "class PortugueseAnalyzer" to
"class PortugueseAnalyzer < Ferret::Analysis::Analyzer",
but that did not work either.

Did I put this in the right place ??

Thanks

Marcello
Jens K. (Guest)
on 2007-05-23 13:46
(Received via mailing list)
On Wed, May 23, 2007 at 11:42:04AM +0200, Marcello parra wrote:
> but did not work also....
>
> Did I put this in the right place ??

I think so. To help debug this, a small Ruby script reproducing the
exact problem would be cool.

Jens


Marcello parra (Guest)
on 2007-05-23 14:21
> I think so. To help debugging this a small ruby skript reproducing the
> exact problem would be cool.
>

Jens,

In the log, I get:

creating doc for class: Conta, id: 164
Adding field name with value 'José Antonio' to index


So the name is not being translated from UTF-8 to ASCII.
The output is the same as when I do not use the analyzer.
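
As a sanity check that does not involve Ferret at all, the substitution the table is meant to perform can be sketched in plain Ruby. This is only an illustration: SAMPLE_MAPPING is a small subset of the table from the first post, and flatten_mapping and strip_accents are hypothetical helper names, not part of Ferret or acts_as_ferret.

```ruby
# Small subset of the mapping table from the first post (illustration only).
SAMPLE_MAPPING = {
  ['à', 'á', 'â', 'ã'] => 'a',
  ['ç']                => 'c',
  ['é', 'ê']           => 'e',
  ['í']                => 'i',
  ['ó', 'ô', 'õ']      => 'o',
  ['ú', 'ü']           => 'u'
}.freeze

# Turn { [chars] => replacement } into { char => replacement }.
# Array() also accepts a single string key such as 'æ'.
def flatten_mapping(mapping)
  mapping.each_with_object({}) do |(keys, replacement), flat|
    Array(keys).each { |key| flat[key] = replacement }
  end
end

# Replace every mapped character in text; unmapped characters pass through.
def strip_accents(text, mapping = SAMPLE_MAPPING)
  flat = flatten_mapping(mapping)
  text.gsub(Regexp.union(flat.keys)) { |match| flat[match] }
end

puts strip_accents("José Antonio")
puts strip_accents("prejuízo")
```

If this produces the expected ASCII text while the index still contains the accented form, the table itself is probably fine and the problem lies in how the analyzer is wired up or in the environment. Note that the sketch, like the table in the post, only covers lowercase characters.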

Thanks
Marcello parra (Guest)
on 2007-05-23 14:43
> In the log, I get:
>
> creating doc for class: Conta, id: 164
> Adding field name with value 'José Antonio' to index


I included the word prejuízo, which should be translated to prejuizo.
I added some code to output information while the index is built. This is
what I get:

Analyzing: field:nome  str:prejuízo
token["preju":0:5:1]
token["zo":7:9:1]


So the problem is that it breaks the word in two, right at the accented
character.

I guess the problem is in:

def token_stream(field, string)
    return MappingFilter.new(StandardTokenizer.new(string), MAPPING)
end


But I can't figure out how.
Jens K. (Guest)
on 2007-05-23 15:38
(Received via mailing list)
On Wed, May 23, 2007 at 12:43:21PM +0200, Marcello parra wrote:
> Analyzing: field:nome  str:prejuízo
> token["preju":0:5:1]
> token["zo":7:9:1]

With the script at http://pastie.caboo.se/63808 I get:

token["prejuizo":0:9:1]

It seems that Ferret doesn't recognize the í as a character and
therefore splits the word at this position.
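
The offsets in the two outputs fit this explanation: in UTF-8 the í occupies two bytes (offsets 5 and 6), which is exactly the gap between "preju" (0:5) and "zo" (7:9). The plain-Ruby sketch below imitates both behaviours. It is not Ferret's tokenizer, and byte_tokens and utf8_tokens are hypothetical names used only for illustration.

```ruby
WORD = "prejuízo"

# Byte-oriented scan: the string is treated as raw bytes, so any byte
# outside ASCII letters/digits ends a token.
# Returns [text, start_byte, end_byte] triples.
def byte_tokens(text)
  tokens = []
  text.b.scan(/[A-Za-z0-9]+/) do
    match = Regexp.last_match
    tokens << [match[0], match.begin(0), match.end(0)]
  end
  tokens
end

# UTF-8-aware scan: \p{L} matches any Unicode letter, so "í" stays
# inside the word. Offsets are still reported in bytes.
def utf8_tokens(text)
  tokens = []
  text.scan(/[\p{L}\p{N}]+/) do
    match = Regexp.last_match
    start_byte = match.pre_match.bytesize
    tokens << [match[0], start_byte, start_byte + match[0].bytesize]
  end
  tokens
end

p byte_tokens(WORD)
p utf8_tokens(WORD)
```

The byte-oriented scan reproduces the split at offsets 0:5 and 7:9, while the UTF-8-aware scan yields a single token spanning 0:9, which a mapping step could then turn into "prejuizo".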

You have to make sure that everything in your environment is using UTF-8
as the character encoding for this to work (locale settings are especially
relevant to Ferret).
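
One quick way to check the locale part is to look at the environment variables the process was started with. The sketch below assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG); effective_ctype_locale and utf8_locale? are hypothetical helper names, and the check says nothing about how Ferret itself reads the locale.

```ruby
# Variables consulted for the character-type locale, highest priority first.
LOCALE_VARS = %w[LC_ALL LC_CTYPE LANG].freeze

# Return the first non-empty locale setting, or nil if none is set.
# env defaults to the real environment but accepts any Hash-like object.
def effective_ctype_locale(env = ENV)
  LOCALE_VARS.each do |name|
    value = env[name]
    return value unless value.nil? || value.empty?
  end
  nil
end

# True when the effective locale names UTF-8 (e.g. "pt_BR.UTF-8" or "en_US.utf8").
def utf8_locale?(env = ENV)
  locale = effective_ctype_locale(env)
  !locale.nil? && locale.match?(/utf-?8/i)
end

puts "effective locale: #{effective_ctype_locale.inspect}"
puts "UTF-8 locale?     #{utf8_locale?}"
```

Running this from the same environment that starts the Rails application (for example at the end of environment.rb) shows whether the server process sees a UTF-8 locale, which may differ from an interactive shell.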

Jens
