Ferret DRB, UTF-8, Mongrel

I have spent days trying to figure out how to get UTF-8 working with my
site.

Here’s my environment:

Linux version 2.6.16.29-xen_3.0.3.0
Ruby 1.8.4 (2005-12-24) [i386-linux]
Rails 1.2.3
mongrel (1.0.1)
mongrel_cluster (1.0.2, 0.2.1)
ferret (0.11.4)
acts_as_ferret stable plugin
Ferret DRB server

When I don’t use an analyzer with my acts_as_ferret declaration,
everything works fine. However, I can’t expect users to enter “Álex
Rodríguez” when searching… they’re going to put “alex rodriguez” (or
some variation of his name, which I handle using a fuzzy search).

So I then pass an analyzer in my acts_as_ferret declaration:

acts_as_ferret({ :fields => { :first_name => { :store => :no },
                              :last_name  => { :store => :no },
                              :db_state   => { :index => :untokenized_omit_norms,
                                               :term_vector => :no } },
                 :remote => true },
               { :analyzer => UtfAnalyzer.new })

Here’s the analyzer I’m using… pretty much taken from here:
http://ferret.davebalmain.com/api/classes/Ferret/Analysis/MappingFilter.html


class UtfAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis

  # Maps accented lowercase characters to plain ASCII replacements.
  CHARACTER_MAPPINGS = {
    ['à','á','â','ã','ä','å','ā','ă']         => 'a',
    'æ'                                       => 'ae',
    ['ď','đ']                                 => 'd',
    ['ç','ć','č','ĉ','ċ']                     => 'c',
    ['è','é','ê','ë','ē','ę','ě','ĕ','ė']     => 'e',
    ['ƒ']                                     => 'f',
    ['ĝ','ğ','ġ','ģ']                         => 'g',
    ['ĥ','ħ']                                 => 'h',
    ['ì','ì','í','î','ï','ī','ĩ','ĭ']         => 'i',
    ['į','ı','ĳ','ĵ']                         => 'j',
    ['ķ','ĸ']                                 => 'k',
    ['ł','ľ','ĺ','ļ','ŀ']                     => 'l',
    ['ñ','ń','ň','ņ','ŉ','ŋ']                 => 'n',
    ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ','ŏ'] => 'o',
    ['œ']                                     => 'oek',
    ['ą']                                     => 'q',
    ['ŕ','ř','ŗ']                             => 'r',
    ['ś','š','ş','ŝ','ș']                     => 's',
    ['ť','ţ','ŧ','ț']                         => 't',
    ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų'] => 'u',
    ['ŵ']                                     => 'w',
    ['ý','ÿ','ŷ']                             => 'y',
    ['ž','ż','ź']                             => 'z'
  }

  # Tokenize the field text, then apply the character mappings to each token.
  def token_stream(field, str)
    MappingFilter.new(StandardTokenizer.new(str), CHARACTER_MAPPINGS)
  end
end

I think Ferret is working fine… because when I run some tests, the
mapping filter correctly replaces the accented characters… exactly as
it should.
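
For readers who want to check a mapping table without a Ferret index, here is a minimal plain-Ruby sketch of the same kind of folding. It does not use Ferret at all: `fold_accents`, `MAPPINGS`, `FLAT` and `PATTERN` are hypothetical names, and the table is a small subset of the one above. It also shows that a table holding only lowercase letters leaves an uppercase "Á" untouched, so such input would presumably need lowercasing before the mapping step.

```ruby
# Plain-Ruby illustration of folding accented characters with a mapping
# table. This is NOT Ferret's MappingFilter; fold_accents is a hypothetical
# helper and MAPPINGS is a shortened table.
MAPPINGS = {
  ['à', 'á', 'â', 'ä'] => 'a',
  ['è', 'é', 'ê', 'ë'] => 'e',
  ['ì', 'í', 'î', 'ï'] => 'i',
  'æ'                  => 'ae'
}

# Flatten { [chars] => replacement } into { char => replacement }.
FLAT = MAPPINGS.each_with_object({}) do |(keys, replacement), flat|
  Array(keys).each { |key| flat[key] = replacement }
end

# One regular expression that matches any character in the table.
PATTERN = Regexp.union(FLAT.keys)

def fold_accents(text)
  text.gsub(PATTERN) { |match| FLAT[match] }
end

puts fold_accents('Rodríguez')       # => Rodriguez
puts fold_accents('Álex')            # => Álex (uppercase Á is not in the table)
puts fold_accents('Álex'.downcase)   # => alex (Unicode downcase needs Ruby 2.4+)
```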

However, when something is persisted via the model (acts_as_ferret and
DRB server), I get unexpected behavior…

  • using a model with ONE field declared in acts_as_ferret, and a string
    with accented characters: I can search for it as expected, with either
    accented or non-accented characters, and I get results returned.
    However, ONLY the accented records come back; the non-accented
    records are never returned.

  • using a model with multiple fields declared (as in the Player model
    above): nothing gets returned, neither accented nor non-accented
    records, nor any combination.

My ferret_server.log file shows characters that are very different from
the accented characters I’m trying to search on…

Search entered in form: Álex Rodríguez
ferret_server.log: Ãlex rodríguez
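
The garbled log line has the shape that UTF-8 text takes when its bytes are read as ISO-8859-1. That may only be an artifact of how the log was written or viewed, not proof of a fault in the server. A small sketch that reproduces the pattern, assuming Ruby 1.9 or later (the encoding methods used here do not exist in the Ruby 1.8.4 listed above):

```ruby
# Reproduce the mojibake pattern: UTF-8 bytes interpreted as ISO-8859-1.
# Assumes Ruby 1.9+; String#force_encoding is not available in Ruby 1.8.
original = 'Álex Rodríguez'

# Relabel the UTF-8 bytes as ISO-8859-1, then transcode back to UTF-8 so
# that each original byte becomes its own (wrong) character.
misread = original.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts misread
# "Á" is the two bytes C3 81 in UTF-8; read as ISO-8859-1 they become
# "Ã" (U+00C3) followed by an invisible control character (U+0081).
p misread.codepoints.first(2)
```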

Not sure why this is occurring, but I’ve also redisplayed the submitted
text on a web page and it displays correctly. This leads me to believe
that Ruby/Rails is successfully getting the information, and that the HTML
page encoding is correct, along with environment variables, etc… As I
stated earlier, my Ferret test takes the string "Rodríguez" and returns
token["Rodriguez":0:10:1], demonstrating the UtfAnalyzer works fine
outside of acts_as_ferret…

So any help here would be much appreciated.

Thanks,

Brandon

Hi!

This is really strange - are you sure the DRb server runs in a proper
UTF-8 environment, just as your test cases do?

Jens

On Thu, Sep 20, 2007 at 08:01:48PM +0200, Brandon Kelly wrote:

http://rubyforge.org/mailman/listinfo/ferret-talk

Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

Thanks for the quick response Jens.

Okay – my problem apparently is that I’ve been deploying new code
(which stops and starts the ferret server), then I would go in and
delete the index. So the index gets recreated, but the DRB server
“remembers” the previous index, or settings, or whatever.

When I follow these steps, the index is created correctly, and the
analyzer works fine…

  1. deploy new code
  2. script/ferret_stop
  3. rm -rf index/production
  4. script/ferret_start

The key for me to remember is to stop the DRB server BEFORE deleting the
index.
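
The ordering in steps 2 to 4 can be sketched in plain Ruby. This is only an illustration under assumptions, not a Capistrano recipe: `reset_ferret_index` and its `runner` parameter are hypothetical, and the script and index paths simply mirror the ones named in the steps above.

```ruby
require 'fileutils'

# Hypothetical helper: stop the DRb server, delete the index, then start the
# server again. The runner is anything that responds to #call and returns a
# truthy value on success; it defaults to Kernel#system.
def reset_ferret_index(app_root, env = 'production', runner = method(:system))
  index_dir = File.join(app_root, 'index', env)

  # Stop the server BEFORE touching the index, and bail out if that fails.
  runner.call(File.join(app_root, 'script', 'ferret_stop')) or
    raise 'ferret_stop failed; index left untouched'

  FileUtils.rm_rf(index_dir)

  runner.call(File.join(app_root, 'script', 'ferret_start')) or
    raise 'ferret_start failed'

  index_dir
end

# Example call (commented out because it would run the real scripts):
# reset_ferret_index('/path/to/app')
```

Passing the runner in makes the ordering easy to check without starting or stopping a real server.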

I’ve created a simple capistrano recipe to handle this in the future.

Thanks again.

Brandon

On Fri, Sep 21, 2007 at 02:31:04AM +0200, Brandon Kelly wrote:

I’ve created a simple capistrano recipe to handle this in the future.

Cool. I usually put the index directory into shared/ and symlink it into
the current release during deployment. This saves you the index rebuild
after deploying.
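
A minimal sketch of that symlink step in plain Ruby, assuming a layout with a shared directory next to the release directories. `link_shared_index` and both path arguments are hypothetical names for illustration, not part of acts_as_ferret or Capistrano.

```ruby
require 'fileutils'

# Hypothetical helper: make <release>/index a symlink to <shared>/index so
# the index survives from one release to the next.
def link_shared_index(shared_path, release_path)
  shared_index  = File.join(shared_path, 'index')
  release_index = File.join(release_path, 'index')

  FileUtils.mkdir_p(shared_index)   # create the shared index dir on first deploy
  FileUtils.rm_rf(release_index)    # drop any index dir shipped with the release
  FileUtils.ln_s(shared_index, release_index)

  release_index
end

# Example call (paths are placeholders):
# link_shared_index('/path/to/shared', '/path/to/releases/20070921')
```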

Cheers,
Jens


Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

I do have the site set up this way.

My deploy script stops and starts the DRB server without touching the
index (which is what I want most of the time).

My problem arose when I needed to delete the index. I’d deploy new
code, DRB would restart with the old index in place… then I’d delete
the old index (while the DRB server was running)… and watch it rebuild.
The rebuilt index had problems. It wasn’t until later that I realized I
need to delete the index only when the DRB server isn’t running (at
least, that works for me).

Thanks again.

On Fri, Sep 21, 2007 at 04:10:37PM +0200, Brandon Kelly wrote:

works for me).
Yes, deleting the index while the server is running isn’t a good idea.
You may also run Model.rebuild_index from a script after deployment to
rebuild the index, or even create a rebuild_index deployment recipe via
Capistrano.

cheers,
Jens
