Indexing and searching across multiple locales

Hi -

I’m currently investigating support for Ferret and content that spans
multiple locales. I am particularly interested in using stemming and
fuzzy searches (e.g. with a slop factor) across multiple locales.

So far I’ve followed the online docs for implementing a stemming
analyzer, and it is working for English terms just fine. I’ve also
written a method to import data from the legacy XML files and save it
as ActiveRecord objects (using AAF). However, I’m not certain the
locale-switching is working properly:

doc = Document.import_from_xml(filename)
Ferret::locale = doc.locale_id   # locale_id is "en.UTF-8" or "fr.UTF-8", for example
doc.save

What’s the best way to handle the import of data, where locale is
changing from document to document? What other considerations should I
keep in mind when using Ferret across multiple locales?

Thanks for any tips!
–chris

These are very good questions indeed. I’m afraid I don’t have the
answers but I’d like to add some questions and remarks of my own and
hope someone will eventually provide some insight.

On 02.11.2006, at 23:57, Chris Gansen wrote:

doc = Document.import_from_xml(filename)
Ferret::locale = doc.locale_id   # locale_id is "en.UTF-8" or "fr.UTF-8", for example
doc.save

I don’t think setting the locale has any effect on already created
StemFilters and StopFilters, so the above code doesn’t change anything.

According to the docs the locale setting doesn’t even affect the
default stop words or stemming algorithms used when creating a new
StopFilter or StemFilter, respectively. The default language is
English in both cases, no matter what the current locale is.

This leads me to the ultimate question: What is the locale setting
good for anyway? Could it be that only the character encoding portion
of the locale string is actually relevant?

What’s the best way to handle the import of data, where locale is
changing from document to document? What other considerations
should I keep in mind when using Ferret across multiple locales?

From what I have observed, you’ll need to create different Analyzers
with a StemFilter and StopFilter explicitly created for the
respective locale.

I don’t know about French but the German stemming algorithm is very
inaccurate. Stemming algorithms for the English language are probably
easier to implement, since German and French have more complex rules
and lots of exceptions. But even the English stemming algorithm seems
to be entirely rule-based and thus fails on irregular verbs. I think
it might be a good idea to provide a facility to extend the stemmer,
very much like the inflection rules can be extended in Rails.
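
Until something like that exists, one workaround is a small
pre-stemming filter that maps irregular forms to their base word before
the rule-based stemmer sees them. The class below is hypothetical (not
part of Ferret) and only duck-typed against Ferret's token-stream
protocol, where next returns a token with a text accessor, or nil at
the end:

```ruby
# Hypothetical filter: rewrites irregular forms via a hand-maintained
# table, so the downstream StemFilter only ever sees regular words.
# Slot it between the tokenizer and the StemFilter in token_stream.
class IrregularFormFilter
  def initialize(token_stream, mapping)
    @ts, @mapping = token_stream, mapping
  end

  def next
    token = @ts.next
    return nil if token.nil?
    token.text = @mapping[token.text] if @mapping.key?(token.text)
    token
  end
end
```

For English irregular verbs the mapping would hold entries like
'went' => 'go'; for German one could start with the strong verbs.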

Cheers,
Andy

On 11/3/06, Andreas K. [email protected] wrote:

These are very good questions indeed. I’m afraid I don’t have the
answers but I’d like to add some questions and remarks of my own and
hope someone will eventually provide some insight.

Thanks for the response. I guess my real question is: how have other
people handled indexing data across many locales? What works and what
doesn’t? From my initial work, the basic indexing works across
languages; however, it’s the “fun” stuff like stemming and fuzzy
searches that I am particularly interested in.

Any pointers are appreciated.
–chris

Chris Gansen schrieb:

Thanks for the response. I guess my real question is: how have other
people handled indexing data across many locales? What works and what
doesn’t? From my initial work, the basic indexing works across
languages; however, it’s the “fun” stuff like stemming and fuzzy
searches that I am particularly interested in.

Hey Chris,

I store content in different languages in different fields. I have an
object that has content in de/pl/en, and I have fields content_de,
content_en and content_pl for that object. Now I can implement a
per-field analyzer to stem each field in its locale.
This might not exactly match your example, as this is really one
db object with different translations attached to it, not different
objects in different languages.

Ben