These are very good questions indeed. I’m afraid I don’t have the
answers, but I’d like to add some questions and remarks of my own in
the hope that someone will eventually provide some insight.
On 02.11.2006, at 23:57, Chris Gansen wrote:
doc = Document.import_from_xml(filename)
Ferret::locale = doc.locale_id # locale_id is "en.UTF-8" or "fr.UTF-8", for example
I don’t think setting the locale has any effect on already created
StemFilters and StopFilters, so the above code doesn’t change anything.
According to the docs, the locale setting doesn’t even affect the
default stop words or stemming algorithm used when creating a new
StopFilter or StemFilter, respectively. The default language is
English in both cases, no matter what the current locale is.
This leads me to the ultimate question: What is the locale setting
good for anyway? Could it be that only the character encoding portion
of the locale string is actually relevant?
What’s the best way to handle the import of data, where locale is
changing from document to document? What other considerations
should I keep in mind when using Ferret across multiple locales?
From what I have observed, you’ll need to create different Analyzers
with a StemFilter and StopFilter explicitly created for the
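
A minimal sketch of how the per-language choice could be driven by each document's locale_id. This is plain Ruby and makes no Ferret calls: LANGUAGE_SETTINGS, parse_locale_id and settings_for are hypothetical names, and the stemmer and stop-word values are placeholders for whatever you would actually hand to the StemFilter and StopFilter you create for that language.

```ruby
# Hypothetical per-language settings. The values are placeholders for
# illustration, not Ferret constants.
LANGUAGE_SETTINGS = {
  "en" => { stemmer: "english", stop_words: :english },
  "fr" => { stemmer: "french",  stop_words: :french },
  "de" => { stemmer: "german",  stop_words: :german }
}.freeze

# Split a locale id such as "fr.UTF-8" into its language and encoding parts.
# The encoding falls back to "UTF-8" when the id has no "." in it.
def parse_locale_id(locale_id)
  language, encoding = locale_id.to_s.split(".", 2)
  [language, encoding || "UTF-8"]
end

# Look up the settings for a locale id, falling back to English for
# languages that are not listed above.
def settings_for(locale_id)
  language, encoding = parse_locale_id(locale_id)
  base = LANGUAGE_SETTINGS.fetch(language) { LANGUAGE_SETTINGS.fetch("en") }
  base.merge(encoding: encoding)
end

settings = settings_for("fr.UTF-8")
puts settings[:stemmer]
puts settings[:encoding]
```

With something like this you could build one analyzer per language up front and pick the right one for each imported document, instead of relying on the global locale.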
I don’t know about French, but the German stemming algorithm is very
inaccurate. Stemming algorithms for English are probably easier to
implement, since German and French have more complex rules and lots of
exceptions. But even the English stemming algorithm seems to be
entirely rule-based and thus fails on irregular verbs. I think it
might be a good idea to provide a facility to extend the stemmer,
much as the inflection rules can be extended in Rails.
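
To illustrate what I mean, here is a rough sketch in plain Ruby. Nothing in it is Ferret API: ExtensibleStemmer and NAIVE_SUFFIX_STEMMER are made-up names, and the fallback is a toy suffix stripper standing in for the real rule-based stemmer. The idea is simply that registered irregular forms are checked first and everything else goes through the rules.

```ruby
# Hypothetical wrapper: looks up registered irregular forms first and
# falls back to a rule-based stemmer for everything else.
class ExtensibleStemmer
  def initialize(&fallback)
    @fallback = fallback
    @irregular = {}
  end

  # Register an irregular form and its stem. Returns self so that
  # calls can be chained.
  def irregular(form, stem)
    @irregular[form.downcase] = stem
    self
  end

  # Stem a word: exceptions win, otherwise the fallback rules apply.
  def stem(word)
    key = word.downcase
    @irregular.fetch(key) { @fallback.call(key) }
  end
end

# Toy fallback for illustration only (this is NOT a real stemming
# algorithm): strips one of a few common English suffixes.
NAIVE_SUFFIX_STEMMER = lambda do |word|
  word.sub(/(ing|ed|s)\z/, "")
end

stemmer = ExtensibleStemmer.new(&NAIVE_SUFFIX_STEMMER)
stemmer.irregular("went", "go").irregular("mice", "mouse")

puts stemmer.stem("went")
puts stemmer.stem("walked")
```

The exception table is per instance, so each language could carry its own list of irregular forms.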