Wildcard searches with German umlauts

i just noticed a weird problem.

i can successfully search with full terms like
“Flächendesinfektionsstufen” or “Regionalanästhesie” for example and get
correct hits.

but when i search for those entries with wildcards
“Flächendesinfektion*” or “Regionalanäs*” it won’t find anything

“*chendesinfektionsstufen” or “*sthesie” works.

so any wildcarded search that contains any umlauts will fail.

anyone any idea or hint what the problem could be?

On Mon, Oct 08, 2007 at 06:15:08PM +0200, neongrau __ wrote:

> “*chendesinfektionsstufen” or “*sthesie” works.
>
> so any wildcarded search that contains any umlauts will fail.
>
> anyone any idea or hint what the problem could be?

I tried to reproduce this, but it worked for me:


Flächendesinfektion* : 1 hit(s)
Regionalanäs* : 1 hit(s)
*chendesinfektionsstufen : 1 hit(s)
*sthesie : 1 hit(s)


Jens Krämer

just found out that the case sensitive umlaut problem is in fact a bug
in ferret.

i reported it already → http://ferret.davebalmain.com/trac/ticket/326

really weird. was some problem within my rails app that caused this.
i have no idea what it was but i got it working again now.

but in the process i noticed another problem *sigh*

Ferret::Analysis::LowerCaseFilter in the GermanStemmingAnalyzer doesn't
seem to work on the umlauts, so i can't find “Übersicht”
with the query “übersicht”.

from the documentation i read that “the current locale” is used.
from googling i couldn’t find any info about what setting is needed. is
it “Ferret.locale”?

what locale do i need to set (on a windows box) with the index running
in ISO-8859-1 ?

i tried
Ferret.locale = 'de_DE.ISO8859-1'
but that didn’t help. :(

this is the GermanStemmingAnalyzer i’m using:

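class GermanStemmingAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis
  def initialize(stop_words = FULL_GERMAN_STOP_WORDS)
    @stop_words = stop_words
  end
  # start of the filter chain was cut off in the post; the tokenizer and
  # LowerCaseFilter are assumed from the description in this thread
  def token_stream(field, str)
    StemFilter.new(StopFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)),
      @stop_words), 'de', "ISO_8859_1")
  end
end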

or maybe not a bug :S

so back to zero :(

require 'rubygems'
require 'ferret'

Ferret.locale = '' # "de_DE.iso88591"

i = Ferret::I.new

i << 'Übersicht'
i << 'übersicht'

for q in [ 'Übersicht', 'übersicht', 'Über*', 'über*', '*bersicht' ]
  puts "#{q} : #{i.search(q).total_hits} hit(s)"
end

with an empty locale in the test script it’ll work in the new version as

but in my rails app the aaf generated index will have broken umlauts
with an empty Ferret.locale.
e.g. the word “Übersicht” in the index shows this behavior when queried:
“Übersicht” = hit
“übersicht” = hit
“Übers*” = no hit
“übers*” = no hit
“bersicht” = hit (?!?!)
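
One way to read the table above, offered only as a guess that nothing in this thread confirms: if the tokenizer, running without a usable locale, treats the ISO-8859-1 bytes for Ü/ü as non-letters, the indexed term would be "bersicht". Plain queries would be cut down the same way and still hit, while wildcard queries (assumed here to skip the analyzer) would keep their umlaut and miss. The plain-Ruby sketch below models only that assumption; `ascii_tokens` and `hit?` are made-up helpers, not Ferret API.

```ruby
# Hypothetical model, not Ferret code: a tokenizer that only knows ASCII letters.
def ascii_tokens(text)
  text.scan(/[A-Za-z]+/)
end

# The umlaut is dropped at index time, leaving only the ASCII remainder.
INDEX_TERMS = ascii_tokens("Übersicht")

# Assumption: plain queries go through the same tokenizer, wildcard queries
# bypass it and are compared as a raw prefix against the indexed terms.
def hit?(query)
  if query.end_with?("*")
    prefix = query.chomp("*")
    INDEX_TERMS.any? { |term| term.start_with?(prefix) }
  else
    (ascii_tokens(query) & INDEX_TERMS).any?
  end
end

["Übersicht", "übersicht", "Übers*", "übers*", "bersicht"].each do |q|
  puts "#{q} = #{hit?(q) ? 'hit' : 'no hit'}"
end
```

Under these assumptions the sketch gives the same hit/no hit pattern as the list above, including the odd hit for "bersicht". That makes the tokenizer worth checking, but it does not prove this is what happens inside the index.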

with a locale set to “de_DE.iso88591” the umlauts seem correct but case

“Übersicht” = hit
“übersicht” = no hit
“Übers*” = hit
“ÜBERSICHT” = hit
“üBERSICHT” = no hit
“ÜBERsi*” = hit
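
These results are what you would expect if the lowercasing step only touches A-Z and leaves Ü alone. The plain-Ruby sketch below models that idea with `String#downcase(:ascii)` (Ruby 2.4 or later). It is an illustration of the hypothesis, not Ferret's implementation: `ascii_lowercase` and `matches?` are invented names, stemming is ignored, and wildcard queries are assumed to be lowercased the same way as plain ones.

```ruby
# Hypothetical model, not Ferret code: lowercasing that only converts A-Z,
# so an uppercase umlaut passes through unchanged.
def ascii_lowercase(text)
  text.downcase(:ascii)
end

# The indexed term keeps its uppercase Ü because the filter never touches it.
INDEXED = ascii_lowercase("Übersicht")

# Lowercase the query with the same ASCII-only rule, then compare either
# the whole term or, for a trailing wildcard, just the prefix.
def matches?(query)
  q = ascii_lowercase(query)
  if q.end_with?("*")
    INDEXED.start_with?(q.chomp("*"))
  else
    INDEXED == q
  end
end

["Übersicht", "übersicht", "Übers*", "ÜBERSICHT", "üBERSICHT", "ÜBERsi*"].each do |q|
  puts "#{q} = #{matches?(q) ? 'hit' : 'no hit'}"
end
```

For comparison, Ruby's default `"Übersicht".downcase` does map Ü to ü, which is the behavior the lowercase filter would need for the query “übersicht” to match.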

i simplified my model a bit to speed up the 200 index rebuilds i’ve done
the last days:

acts_as_ferret( { :fields => [ :title ], :remote => true }, {
  :analyzer => GermanStemmingAnalyzer.new } )

def title
  Iconv.new('ISO-8859-1', 'UTF-8').iconv(self.xstrtitle.to_s)
end
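
Iconv.new takes the target encoding first and the source second, so the title method hands ISO-8859-1 bytes to the indexer. Iconv is no longer part of Ruby's standard library (it was removed in 2.0), so here is a rough equivalent with `String#encode` for checking what bytes actually arrive; `to_latin1` is a made-up helper name.

```ruby
# Hypothetical helper mirroring Iconv.new('ISO-8859-1', 'UTF-8').iconv(str):
# transcode a UTF-8 string into ISO-8859-1.
def to_latin1(utf8_text)
  utf8_text.to_s.encode("ISO-8859-1", "UTF-8")
end

title = to_latin1("Übersicht")

puts title.encoding        # ISO-8859-1
puts title.bytes.first     # 220, i.e. 0xDC, the single Latin-1 byte for Ü
puts "Übersicht".bytesize  # 10 bytes in UTF-8 (Ü takes two bytes)
puts title.bytesize        # 9 bytes in ISO-8859-1
```

If the first byte of a title that starts with Ü is not 220 after conversion, the string was probably not UTF-8 to begin with, and the analyzer's "ISO_8859_1" setting would not match the data.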

here are a couple of terms from the index:


the lowercase umlauts seem to be properly processed by the lowercase
filter through the stemming analyzer, just the four terms on the end
that start with uppercase umlauts are unprocessed :(

any idea? i can’t think of anything else i could try (except solr) :(