Wildcard searches with German umlauts

I just noticed a weird problem.

I can successfully search with full terms like
“Flächendesinfektionsstufen” or “Regionalanästhesie” and get
correct hits.

But when I search for those entries with wildcards,
“Flächendesinfektion*” or “Regionalanäs*”, nothing is found,

while
“*chendesinfektionsstufen” or “*sthesie” works.

So any wildcard search that contains an umlaut fails.

Does anyone have an idea or hint what the problem could be?
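For what it's worth, a common cause of exactly this symptom in Lucene-style engines is that wildcard terms are not run through the analyzer, while indexed terms are. A minimal plain-Ruby sketch of that mismatch (no Ferret involved; `index_term` is a hypothetical stand-in for an analyzer chain that lowercases and folds umlauts):

```ruby
# Plain Ruby, no Ferret. index_term is a hypothetical stand-in for an
# analyzer chain that lowercases and folds umlauts at index time.
def index_term(t)
  t.downcase.tr('äöü', 'aou')
end

stored = index_term("Flächendesinfektionsstufen")
# => "flachendesinfektionsstufen"

# Wildcard query terms are typically NOT analyzed, so the raw prefix
# no longer matches the stored term:
raw_prefix = "Flächendesinfektion"
puts stored.start_with?(raw_prefix)              # false: the wildcard misses
puts stored.start_with?(index_term(raw_prefix))  # true once the prefix is analyzed too
```

Under that assumption the suffix wildcards would still hit (the folded term still ends in “sthesie”), which matches the behavior described above.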

On Mon, Oct 08, 2007 at 06:15:08PM +0200, neongrau __ wrote:

> “*chendesinfektionsstufen” or “*sthesie” works.
>
> so any wildcarded search that contains any umlauts will fail.
>
> anyone any idea or hint what the problem could be?

I tried to reproduce this, but it worked for me:
http://pastie.caboo.se/105218

yields:

Flächendesinfektion* : 1 hit(s)
Regionalanäs* : 1 hit(s)
*chendesinfektionsstufen : 1 hit(s)
*sthesie : 1 hit(s)

Jens


Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
[email protected] | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

I just found out that the case-sensitive umlaut problem is in fact a bug
in Ferret.

I reported it already → http://ferret.davebalmain.com/trac/ticket/326

Really weird. It was some problem within my Rails app that caused this.
I have no idea what it was, but I got it working again now.

But in the process I noticed another problem, sigh.

Ferret::Analysis::LowerCaseFilter in the GermanStemmingAnalyzer doesn’t
seem to work on the umlauts, which means I can’t find “Übersicht”
with the query “übersicht”.

From the documentation I read that “the current locale” is used. From
googling I couldn’t find any info about what setting is needed. Is
it “Ferret.locale”?

What locale do I need to set (on a Windows box) with the index running
in ISO-8859-1?

I tried
Ferret.locale = 'de_DE.ISO8859-1'
but that didn’t help. :frowning:
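Part of the confusion may simply be that a byte-wise, locale-unaware lowercase pass only touches ASCII letters. A quick plain-Ruby illustration of ASCII-only vs. Unicode-aware downcasing (note: Ruby's own String#downcase only became fully Unicode-aware in 2.4; before that it was ASCII-only too):

```ruby
word = "ÜBERSICHT"
# Byte-wise ASCII mapping, i.e. what a locale-unaware lowercase filter
# effectively does -- the umlaut is left alone:
ascii_lower   = word.tr('A-Z', 'a-z')
# Unicode-aware case mapping (Ruby 2.4+) lowercases the umlaut too:
unicode_lower = word.downcase

puts ascii_lower    # "Übersicht"
puts unicode_lower  # "übersicht"
```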

This is the GermanStemmingAnalyzer I’m using:

class GermanStemmingAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis

  def initialize(stop_words = FULL_GERMAN_STOP_WORDS)
    @stop_words = stop_words
  end

  def token_stream(field, str)
    StemFilter.new(
      StopFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)), @stop_words),
      'de', 'ISO_8859_1')
  end
end
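If the locale-dependent lowercasing stays unreliable, one possible workaround (a sketch, not a Ferret API; the stemmed index terms further down the thread, like “regionalanasthesi” and “prufprotokoll”, suggest the German stemmer folds ä/ö/ü to a/o/u anyway) is to normalize umlauts yourself before the text reaches the analyzer, and apply the same folding to query strings so wildcard prefixes line up with what is stored:

```ruby
# Hypothetical pre-analysis folding: map umlauts to their base vowels
# and ß to ss, mirroring the folding visible in the stemmed terms.
UMLAUT_FOLD = {
  'ä' => 'a', 'ö' => 'o', 'ü' => 'u',
  'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U',
  'ß' => 'ss'
}.freeze

def fold_umlauts(str)
  str.gsub(/[äöüÄÖÜß]/) { |c| UMLAUT_FOLD[c] }
end

puts fold_umlauts("Regionalanästhesie")  # "Regionalanasthesie"
puts fold_umlauts("Übersicht")           # "Ubersicht"
```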

Or maybe not a bug :S

So back to square one :frowning:

require 'rubygems'
require 'ferret'

Ferret.locale = '' # "de_DE.iso88591"

i = Ferret::I.new

i << 'Übersicht'
i << 'übersicht'

for q in ['Übersicht', 'übersicht', 'Über*', 'über*', '*bersicht']
  puts "#{q} : #{i.search(q).total_hits} hit(s)"
end

With an empty locale, the test script works in the new version as
well.

But in my Rails app the aaf-generated index will have broken umlauts
with an empty Ferret.locale.
E.g. the word “Übersicht” in the index shows this behavior when queried:
“Übersicht” = hit
“übersicht” = hit
“Übers*” = no hit
“übers*” = no hit
“*bersicht” = hit (?!?!)
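The odd trailing-wildcard hit actually fits what mislabeled bytes look like: the two UTF-8 bytes of “Ü” (0xC3 0x9C), re-read under a single-byte encoding, become two separate characters in front of “bersicht”. A sketch in modern Ruby (String#encode; Iconv did the same job in 1.8):

```ruby
# "Broken umlauts" demo: re-read the UTF-8 bytes of "Ü" as a
# single-byte encoding, and one character becomes two.
good   = "Übersicht"  # 9 characters; "Ü" is 2 bytes in UTF-8
broken = good.b.force_encoding('Windows-1252').encode('UTF-8')

puts broken.length                # 10: "Ü" turned into two characters
puts broken.end_with?('bersicht') # true -- so a suffix wildcard can still match
```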

With the locale set to “de_DE.iso88591” the umlauts seem correct, but
matching is case-sensitive.

Query
“Übersicht” = hit
“übersicht” = no hit
“Übers*” = hit
“ÜBERSICHT” = hit
“üBERSICHT” = no hit
“ÜBERsi*” = hit

I simplified my model a bit to speed up the 200 index rebuilds I’ve done
over the last few days:

acts_as_ferret({ :fields => [ :title ], :remote => true },
               { :analyzer => GermanStemmingAnalyzer.new })

def title
  Iconv.new('ISO-8859-1', 'UTF-8').iconv(self.xstrtitle.to_s)
end
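As an aside, Iconv was removed from Ruby's standard library in 1.9; on modern Ruby the same conversion is done with String#encode (a sketch, assuming xstrtitle holds UTF-8):

```ruby
# Equivalent of Iconv.new('ISO-8859-1', 'UTF-8').iconv(s) on Ruby 1.9+:
def to_latin1(s)
  s.to_s.encode('ISO-8859-1')
end

latin1 = to_latin1("Flächendesinfektion")
puts latin1.encoding.name  # "ISO-8859-1"
puts latin1.bytesize       # 19 -- the ä is a single byte in Latin-1
```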

Here are a couple of terms from the index:

["massnahm",2],
["medi",1],
["medikament",1],
["patientenwert",1],
["patientinn",1],
["prufprotokoll",1],
["regionalanasthesi",2],
["reisekostenabrechn",1],
["reparaturanzeig",2],
["schwachelt",1],
["sonderw",1],
["ssnahmenkurz",1],
["stundenabrechn",2],
["sturzereignisprotokoll",1],
["urlaubsubertrag",1],
["verwalt",1],
["zuschlagsformular",1],
["zytostatica",2],
["Äquivalenzdos",1],
["Übergabeprotokoll",1],
["Übersicht",1],
["Überstundendokumentation",1]]

The lowercase umlauts seem to be properly processed by the lowercase
filter through the stemming analyzer; only the four terms at the end
that start with uppercase umlauts are left unprocessed :frowning:

Any ideas? I can’t think of anything else I could try (except Solr) :frowning: