Problem with case sensitivity


#1

I’m using a custom stem analyser in my searches and my indexing. The
analyser is defined thus:

module Ferret::Analysis
  class StemmingAnalyzer
    def token_stream(field, text)
      text.downcase!
      RAILS_DEFAULT_LOGGER.debug "SEARCHING, field = #{field.inspect}, text = #{text.inspect}"
      tokenizer = StandardTokenizer.new(text)
      filter = StemFilter.new(tokenizer)
      filter
    end
  end
end

I use it in my indexing like this:

acts_as_ferret({ :store_class_name => true,
                 :ferret => { :analyzer => Ferret::Analysis::StemmingAnalyzer.new },
                 :fields => { :property_names => { :boost => 3.0 },
                              …etc
                            }})

And in a search like this:

search_class.find_ids_with_ferret(search_term, { :limit => 10000,
    :analyzer => Ferret::Analysis::StemmingAnalyzer.new }) do |model, r_id, score|
  r_id = r_id.to_i
  ferret_ids << r_id
  self.scores_hash[r_id] = score
end

I have a problem with case sensitivity: searches only work when the
search term is lowercase, even though the text stored in the index
appears to be capitalised. From the console:

resource.to_doc
=> {:resource_id=>"59", :property_names=>"Bb Clarinet Clarinet Family Woodwind Instrumental and Vocal Image Resources Types" }

TeachingObject.find_with_ferret("Vocal", :page => 1, :per_page => 1000).include?(resource)
=> false

TeachingObject.find_with_ferret("vocal", :page => 1, :per_page => 1000).include?(resource)
=> true

I think I have my stemming set up wrong, and I’m not sure it is even
being used. I implemented it so that searches would match both plural
and singular terms, and that seems to work, e.g.

TeachingObject.find_with_ferret("vocals", :page => 1, :per_page => 1000).include?(resource)
=> true

But the case sensitivity thing has me stumped. I thought that the
downcase! call on the search term would make case irrelevant for
searching, but that seems not to be the case. Can anyone set me straight?
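
For reference, a minimal plain-Ruby sketch of how String#downcase! behaves, since the analyser above relies on it. It is not a diagnosis of the Ferret behaviour described here; the two helper method names are hypothetical and exist only for illustration. The point is that downcase! changes the caller's string in place and returns nil when nothing needed changing, whereas downcase returns a new string.

```ruby
# Plain Ruby only; no Ferret classes are involved here.

# Mutating variant: changes the caller's string and returns the receiver,
# or nil when the string was already lowercase.
def lowercase_in_place(text)
  text.downcase!
end

# Non-mutating variant: leaves the argument alone and returns a new String.
def lowercase_copy(text)
  text.downcase
end

term = "Vocal".dup
lowercase_copy(term)      # returns "vocal"; term is still "Vocal"
lowercase_in_place(term)  # returns "vocal"; term itself is now "vocal"
lowercase_in_place(term)  # returns nil, because nothing changed
```

Because of the nil return value, the result of downcase! should not be passed on directly; either use the mutated string afterwards or call downcase instead.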


#2

I think I fixed this. I did three things:

  • changed my custom analyser to inherit from Ferret::Analysis::Analyzer
  • ditched the downcase! line
  • instead of doing downcase!, I added LowerCaseFilter.new(filter) to my
    chain

module Ferret::Analysis
  class StemmingAnalyzer < Ferret::Analysis::Analyzer
    def token_stream(field, text)
      RAILS_DEFAULT_LOGGER.debug "SEARCHING, field = #{field.inspect}, text = #{text.inspect}"
      tokenizer = StandardTokenizer.new(text)
      filter = StemFilter.new(tokenizer)
      low_filter = LowerCaseFilter.new(filter)
      low_filter
    end
  end
end

After calling ferret_update on the resource, I can now find it with
‘vocal’ or ‘Vocal’.

I’d still welcome any further advice on this, in case I’m not doing
something right.

thanks, max
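
On the request for further advice: one thing that may be worth double-checking is the order of the filters. Porter/Snowball-style stemmers generally expect lowercase input, so stemming before lowercasing could leave some upper-case terms unstemmed. Whether that applies to Ferret's StemFilter should be confirmed against its documentation. The sketch below is a toy illustration in plain Ruby, not Ferret's API: ToyStemmer and both pipeline methods are hypothetical names, and the stemmer does nothing more than strip a trailing lowercase "s".

```ruby
# Toy illustration in plain Ruby; none of these names come from Ferret.

# Hypothetical stemmer that, like many suffix strippers, only
# recognises lowercase suffixes. It strips a single trailing "s".
module ToyStemmer
  def self.stem(token)
    if token.length > 3 && token.end_with?("s")
      token[0..-2]
    else
      token
    end
  end
end

# Stem first, then lowercase (the order used in the analyser above).
def stem_then_lowercase(text)
  text.split.map { |token| ToyStemmer.stem(token).downcase }
end

# Lowercase first, then stem.
def lowercase_then_stem(text)
  text.split.map { |token| ToyStemmer.stem(token.downcase) }
end

stem_then_lowercase("VOCALS Vocals vocals")  # => ["vocals", "vocal", "vocal"]
lowercase_then_stem("VOCALS Vocals vocals")  # => ["vocal", "vocal", "vocal"]
```

In this toy only the fully upper-case term is affected, because its suffix is "S" rather than "s" at the moment the stemmer sees it. If the same holds for the real filters, wrapping the tokenizer in LowerCaseFilter first and then passing that to StemFilter would be the order to try, followed by a rebuild of the index so that indexed and searched terms go through the same chain.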