Forum: Ferret Problem with case sensitivity

Posted by Max Williams (max-williams)
on 2009-11-26 12:13
(Received via mailing list)
I'm using a custom stem analyser in my searches and my indexing.  The
analyser is defined thus:

module Ferret::Analysis
  class StemmingAnalyzer
    def token_stream(field, text)
      text.downcase!
      RAILS_DEFAULT_LOGGER.debug "SEARCHING, field = #{field.inspect}, text = #{text.inspect}"
      tokenizer = StandardTokenizer.new(text)
      filter = StemFilter.new(tokenizer)
      filter
    end
  end
end

I use it in my indexing like this:

  acts_as_ferret({ :store_class_name => true,
                   :ferret => { :analyzer => Ferret::Analysis::StemmingAnalyzer.new },
                   :fields => {:property_names =>  { :boost => 3.0 },
                               ....etc
                   }})

And in a search like this:

search_class.find_ids_with_ferret(search_term, {:limit => 10000,
    :analyzer => Ferret::Analysis::StemmingAnalyzer.new}) do |model, r_id, score|
      r_id = r_id.to_i
      ferret_ids << r_id
      self.scores_hash[r_id] = score
end

I have a problem with case sensitivity: searches only work when the
search term is lowercase, even though the text stored in the index looks
like it contains uppercase letters.  From the console:

>> resource.to_doc
=> {:resource_id=>"59", :property_names=>"Bb Clarinet Clarinet Family Woodwind Instrumental and Vocal Image Resources Types" }
>> TeachingObject.find_with_ferret("Vocal", :page => 1, :per_page => 1000).include?(resource)
=> false
>> TeachingObject.find_with_ferret("vocal", :page => 1, :per_page => 1000).include?(resource)
=> true

I think I have my stemming set up wrong; I'm not sure it is even being
used.  I implemented it so that searches would match both plural and
singular terms, and that seems to work, e.g.

>> TeachingObject.find_with_ferret("vocals", :page => 1, :per_page => 1000).include?(resource)
=> true

But the case sensitivity thing has me stumped.  I thought that the
downcase! call on the search term would make case irrelevant for
searching, but that does not seem to happen.  Can anyone set me straight?

Posted by Max Williams (max-williams)
on 2009-11-26 12:58
I think I fixed this.  I did three things:
- changed my custom analyser to inherit from Ferret::Analysis::Analyzer
- ditched the downcase! line
- instead of calling downcase!, added LowerCaseFilter.new(filter) to my
  filter chain

module Ferret::Analysis
  class StemmingAnalyzer < Ferret::Analysis::Analyzer
    def token_stream(field, text)
      RAILS_DEFAULT_LOGGER.debug "SEARCHING, field = #{field.inspect}, text = #{text.inspect}"
      tokenizer = StandardTokenizer.new(text)
      filter = StemFilter.new(tokenizer)
      low_filter = LowerCaseFilter.new(filter)
      low_filter
    end
  end
end
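
A side note on the dropped downcase! line: in Ruby, String#downcase! changes the receiver in place (and returns nil when nothing changed), so calling it inside token_stream also alters whatever string the caller passed in. The thread does not establish whether this caused the original failure. The sketch below only illustrates the difference between the mutating and non-mutating calls; the helper names are hypothetical and not part of Ferret.

```ruby
# Hypothetical helpers for illustration; they are not part of Ferret.

# Mutating: lowercases the caller's own String object.
def normalize_in_place(text)
  text.downcase!   # returns nil if text was already all lowercase
  text
end

# Non-mutating: returns a lowercased copy and leaves the argument untouched.
def normalize_copy(text)
  text.downcase
end

term = "Vocal".dup
normalize_copy(term)
puts term.inspect        # the caller's string is unchanged
normalize_in_place(term)
puts term.inspect        # the caller's string has been lowercased
```

Doing the lowercasing in a token filter, as in the revised analyser, avoids touching the caller's string at all.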

After calling ferret_update on the resource, I can now find it with
either 'vocal' or 'Vocal'.

I'd still welcome any further advice on this, in case I'm not doing
something right.

thanks, max
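
One point that may be worth checking in the chain above is the order of the filters: StemFilter runs before LowerCaseFilter. Snowball-style stemmers generally expect lowercase input, so stemming first can leave capitalised or all-caps tokens unstemmed. The sketch below uses toy stand-ins (hypothetical method names and a deliberately naive stemmer, not Ferret's classes) to show how the order can change the resulting tokens.

```ruby
# Toy stand-ins for a tokenizer and two token filters.
# These are hypothetical helpers, not Ferret classes.

def tokenize(text)
  text.scan(/[A-Za-z]+/)
end

def lowercase(tokens)
  tokens.map(&:downcase)
end

# Deliberately naive stemmer: strips one trailing lowercase "s" only,
# mimicking a stemmer whose rules are written for lowercase input.
def toy_stem(tokens)
  tokens.map { |t| t.sub(/s\z/, "") }
end

# Order used in the post: stem first, then lowercase.
def stem_then_lower(text)
  lowercase(toy_stem(tokenize(text)))
end

# Alternative order: lowercase first, then stem.
def lower_then_stem(text)
  toy_stem(lowercase(tokenize(text)))
end

p stem_then_lower("VOCALS Vocals vocals")
p lower_then_stem("VOCALS Vocals vocals")
```

In this toy version, stemming first leaves the all-caps token as "vocals", while lowercasing first reduces all three tokens to "vocal". If the real StemFilter behaves similarly, wrapping the tokenizer in LowerCaseFilter before StemFilter would be the safer order; that is an assumption to verify against the Ferret documentation, followed by a reindex.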