Bug when assigning new analyzer?

require 'rubygems'
require 'ferret'
include Ferret

PATH = '/tmp/ferret_stopwords_test'

index = Index::IndexWriter.new(:path => PATH, :create => true)

index.analyzer = Analysis::StandardAnalyzer.new([])
index << {:title => 'a few good men', :language => 'en'}

index.analyzer = Analysis::StandardAnalyzer.new(['men'])
index << {:title => 'a few good men', :language => 'nl'}

index.close

searcher = Index::Index.new(:path => PATH)
puts searcher.search('*:men AND language:nl').total_hits
#=> 1

i’d expect zero results, as ‘men’ is a stopword at the time of indexing
with language:nl. is this a bug or a lack of understanding on my part?

a workaround would be to close and reopen the index after every
language; that returns the expected zero. i don’t know how much
overhead that would be.

i am on ruby 1.8.5 / os x.

any assistance would be greatly appreciated since i have no clue why
this happens …

cheers,
phillip

  • addendum 1: i use ferret 0.11.4

  • addendum 2: when i comment out the first index.analyzer assignment, i
    get:
    /Users/phillip/Sites/ruby/playground/ferret_stopwords.rb:13: [BUG] Bus Error
    ruby 1.8.5 (2006-12-25) [i686-darwin8.8.2]

  • addendum 3: the underlying problem is that i have many different
    languages that have to be indexed correctly. is there a best
    practice for doing that? i mean, something better than having one
    index and switching the analyzer around?

thanks again,
phillip

On Wed, May 09, 2007 at 11:59:59PM +0200, Phillip O. wrote:

> with language:nl. is this a bug or a lack of understanding on my part.

Queries get analyzed, too, i.e. to remove stop words from them. So
you’ll have to use the correct language-dependent Analyzer for your
searcher, too.

Jens


Jens Krämer
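
To make the point about query analysis concrete, here is a toy sketch in plain Ruby. It does not use Ferret and is not how Ferret is implemented; `ToyAnalyzer`, `ToyIndex` and `ANALYZERS` are hypothetical names made up for this illustration. It only shows the general idea that the same language-dependent stop-word filtering is applied to the document at indexing time and to the query at search time.

```ruby
require 'set'

# Toy stand-in for a stop-word analyzer (hypothetical, not part of Ferret):
# lowercases the text, splits it into word tokens, drops the stop words.
class ToyAnalyzer
  def initialize(stop_words = [])
    @stop_words = Set.new(stop_words)
  end

  def tokens(text)
    text.downcase.scan(/\w+/).reject { |t| @stop_words.include?(t) }
  end
end

# Assumed per-language setup mirroring the thread:
# 'men' is a stop word for 'nl' only.
ANALYZERS = {
  'en' => ToyAnalyzer.new([]),
  'nl' => ToyAnalyzer.new(['men'])
}

# Minimal in-memory inverted index: term => list of document ids.
class ToyIndex
  def initialize
    @postings = Hash.new { |hash, term| hash[term] = [] }
    @docs = []
  end

  # Index a document with the analyzer that belongs to its language.
  def add(doc)
    id = @docs.size
    @docs << doc
    ANALYZERS.fetch(doc[:language]).tokens(doc[:title]).each do |term|
      @postings[term] << id
    end
    id
  end

  # Analyze the query with the analyzer of the requested language, then
  # return the ids of documents in that language containing every term.
  def search(query, language)
    terms = ANALYZERS.fetch(language).tokens(query)
    return [] if terms.empty? # the whole query consisted of stop words
    ids = terms.map { |t| @postings.fetch(t, []) }.inject { |a, b| a & b }
    ids.uniq.select { |id| @docs[id][:language] == language }
  end
end

index = ToyIndex.new
index.add(:title => 'a few good men', :language => 'en')
index.add(:title => 'a few good men', :language => 'nl')

puts index.search('men', 'en').inspect      # => [0]
puts index.search('men', 'nl').inspect      # => []
puts index.search('good men', 'nl').inspect # => [1]
```

In this sketch the 'nl' search for 'men' comes back empty because the term is dropped from both the document and the query. With a real Ferret index the analyzer would be passed to the searching side as well; check the Ferret documentation for the exact options.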

hi jens,

thanks for making that clear, and sorry for the long delay in replying.
we were quite busy.

cheers,
phillip