I would like to thank all the people who have contributed to this very
fine project. Great work!
I’ve encountered some strange results while examining the term frequency
of one of my indexed documents. The indexed terms seem to vary for the
very same document depending on the presence or absence of completely
unrelated operations in the code, so the resulting term frequency
changes, too.
I repeatedly call 'index_reader.term_docs_for' for the only document
I've indexed in the snippet below. Depending on whether the statement
'dummy_count = 0' (or some formatting code for the output) is present,
the resulting term frequencies change from correct to wrong. Sometimes
terms are not found at all.
To make this easier to examine, here is a complete snippet that
produces this behavior on my system (the text is taken from the
Wikipedia article "Entgelt"). I'm working with Ferret version 0.11.3,
C extensions compiled with VC6.0 (but the 0.10.9-mswin32 binaries from
the ferret gem show the same behavior), and Ruby version 1.8.5.
Does anybody have an explanation for this, or am I misusing something?
require 'rubygems'
require 'ferret'
$KCODE = 'u'
text = <<END_OF_TEXT
Der Begriff Entgelt (n.; Plural “Entgelte”) bezeichnet die in einem
Vertrag…
END_OF_TEXT
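# Analyzer that stems the standard tokenizer's output with the German stemmer.
class StemAnalyzer < Ferret::Analysis::Analyzer
  def token_stream(field, str)
    tokenizer = Ferret::Analysis::StandardTokenizer.new(str)
    # The expression must sit on the same line as 'return';
    # a bare 'return' followed by a line break returns nil.
    return Ferret::Analysis::StemFilter.new(tokenizer, "german")
  end
end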
puts "Using Ferret v#{Ferret::VERSION}…"
puts "Using Ruby v#{VERSION}…"
@index = Ferret::I.new(:analyzer => StemAnalyzer.new())
@index << {:title => "Entgelt", :content => text}
#dummy_count = 0
index_reader = @index.reader
tde = index_reader.term_docs_for(:content, "Vertrag")
tde.each do |did, freq|
  puts "Term 'Vertrag' occurs in Document '#{@index[did][:title]}' " +
       "#{freq} times (5 expected)\n"
end
tde = index_reader.term_docs_for(:content, "BGB")
tde.each do |did, freq|
  puts "Term 'BGB' occurs in Document '#{@index[did][:title]}' " +
       "#{freq} times (3 expected)\n"
end
tde = index_reader.term_docs_for(:content, "Leistung")
tde.each do |did, freq|
  puts "Term 'Leistung' occurs in Document '#{@index[did][:title]}' " +
       "#{freq} times (12 expected)\n"
end
Output:
=> Using Ferret v0.11.3…
=> Using Ruby v1.8.5…
=> Term 'Vertrag' occurs in Document 'Entgelt' 4 times (5 expected)
=> Term 'Leistung' occurs in Document 'Entgelt' 3 times (12 expected)
Output after uncommenting 'dummy_count = 0':
=> Using Ferret v0.11.3…
=> Using Ruby v1.8.5…
=> Term 'Vertrag' occurs in Document 'Entgelt' 5 times (5 expected)
=> Term 'BGB' occurs in Document 'Entgelt' 3 times (3 expected)
=> Term 'Leistung' occurs in Document 'Entgelt' 12 times (12 expected)
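The "expected" counts in the output can be cross-checked without Ferret
by tallying word forms in plain Ruby. The following is only a rough
sketch: 'raw_term_counts' is a hypothetical helper (not part of Ferret),
the sample string is a made-up stand-in for the real text, and it does
no stemming, so its numbers only match the index for terms whose
inflected forms are counted separately. It also assumes a Ruby with
Unicode-aware POSIX bracket classes (1.9 or later), not 1.8.5.

```ruby
# Hypothetical helper, not part of Ferret: count how often each exact
# word form occurs, using a plain regex tokenizer and no stemming.
def raw_term_counts(text)
  counts = Hash.new(0)
  # Each run of letters is treated as one token.
  text.scan(/[[:alpha:]]+/) { |word| counts[word] += 1 }
  counts
end

# Made-up sample text; substitute the real document text here.
sample = "Der Vertrag regelt die Leistung. Ein Vertrag nach BGB."
counts = raw_term_counts(sample)
%w[Vertrag BGB Leistung].each do |term|
  puts "#{term}: #{counts[term]}"
end
```

If these raw counts stay the same from run to run while the values
reported by 'term_docs_for' change, that would point at the indexing
side rather than at the input text.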