Strange Results For Term Frequencies

I would like to thank all the people who have contributed to this very
fine project. Great work!

I’ve encountered some strange results while examining the term frequency
of one of my indexed documents. The indexed terms seem to vary for the
very same document depending on the presence or absence of completely
unrelated operations in the code, so the resulting term frequency
changes, too.

I repeatedly call ‘index_reader.term_docs_for’ for the only document
I’ve indexed in the snippet below. Depending on whether the statement
‘dummy_count = 0’ or some output-formatting code is present, the
resulting term frequencies change from correct to wrong. Sometimes
terms are not found at all.
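
As a point of reference for what "correct" means here, the following is a minimal sketch of an independent count in plain Ruby. It does not use Ferret and does no stemming, so inflected forms that a German stemmer would fold together are counted separately. The helper name `term_frequencies` and the sample sentence are hypothetical, and the `\p{L}` character class assumes Ruby 1.9 or later rather than the 1.8.5 used in this thread.

```ruby
# Count how often each token occurs in a text, without any stemming.
# This is a plain-Ruby reference count, not Ferret's analysis chain.
def term_frequencies(text)
  counts = Hash.new(0)
  # \p{L}+ matches runs of Unicode letters, so umlauts stay inside a token.
  text.scan(/\p{L}+/) { |token| counts[token.downcase] += 1 }
  counts
end

# Hypothetical sample text, not the article used in the post.
sample = "Der Vertrag regelt die Leistung. Ohne Vertrag keine Leistung, sagt das BGB."
freqs = term_frequencies(sample)

%w[vertrag leistung bgb].each do |term|
  puts "#{term}: #{freqs[term]}"
end
```

Because nothing in this count depends on surrounding statements, it should give the same numbers on every run, which makes it usable as a baseline when the index itself is under suspicion.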

To make this easier to examine, I’ve added a complete snippet that
reproduces the behavior on my system (the text is taken from
http://de.wikipedia.org/wiki/Entgelt). I’m working with Ferret version
0.11.3, with the C extensions compiled with VC6.0 (the 0.10.9-mswin32
binaries from the ferret gem show the same behavior), and Ruby version
1.8.5.

Does anybody have an explanation for this, or am I misusing something?

require 'rubygems'
require 'ferret'

$KCODE = 'u'

text = <<END_OF_TEXT
Der Begriff Entgelt (n.; Plural “Entgelte”) bezeichnet die in einem
Vertrag…
END_OF_TEXT

class StemAnalyzer < Ferret::Analysis::Analyzer
  def token_stream(field, str)
    return Ferret::Analysis::StemFilter.new(
      Ferret::Analysis::StandardTokenizer.new(str), "german")
  end
end

puts "Using Ferret v#{Ferret::VERSION}..."
puts "Using Ruby v#{VERSION}..."

@index = Ferret::I.new(:analyzer => StemAnalyzer.new())

@index << {:title => "Entgelt", :content => text}

#dummy_count = 0

index_reader = @index.reader

tde = index_reader.term_docs_for(:content, "Vertrag")
tde.each{|did,freq| puts "Term 'Vertrag' occurs in Document '#{@index[did][:title]}' #{freq} times (5 expected)\n"}

tde = index_reader.term_docs_for(:content, "BGB")
tde.each{|did,freq| puts "Term 'BGB' occurs in Document '#{@index[did][:title]}' #{freq} times (3 expected)\n"}

tde = index_reader.term_docs_for(:content, "Leistung")
tde.each{|did,freq| puts "Term 'Leistung' occurs in Document '#{@index[did][:title]}' #{freq} times (12 expected)\n"}

Output:
=> Using Ferret v0.11.3...
=> Using Ruby v1.8.5...
=> Term 'Vertrag' occurs in Document 'Entgelt' 4 times (5 expected)
=> Term 'Leistung' occurs in Document 'Entgelt' 3 times (12 expected)

Output after uncommenting ‘dummy_count = 0’:
=> Using Ferret v0.11.3...
=> Using Ruby v1.8.5...
=> Term 'Vertrag' occurs in Document 'Entgelt' 5 times (5 expected)
=> Term 'BGB' occurs in Document 'Entgelt' 3 times (3 expected)
=> Term 'Leistung' occurs in Document 'Entgelt' 12 times (12 expected)
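
The difference between the two runs is easier to see when the observed counts are compared against the expected ones programmatically. The following is only a sketch: the numbers are copied from the first run above (with ‘dummy_count = 0’ commented out), and the helper name `compare_frequencies` is hypothetical. It assumes Ruby 1.9 or later, where hashes keep insertion order.

```ruby
# Expected and observed frequencies, copied from the first run in the post.
expected = { "Vertrag" => 5, "BGB" => 3, "Leistung" => 12 }
observed = { "Vertrag" => 4, "Leistung" => 3 } # "BGB" produced no output line

# Returns one [term, expected, observed, status] row per expected term.
def compare_frequencies(expected, observed)
  expected.map do |term, want|
    got = observed[term]
    status = if got.nil?
               "missing"
             elsif got == want
               "ok"
             else
               "mismatch"
             end
    [term, want, got, status]
  end
end

report = compare_frequencies(expected, observed)
report.each do |term, want, got, status|
  puts "#{term}: expected #{want}, got #{got.inspect} (#{status})"
end
```

A term that yields no line at all, as ‘BGB’ does in the first run, shows up as "missing" rather than silently dropping out of the comparison.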

On 3/21/07, Thomas S. wrote:

Has anybody an explanation for that or do I misuse something?

Hi Thomas,

Firstly, well done compiling Ferret on Windows and thanks for posting
this. The reason I haven’t yet released a win32 gem is that I’m still
trying to work out the String#dump issue which is wreaking havoc when
people try and use Ferret with Rails on Windows. I suspect this issue
of yours is somehow related. I’ll let you know as soon as I find a
solution.

Cheers,
Dave

David B. wrote:

Has anybody an explanation for that or do I misuse something?

I ran the test code on both the 0.10.9 win32 gem and on 0.11.3 under
Cygwin.

Here are the results:

dummy_count = 0

Using Ferret v0.10.9...
Using Ruby v1.8.5...
Term 'Vertrag' occurs in Document 'Entgelt' 4 times (5 expected)
Term 'BGB' occurs in Document 'Entgelt' 1 times (3 expected)
Term 'Leistung' occurs in Document 'Entgelt' 5 times (12 expected)

Using Ferret v0.11.3...
Using Ruby v1.8.5...
Term 'Vertrag' occurs in Document 'Entgelt' 5 times (5 expected)
Term 'BGB' occurs in Document 'Entgelt' 9 times (3 expected)
Term 'Leistung' occurs in Document 'Entgelt' 12 times (12 expected)

dummy_count = 0

C:\Documents and Settings\Patrick R.\ruby>ruby tf_test.rb
Using Ferret v0.10.9...
Using Ruby v1.8.5...
Term 'Vertrag' occurs in Document 'Entgelt' 4 times (5 expected)
Term 'BGB' occurs in Document 'Entgelt' 1 times (3 expected)
Term 'Leistung' occurs in Document 'Entgelt' 5 times (12 expected)

Using Ferret v0.11.3...
Using Ruby v1.8.5...
Term 'Vertrag' occurs in Document 'Entgelt' 5 times (5 expected)
Term 'BGB' occurs in Document 'Entgelt' 9 times (3 expected)
Term 'Leistung' occurs in Document 'Entgelt' 12 times (12 expected)

The results don’t seem to change when dummy_count is set. I think the
difference between the Cygwin build and the straight win32 build is the
UTF-8 support.
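
To illustrate why UTF-8 handling could matter for German text at all, here is a small plain-Ruby sketch. It makes no claim about how Ferret's tokenizer works on either build; it only shows how an ASCII-only pattern and a Unicode-aware pattern split the same word differently. The example word is hypothetical, and the code assumes Ruby 1.9 or later.

```ruby
# "Vertr\u00E4ge" is "Verträge"; the escape forces a UTF-8 string.
word = "Vertr\u00E4ge"

puts word.length   # number of characters
puts word.bytesize # number of bytes in UTF-8 (the umlaut takes two)

# An ASCII-only pattern breaks the word at the umlaut...
ascii_tokens = word.scan(/[A-Za-z]+/)
# ...while a Unicode-aware pattern keeps it in one piece.
unicode_tokens = word.scan(/\p{L}+/)

puts ascii_tokens.inspect
puts unicode_tokens.inspect
```

If two builds differed in this way, words containing umlauts would be indexed as different terms, which would be one possible source of differing counts between platforms.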

Cheers!
Patrick