On 4/4/07, Jens K. wrote:
def token_stream field, input
pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
a method a_standard_get_ts that clones an existing token stream instance
and calls a method named reset on it, with the text to be tokenized.
I guess we’ll need Dave’s help to sort this out…
Ok, I can see why this is confusing. To show you how it works,
try this code:
require 'rubygems'
require 'ferret'
require 'pp'
require 'strscan'

include Ferret::Analysis
include Ferret::Index

class TestAnalyzer
  class TestTokenizer
    def initialize(input)
      puts "initialize => (#{input})"
      @input = input
    end

    # Returns the whole input as a single token, then nil.
    def next()
      term, @input = @input, nil
      return term ? Token.new(term, 0, term.size) : nil
    end

    # Called when the token stream is reset with new text.
    def text=(text)
      puts "reset => (#{text})"
      @input = text
    end
  end

  def token_stream field, input
    pp field
    pp input
    TestTokenizer.new(input)
  end
end

pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
pfa[:test] = TestAnalyzer.new
index = Index.new(:analyzer => pfa)
index << {:test => 'foo'}
index.search_each('bar')
The output is:
:test
“”
initialize => ()
r_analysis.c, 563: cwrts_reset #<= debugging bug :-0
reset => (foo)
:test
“bar”
initialize => (bar)
There is a stray debugging line in there (the r_analysis.c one) which
I’m embarrassed I didn’t pick up earlier. Otherwise it should show you
what is happening. The tokenizer gets created with an empty string and
then TestTokenizer#text= gets called with the actual text. This was
actually an optimization for multi-string fields. For example:
index << {:test => ['one', 'two', 'three']}
=>
initialize => ()
reset => (one)
reset => (two)
reset => (three)
So the tokenizer only needs to be instantiated once and then it gets
reset for each string. This is a good example of premature optimization,
particularly since most people will never even have multi-string
fields like this. Getting rid of this optimization makes things a lot
clearer. The next version of Ferret will give this output:
index << {:test => ['one', 'two', 'three']}
=>
initialize => (one)
initialize => (two)
initialize => (three)
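
To make the difference between the two behaviours concrete, here is a rough pure-Ruby sketch that does not use Ferret at all. LoggingTokenizer, tokenize_with_reuse, tokenize_with_fresh_instances and drain are made-up names for illustration only; they are not part of Ferret and only imitate the call pattern described above (create once with an empty string and reset per value, versus create once per value).

```ruby
# Illustration only: none of these names exist in Ferret.
class LoggingTokenizer
  def initialize(input, log)
    @log = log
    @log << "initialize => (#{input})"
    @input = input
  end

  # Same idea as TestTokenizer#next: hand back the whole input once, then nil.
  def next
    term, @input = @input, nil
    term
  end

  def text=(text)
    @log << "reset => (#{text})"
    @input = text
  end
end

# Pull terms from a tokenizer until it returns nil.
def drain(tokenizer)
  terms = []
  while (term = tokenizer.next)
    terms << term
  end
  terms
end

# Assumed model of the current behaviour: one instance, reset per value.
def tokenize_with_reuse(values, log)
  tokenizer = LoggingTokenizer.new("", log)
  values.flat_map do |value|
    tokenizer.text = value
    drain(tokenizer)
  end
end

# Assumed model of the planned behaviour: a fresh instance per value.
def tokenize_with_fresh_instances(values, log)
  values.flat_map { |value| drain(LoggingTokenizer.new(value, log)) }
end

reuse_log = []
fresh_log = []
reuse_terms = tokenize_with_reuse(%w[one two three], reuse_log)
fresh_terms = tokenize_with_fresh_instances(%w[one two three], fresh_log)

puts reuse_log
puts fresh_log
```

Both strategies produce the same terms; only the sequence of initialize and reset calls differs. A custom tokenizer that sets up all of its state in initialize and ignores text= would only work correctly under the second strategy.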
So Ryan, you will now get the output you expect. It will require
updating to Ferret 0.11.4 though. Is there any reason this is a
problem?
Hope that helps,
Dave