Trouble with PerFieldAnalyzer

ryank · March 29, 2007, 12:02am

I’m having trouble with PerFieldAnalyzer (ferret version 0.10.14).

Script:
require 'rubygems'
require 'ferret'
require 'pp'

include Ferret::Analysis
include Ferret::Index

class TestAnalyzer
  def token_stream(field, input)
    pp field
    pp input
    LetterTokenizer.new(input)
  end
end

pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
pfa[:test] = TestAnalyzer.new
index = Index.new(:analyzer => pfa)
index << {:test => 'foo'}
index.search_each('bar')

Output:

:test
""
:test
"bar"

Why is input "" the first time token_stream is called?

I hope that the answer isn’t “upgrade to 0.11”.

-ryan

ryank · April 2, 2007, 9:06pm

On 3/28/07, Ryan K. wrote:

Why is input "" the first time token_stream is called?

I hope that the answer isn’t “upgrade to 0.11”.

FWIW, I upgraded to 0.11.3 on my test box and it didn’t change
anything. Are my assumptions about PFA wrong? Or is there a bug?

-ryan

ryank · April 3, 2007, 11:40am

On Mon, Apr 02, 2007 at 11:57:37AM -0700, Ryan K. wrote:

On 3/28/07, Ryan K. wrote:
[…]

FWIW, I upgraded to 0.11.3 on my test box and it didn’t change
anything. Are my assumptions about PFA wrong? Or is there a bug?

I guess that’s a bug - I can perfectly reproduce that behaviour here.

The funny thing is that this does not necessarily mean that it doesn’t
work as intended. Just for fun I wrote an analyzer that completely
ignores the input it should analyze, and always uses a fixed text
instead:

class TestAnalyzer
  def token_stream(field, input)
    # Deliberately ignore the input and tokenize a fixed string instead.
    ts = LetterTokenizer.new('senseless standard text')
    puts "token_stream for :#{field} and input <#{input}>: #{ts.inspect}\n #{ts.text}"
    ts
  end
end

a = TestAnalyzer.new
ts = a.token_stream :test, 'foo bar'
puts ts.text # 'senseless standard text' as expected

pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
pfa[:test] = TestAnalyzer.new
ts = pfa.token_stream :test, 'foo bar'
puts ts.text # surprise: 'foo bar'

I guess the pfa does not give the text to analyze via the token_stream
method, but sets it later by using the Tokenizer’s text=() method.

Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
[email protected] | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

ryank · April 3, 2007, 7:34pm

On 4/3/07, Jens K. wrote:

I guess the pfa does not give the text to analyze via the token_stream
method, but sets it later by using the Tokenizer’s text=() method.

I don’t think so. I’ve tried overriding #text=, but it never gets
called.

-ryan

ryank · April 4, 2007, 10:24am

On Tue, Apr 03, 2007 at 10:29:49AM -0700, Ryan K. wrote:

On 4/3/07, Jens K. wrote:
[…]

I guess the pfa does not give the text to analyze via the token_stream
method, but sets it later by using the Tokenizer’s text=() method.

I don’t think so. I’ve tried overriding #text=, but it never gets called.

ok, then it’s happening somewhere else - in ferret’s analysis.c there’s
a method a_standard_get_ts that clones an existing token stream instance
and calls a method named reset on it, with the text to be tokenized.

I guess we’ll need Dave’s help to sort this out…

Jens


ryank · April 6, 2007, 10:32pm

On 4/5/07, David B. wrote:

So Ryan, you will now get the output you expect. It will require
updating to Ferret 0.11.4 though. Is there any reason this is a
problem?

I’m at the point where I need to upgrade for other reasons anyway, so
it shouldn’t be a problem.

Thanks for your help.

-ryan

ryank · April 6, 2007, 7:03am

On 4/4/07, Jens K. wrote:

ok, then it’s happening somewhere else - in ferret’s analysis.c there’s
a method a_standard_get_ts that clones an existing token stream instance
and calls a method named reset on it, with the text to be tokenized.

I guess we’ll need Dave’s help to sort this out…

Ok, I can see why this is confusing. To try and show you how it works,
try this code:

require 'rubygems'
require 'ferret'
require 'pp'
require 'strscan'

include Ferret::Analysis
include Ferret::Index

class TestAnalyzer
  # Minimal tokenizer that logs how it is constructed and reset.
  class TestTokenizer
    def initialize(input)
      puts "initialize => (#{input})"
      @input = input
    end

    # Emit the whole input as a single token, then nil.
    def next()
      term, @input = @input, nil
      return term ? Token.new(term, 0, term.size) : nil
    end

    def text=(text)
      puts "reset => (#{text})"
      @input = text
    end
  end

  def token_stream field, input
    pp field
    pp input
    TestTokenizer.new(input)
  end
end

pfa = PerFieldAnalyzer.new(StandardAnalyzer.new())
pfa[:test] = TestAnalyzer.new
index = Index.new(:analyzer => pfa)
index << {:test => 'foo'}
index.search_each('bar')

The output is:

:test
""
initialize => ()
r_analysis.c, 563: cwrts_reset #<= debugging bug :-0
reset => (foo)
:test
"bar"
initialize => (bar)

There is a stray debugging comment in there which I’m embarrassed I
didn’t pick up earlier. But otherwise it should show you what is
happening. The tokenizer gets created with an empty string and then
TestTokenizer#text= gets called. This was actually an optimization for
multi-string fields. For example:

index << {:test => ['one', 'two', 'three']}

=>

initialize => ()
reset => (one)
reset => (two)
reset => (three)

So the tokenizer only needs to be instantiated once and then it gets
reset for each string. This is a good example of premature optimization,
particularly since most people will never even have multi-string
fields like this. Getting rid of this optimization makes things a lot
clearer. The next version of Ferret will give this output:

index << {:test => ['one', 'two', 'three']}

=>

initialize => (one)
initialize => (two)
initialize => (three)
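
For readers without Ferret at hand, the two lifecycles can be imitated in
plain Ruby. This is only a rough sketch of the call order described above,
not Ferret's actual code: LoggingTokenizer, feed_with_reuse and
feed_with_fresh_instances are hypothetical names made up for illustration.

```ruby
# Plain-Ruby simulation; no Ferret required. All names here are hypothetical.
class LoggingTokenizer
  def initialize(input, log)
    @log = log
    @log << "initialize => (#{input})"
    @input = input
  end

  # Called by the driver to reuse this instance for a new string.
  def text=(text)
    @log << "reset => (#{text})"
    @input = text
  end
end

# Older behaviour as described in the thread: build once with "",
# then reset the same instance for every string in the field.
def feed_with_reuse(values, log)
  tokenizer = LoggingTokenizer.new('', log)
  values.each { |value| tokenizer.text = value }
  log
end

# Behaviour announced for the next release: a fresh tokenizer per string.
def feed_with_fresh_instances(values, log)
  values.each { |value| LoggingTokenizer.new(value, log) }
  log
end

reuse_log = feed_with_reuse(%w[one two three], [])
fresh_log = feed_with_fresh_instances(%w[one two three], [])
puts reuse_log
puts '=>'
puts fresh_log
```

The first log shows one construction followed by three resets; the second
shows three constructions and no resets, matching the two outputs quoted
above.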

So Ryan, you will now get the output you expect. It will require
updating to Ferret 0.11.4 though. Is there any reason this is a
problem?

Hope that helps,
Dave
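
A practical consequence of the explanation above is that a custom tokenizer
is safest if it behaves the same whether it receives its text in the
constructor or later through text=. The following is a minimal plain-Ruby
sketch of that idea. It does not use Ferret: SimpleToken, WordTokenizer and
drain are hypothetical stand-ins, and a real tokenizer would return
Ferret's own Token objects instead.

```ruby
# Plain-Ruby sketch; no Ferret required. SimpleToken and WordTokenizer are
# hypothetical stand-ins, not Ferret classes.
SimpleToken = Struct.new(:text, :start_offset, :end_offset)

class WordTokenizer
  def initialize(input = '')
    self.text = input
  end

  # Reset all scanning state, so reuse and fresh construction behave alike.
  def text=(text)
    @text = text.to_s
    @pos = 0
  end

  # Return the next run of letters as a token, or nil when input is exhausted.
  def next
    match = /[[:alpha:]]+/.match(@text, @pos)
    return nil unless match

    @pos = match.end(0)
    SimpleToken.new(match[0], match.begin(0), match.end(0))
  end
end

# Collect the text of every remaining token.
def drain(tokenizer)
  terms = []
  while (token = tokenizer.next)
    terms << token.text
  end
  terms
end

reused = WordTokenizer.new('')        # created empty, then given its text
reused.text = 'foo bar'
fresh = WordTokenizer.new('foo bar')  # created with the text directly
p drain(reused)
p drain(fresh)
```

Because all state lives behind text=, both construction paths yield the same
tokens, so the tokenizer does not depend on which Ferret version calls it.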