How to make a custom TokenFilter?


#1

I found a very useful example in the O’Reilly Ferret Short Cut.
It explains how to make a custom Tokenizer,
but the book doesn’t explain how to make a custom Filter
(in particular, how to implement the #text=() method).

I’m a newbie and I don’t understand how to create my own custom
Filter.
Are there any good source code examples?


#2

On 4/8/07, James K. wrote:

> I found a very useful example in the O’Reilly Ferret Short Cut.
> It explains how to make a custom Tokenizer,
> but the book doesn’t explain how to make a custom Filter
> (in particular, how to implement the #text=() method).
>
> I’m a newbie and I don’t understand how to create my own custom
> Filter.
> Are there any good source code examples?

Hi James,

Thanks for buying the Ferret ShortCut. I’m assuming you’re talking
about implementing a custom TokenFilter and not a custom Filter (which
filters search results and can itself be implemented in two different
ways). Here is an example of a custom TokenFilter that reverses
tokens (obviously just a toy example):

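require 'ferret'            # assumes the ferret gem is installed
include Ferret::Analysis    # makes TokenStream available without the prefix

# Toy TokenFilter: wraps another token stream and reverses each token's text.
class MyReverseTokenFilter < TokenStream
  def initialize(token_stream)
    @token_stream = token_stream
  end

  # Pass new input text through to the wrapped stream.
  def text=(text)
    @token_stream.text = text
  end

  # Return the next token with its text reversed, or nil at end of stream.
  def next
    if token = @token_stream.next
      token.text = token.text.reverse
    end
    token
  end
end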

Notice that I wrote:

token.text = token.text.reverse

and not:

token.text.reverse!

You can’t change the string in place, because the text and text= methods
fetch and set a string stored inside a C struct. The same
goes for sub!, downcase!, lstrip!, etc.
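
To see why the in-place version silently does nothing, here is a small pure-Ruby sketch. CopyingToken is a hypothetical stand-in, not a Ferret class: it assumes the reader returns a fresh copy of the stored text on every call, which is roughly what an accessor backed by a C struct would do.

```ruby
# Hypothetical stand-in for a token whose text lives outside Ruby.
# Assumption: every call to #text builds a new String from the stored data.
class CopyingToken
  def initialize(text)
    @stored = text.dup
  end

  # Hands back a copy, so mutating the result never touches @stored.
  def text
    @stored.dup
  end

  # Replaces the stored data with a copy of the new string.
  def text=(new_text)
    @stored = new_text.dup
  end
end

token = CopyingToken.new("ferret")

token.text.reverse!               # mutates a temporary copy only
after_bang = token.text           # stored text is unchanged

token.text = token.text.reverse   # writes the reversed string back
after_assign = token.text         # stored text is now reversed

puts after_bang
puts after_assign
```

The first puts still shows the original text and the second shows it reversed, which is the reason for always going through text= when rewriting a token.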

Let me know if you need any more help with this.

Cheers,
Dave


#3

Hi, Dave.

Thank you very much for your kind explanation; I understand it now.
(I was very pleased when I bought your ShortCut. It’s very, very
useful…)

Anyway, I have two more questions.

  1. Is there any difference between extending the TokenStream class
    and just using a CustomTokenFilter on its own, without extending TokenStream?

  2. What exactly is the purpose of the text=() method?
    As far as I know, Lucene has no method of that name.
    What happens if I don’t implement this method?

Thank you.


#4

On 4/10/07, James K. wrote:

> Hi, Dave.
>
> Thank you very much for your kind explanation; I understand it now.
> (I was very pleased when I bought your ShortCut. It’s very, very
> useful…)
>
> Anyway, I have two more questions.
>
> 1. Is there any difference between extending the TokenStream class
>    and just using a CustomTokenFilter on its own, without extending TokenStream?

No, currently there is no difference, so there is no need
for you to extend TokenStream. I may extend the TokenStream class in
the future, making it necessary or at least advantageous to extend it,
but I can’t see how or why I would do this at the moment. At any rate,
I’ll give plenty of warning if I do make it necessary to extend
TokenStream, so it is up to you whether you want to.
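
As a rough illustration of what "no need to extend TokenStream" means in practice, here is a sketch in plain Ruby with no Ferret dependency. All three Sketch* names are hypothetical stand-ins, and the assumption is that whatever consumes the stream only ever calls next on it.

```ruby
# Stand-ins for illustration only; none of these are Ferret classes.
SketchToken = Struct.new(:text)

# Splits on whitespace and hands out one token per call.
class SketchWhitespaceTokenizer
  def initialize(text)
    @words = text.split
  end

  # Returns the next token, or nil once the input is exhausted.
  def next
    word = @words.shift
    word && SketchToken.new(word)
  end
end

# A filter with no superclass: it just wraps another stream and
# answers the same #next message.
class SketchReverseFilter
  def initialize(token_stream)
    @token_stream = token_stream
  end

  def next
    token = @token_stream.next
    token.text = token.text.reverse if token
    token
  end
end

stream = SketchReverseFilter.new(SketchWhitespaceTokenizer.new("hello ferret"))
while (token = stream.next)
  puts token.text
end
```

Inheriting from TokenStream would not change how this filter behaves; it would only matter if the base class gained behaviour of its own later.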

> 2. What exactly is the purpose of the text=() method?
>    As far as I know, Lucene has no method of that name.
>    What happens if I don’t implement this method?

It was an unnecessary “optimization”. It allows you to use a single
TokenStream to tokenize multiple strings. As of Ferret 0.11.4, Ferret
shouldn’t use it anymore (although it still gets used in the unit
tests), so you should be able to leave it out of your implementation. If
you do run into problems by not implementing the text=() method, then
it is a bug, so please let me know.
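
For anyone curious what reusing a single stream for multiple strings looks like, here is a minimal sketch in plain Ruby. The class names and the drain helper are hypothetical and are not part of Ferret, and tokens are plain strings here to keep it short. The point is only that text= resets the innermost stream, and each filter forwards the call down the chain.

```ruby
# Hypothetical source stream: text= swaps in new input without
# building a new object.
class SketchWordStream
  def initialize(text = "")
    self.text = text
  end

  def text=(text)
    @words = text.split
  end

  # Returns the next word, or nil when the current input is used up.
  def next
    @words.shift
  end
end

# Hypothetical filter: forwards text= to the wrapped stream so the
# whole chain can be pointed at a new string.
class SketchUpcaseFilter
  def initialize(token_stream)
    @token_stream = token_stream
  end

  def text=(text)
    @token_stream.text = text
  end

  def next
    word = @token_stream.next
    word && word.upcase
  end
end

# Helper that collects every remaining token from a stream.
def drain(stream)
  tokens = []
  while (token = stream.next)
    tokens << token
  end
  tokens
end

stream = SketchUpcaseFilter.new(SketchWordStream.new)

stream.text = "first document"
first = drain(stream)

stream.text = "second one"      # same objects, new input
second = drain(stream)

p first
p second
```

If the filter left out text=, the only option would be to build a fresh tokenizer and filter for each string, which matches the note above that the method is optional from 0.11.4 onwards.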