Forum: Ferret How to make a custom TokenFilter?

James Kim (coti22)
on 2007-04-08 07:05
In the O'Reilly Ferret Short Cut, I found an example that was very useful to me.
It explains how to make a custom Tokenizer,
but the book doesn't explain how to make a custom Filter
(especially how to implement the #text=() method).

I'm a newbie and I don't understand how to create my own custom
Filter.
Are there any good source code examples?
David Balmain (Guest)
on 2007-04-09 08:12
(Received via mailing list)

Hi James,

Thanks for buying the Ferret ShortCut. I'm assuming you're talking
about implementing a custom TokenFilter and not a custom Filter (which
filters search results and can itself be implemented in two different
ways). Here is an example of a custom TokenFilter which reverses
tokens (obviously just a toy example):

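  require 'ferret'

  # Toy TokenFilter that reverses the text of every token it passes on.
  class MyReverseTokenFilter < Ferret::Analysis::TokenStream
    # Wrap the upstream token stream (a tokenizer or another filter).
    def initialize(token_stream)
      @token_stream = token_stream
    end

    # Hand new input text down to the wrapped stream.
    def text=(text)
      @token_stream.text = text
    end

    # Return the next token with its text reversed, or nil at the end.
    def next
      if token = @token_stream.next
        # Assign through the setter; see the note on in-place changes.
        token.text = token.text.reverse
      end
      token
    end
  end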


Notice that I wrote:

  token.text = token.text.reverse

and not:

  token.text.reverse!

You can't change the string in place because the text and text= methods
fetch and set a string held inside a C struct, so the string you get
back is a copy. The same goes for sub!, downcase!, lstrip!, etc.
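
The effect can be reproduced in plain Ruby without Ferret. The sketch below is only an illustration: CopyingToken is a hypothetical stand-in, not a Ferret class, and it assumes that reading the text returns a fresh copy each time, which is how a getter backed by a C struct would typically behave.

```ruby
# Hypothetical stand-in for a token whose text is stored outside Ruby's reach.
class CopyingToken
  def initialize(text)
    @stored = text.dup.freeze
  end

  # Reading returns a fresh copy, never the stored string itself.
  def text
    @stored.dup
  end

  # Writing copies the new value into the store.
  def text=(value)
    @stored = value.dup.freeze
  end
end

token = CopyingToken.new("ferret")
token.text.reverse!              # mutates a throwaway copy only
unchanged = token.text           # still "ferret"
token.text = token.text.reverse  # goes through the setter
changed = token.text             # now "terref"
```

The bang method succeeds, but it operates on a temporary string that is discarded straight away, so only the assignment form changes the token.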

Let me know if you need any more help with this.

Cheers,
Dave
James Kim (coti22)
on 2007-04-09 17:19
Hi, Dave.

Thank you very much for your kind explanation; I understand it now.
(I was very pleased when I bought your ShortCut. It's very, very
useful.)

Anyway, I have two more questions.

1. Is there any difference between extending the TokenStream class
and just using a custom TokenFilter without extending TokenStream?

2. What exactly is the purpose of the text=() method?
As far as I know, Lucene has no method of that name.
What happens if I don't implement this method?

Thank you.
David Balmain (Guest)
on 2007-04-12 11:45
(Received via mailing list)
On 4/10/07, James Kim <sjoonk@gmail.com> wrote:
> 1. Is there any difference between extending the TokenStream class
> and just using a custom TokenFilter without extending TokenStream?

No, currently there is no difference, so there is no need for you to
extend TokenStream. I may extend the TokenStream class in the future,
making it necessary or at least advantageous to extend it, but I can't
see how or why I would do this at the moment. At any rate, I'll give
plenty of warning if I do make it necessary to extend TokenStream, so
it is up to you whether you want to.
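
As a rough illustration of a filter that extends nothing, here is a minimal sketch in plain Ruby. SimpleToken, WhitespaceSplitter and UpcaseFilter are hypothetical stand-ins, not Ferret classes; the only assumption is that the consumer of the stream calls next until it gets nil.

```ruby
# Hypothetical stand-ins, not part of Ferret.
SimpleToken = Struct.new(:text)

# Minimal token source: splits its input on whitespace.
class WhitespaceSplitter
  def initialize(text)
    @words = text.split
  end

  # Return the next token, or nil when the input is exhausted.
  def next
    word = @words.shift
    word && SimpleToken.new(word)
  end
end

# A filter with no superclass: it only has to respond to #next.
class UpcaseFilter
  def initialize(token_stream)
    @token_stream = token_stream
  end

  def next
    token = @token_stream.next
    token.text = token.text.upcase if token
    token
  end
end

stream = UpcaseFilter.new(WhitespaceSplitter.new("custom token filter"))
words = []
while (token = stream.next)
  words << token.text
end
# words == ["CUSTOM", "TOKEN", "FILTER"]
```

Whether a given Ferret version accepts such an object in place of a TokenStream subclass depends on that version, so treat this as a sketch of the duck-typing idea only.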

> 2. What exactly is the purpose of the text=() method?
> As far as I know, Lucene has no method of that name.
> What happens if I don't implement this method?

It was an unnecessary "optimization". It allows you to use a single
TokenStream to tokenize multiple strings. As of Ferret 0.11.4, Ferret
shouldn't use it anymore (although it does still get used in the unit
tests), so you should be able to leave it out of your implementation. If
you do run into problems by not implementing the text=() method, then
it is a bug, so please let me know.
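
For readers wondering what reusing a single stream for multiple strings looks like, here is a minimal sketch in plain Ruby. ReusableToken, ResettableSplitter, ReverseFilter and drain are hypothetical names invented for the illustration; none of them are Ferret APIs.

```ruby
# Hypothetical stand-ins, not part of Ferret.
ReusableToken = Struct.new(:text)

# Token source that can be pointed at a new string via text=.
class ResettableSplitter
  def initialize(text = "")
    self.text = text
  end

  # Replace the pending input with a new string.
  def text=(text)
    @words = text.split
  end

  # Return the next token, or nil when the input is exhausted.
  def next
    word = @words.shift
    word && ReusableToken.new(word)
  end
end

# Filter that forwards text= so the whole chain is reset at once.
class ReverseFilter
  def initialize(token_stream)
    @token_stream = token_stream
  end

  def text=(text)
    @token_stream.text = text
  end

  def next
    token = @token_stream.next
    token.text = token.text.reverse if token
    token
  end
end

# Collect the text of every remaining token in the stream.
def drain(stream)
  out = []
  while (token = stream.next)
    out << token.text
  end
  out
end

chain = ReverseFilter.new(ResettableSplitter.new("abc def"))
first = drain(chain)   # ["cba", "fed"]
chain.text = "xyz"     # same objects, new input
second = drain(chain)  # ["zyx"]
```

The saving is that no new tokenizer or filter objects are allocated for the second string; a filter without text= simply has to be rebuilt for each input instead.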