Ruby Forum Ferret > Problem if method is called during Analyzer.token_stream operation

Posted by S D (Guest)
on 23.04.2008 06:54
(Received via mailing list)
I've written a tokenizer/analyzer that parses a file and extracts
tokens, and I run it on ASCII data consisting of XML files (the
tokenizer skips over XML elements but maintains relative positioning).
I've written many unit tests to check the produced token stream and was
confident that the tokenizer was working properly. Then I noticed two
problems:

   1. StopFilter (using English stop words) does not properly filter the
      token stream output from my tokenizer. If I explicitly pass an
      array of stop words to the stop filter it still doesn't work. If I
      simply switch my tokenizer to a StandardTokenizer the stop words
      are appropriately filtered (of course the XML tags are treated
      differently).
   2. When I try a simple search no results come up. I can see that my
      tokenizer is adding files to the index but a simple search (using
      Ferret::Index::Index.search_each) produces no results.

I'm now trying to track down the above problem which seems to have led
me to another (though possibly related) problem for which I am seeking
an answer. Below is the token_stream() method of my analyzer
(XMLAnalyzer). Note that I've commented out my custom tokenizer
(XMLTokenizer) so that the StandardTokenizer is being used within my
custom analyzer.

     def token_stream(field, str)
          # ts = XMLTokenizer.new(str)
          ts = StandardTokenizer.new(str)
          # test_token_stream(ts)
          ts
     end

In the above I've commented out the test_token_stream() method taken
from Balmain's Ferret book (O'Reilly, pg 68) that simply prints out the
tokens contained within a stream; i.e.,:

     def test_token_stream(token_stream)
          puts "\033[32mStart | End | PosInc | Text\033[m"
          while tkn = token_stream.next
               puts "%5d |%4d |%5d   | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
          end
     end

If I keep test_token_stream() commented out then the indexing and search
work fine (using StandardTokenizer). However, if I do not comment out
test_token_stream() then creating the index appears to work fine but a
search produces no results. I haven't been able to track this down but
thought it might be related to the problems I was having with
XMLTokenizer.
Note that I create my index with Ferret::Index::Index:

  index = Index::Index.new(:analyzer => XMLAnalyzer.new(),
                           :path => options.indexLocation,
                           :create_if_missing => true)

and I perform searches using Ferret::Search::Searcher.

Any thoughts would be appreciated.

Regards,
John
aka sd.codewarrior
Posted by Jens Krämer (jkraemer)
on 23.04.2008 10:14
(Received via mailing list)
Hi!

First guess - the test_token_stream method removes items from the stream
by calling next(), so the stream is empty when you return it, and Ferret
has nothing left to index.
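
To make that concrete, here is a minimal sketch in plain Ruby. It does
not use Ferret at all: SketchTokenStream, drain, token_stream_consumed
and token_stream_fresh are hypothetical names made up for this example,
and the only assumption carried over from the thread is that next()
hands out each token once and returns nil when the stream is exhausted.

```ruby
# Hypothetical stand-in for a token stream (not Ferret's class): `next`
# hands out each token once and returns nil when nothing is left.
class SketchTokenStream
  def initialize(text)
    @tokens = text.split
  end

  def next
    @tokens.shift
  end
end

# Debug helper in the spirit of test_token_stream: reads to the end of
# the stream and returns what it saw.
def drain(stream)
  seen = []
  while (tkn = stream.next)
    seen << tkn
  end
  seen
end

# The problem case: the same stream object is drained for debugging and
# then returned to the caller, who finds it already exhausted.
def token_stream_consumed(str)
  ts = SketchTokenStream.new(str)
  drain(ts)
  ts
end

# One possible workaround: inspect a throwaway stream and return a
# fresh one built from the same input.
def token_stream_fresh(str)
  drain(SketchTokenStream.new(str))
  SketchTokenStream.new(str)
end

# Here drain plays the role of the indexer reading the returned stream.
consumed = drain(token_stream_consumed("the quick fox"))
fresh = drain(token_stream_fresh("the quick fox"))
puts "consumed: #{consumed.inspect}"
puts "fresh:    #{fresh.inspect}"
```

If the real tokenizer behaves the same way, building a second tokenizer
from the same string just for the debug print should leave the returned
stream untouched.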

Cheers,
Jens

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database