Ruby Forum Ferret > Custom Tokenizer not working

Posted by S D (Guest)
on 23.04.2008 18:25
(Received via mailing list)
First, thanks to Jens K. for pointing out a stupid error on my part
regarding the use of test_token_stream().

My current problem: a custom tokenizer I've written in Ruby does not
properly create an index (or at least searches on the index don't work).
Using test_token_stream() I have verified that my tokenizer creates the
token stream properly; each Token's attributes are set as expected.
Nevertheless, simple searches return zero results.

The essence of my tokenizer is to skip past XML tags in a file and break
the text between them into tokens. I use this approach rather than an
Hpricot-based one because I need to keep track of where the text sits
relative to the XML tags: after a search for a phrase I'll want to
extract the nearby XML tags, as they contain important context. My
tokenizer (XMLTokenizer) contains the obligatory initialize, next and
text methods (shown below) as well as a lot of parsing methods. These are
called at the top level by XMLTokenizer.get_next_token, which does the
main work within next. I haven't included the details of get_next_token
since I'm assuming that if each token it produces has the proper
attributes, it shouldn't be the cause of the problem. What more should I
be looking for? I've also been looking for a custom tokenizer written in
Ruby to model mine after; any suggestions?

    def initialize(xmlText)
      @xmlText = xmlText.gsub(/[;,!]/, ' ')
      @currPtr = 0
      @currWordStart = nil
      @currTextStart = 0
      @nextTagStart = 0
      @startOfTextRegion = 0

      @currTextStart = \
        XMLTokenizer.skip_beyond_current_tag(@currPtr, @xmlText)
      @nextTagStart = \
        XMLTokenizer.skip_beyond_current_text(@currTextStart, @xmlText)
      @currPtr = @currTextStart
      @startOfTextRegion = 1
    end

    def next
      tkn = get_next_token
      if tkn != nil
        puts "%5d |%4d |%5d   | %s" %
          [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
      end
      return tkn
    end

    def text=(text)
      initialize(text)
      @xmlText
    end
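
For anyone who wants to reproduce the idea without my parsing code, here
is a minimal sketch of the same approach using only Ruby's standard
library. It is not my XMLTokenizer and it does not depend on Ferret:
SketchToken and SketchXMLTokenizer are hypothetical names, and the Struct
merely stands in for a token with the four attributes printed above
(text, start, end, pos_inc). It assumes ASCII input, so byte offsets and
character offsets coincide.

```ruby
require 'strscan'

# Stand-in for a token object (assumption: only the four attributes
# printed in the post are needed).
SketchToken = Struct.new(:text, :start, :end, :pos_inc)

# Hypothetical tokenizer: skips <...> tags and returns the words between
# them, with offsets that refer to the original, unmodified string.
class SketchXMLTokenizer
  def initialize(xml_text)
    self.text = xml_text
  end

  # Resets the tokenizer so it can be reused on a new string.
  def text=(xml_text)
    @scanner = StringScanner.new(xml_text)
  end

  def text
    @scanner.string
  end

  # Returns the next token, or nil once the input is exhausted.
  def next
    # Skip any run of tags, whitespace and the punctuation [;,!].
    @scanner.skip(/(?:<[^>]*>|[\s;,!])+/)
    word = @scanner.scan(/[^\s<;,!]+/)
    return nil if word.nil?

    finish = @scanner.pos
    SketchToken.new(word, finish - word.bytesize, finish, 1)
  end
end

tokenizer = SketchXMLTokenizer.new("<p>Hello, world</p>")
while (tkn = tokenizer.next)
  puts "%5d |%4d |%5d   | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
end
```

Because the tags are skipped rather than deleted, each token's start and
end still point into the original XML, which is what makes it possible to
look up the surrounding tags after a search hit.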

Below is text from a previous, related message, which shows that stop
filtering is not working either:

> I've written a tokenizer/analyzer that parses a file extracting tokens,
> and I operate this analyzer/tokenizer on ASCII data consisting of XML
> files (the tokenizer skips over XML elements but maintains relative
> positioning). I've written many unit tests to check the produced token
> stream and was confident that the tokenizer was working properly. Then
> I noticed two problems:
>
>    1. StopFilter (using English stop words) does not properly filter
>    the token stream output from my tokenizer. If I explicitly pass an
>    array of stop words to the stop filter it still doesn't work. If I
>    simply switch my tokenizer to a StandardTokenizer the stop words are
>    appropriately filtered (of course the XML tags are treated
>    differently).
>
>    2. When I try a simple search no results come up. I can see that my
>    tokenizer is adding files to the index but a simple search (using
>    Ferret::Index::Index.search_each) produces no results.


Any suggestions are appreciated.

John
Posted by Jens Krämer (jkraemer)
on 23.04.2008 19:09
(Received via mailing list)
Hi!

On Wed, Apr 23, 2008 at 12:18:12PM -0400, S D wrote:
[..]
> My current problem, a custom tokenizer I've written in Ruby does not
> properly create an index (or at least searches on the index don't work).
> Using test_token_stream() I have verified that my tokenizer properly creates
> the token_stream; certainly each Token's attributes are set properly.
> Nevertheless, simple searches return zero results.

Could you have a look at your index with the ferret_browser utility? It
allows you to check what exactly has been indexed, which may lead you to
the root of your problem.

What does your analyzer, where you use the Tokenizer, look like? Is your
next() method below being called and working correctly when test driving
your analyzer, e.g. in irb?
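
To make the irb check concrete, here is a rough sketch of a helper that
drains any object responding to next (returning nil at the end) and
prints each token. The names drain_tokens, DemoToken and ArrayStream are
made up for this example, and the array-backed stream only stands in for
whatever your analyzer's token_stream returns; nothing here calls Ferret
itself.

```ruby
# Stand-in token with the attributes printed in the original post.
DemoToken = Struct.new(:text, :start, :end, :pos_inc)

# Hypothetical stream backed by an array, used here in place of a real
# analyzer's token stream.
class ArrayStream
  def initialize(tokens)
    @tokens = tokens.dup
  end

  # Returns the next token, or nil when none are left.
  def next
    @tokens.shift
  end
end

# Pulls tokens until the stream returns nil. The limit guards against a
# tokenizer that never signals the end of its input.
def drain_tokens(stream, limit = 10_000)
  tokens = []
  while (tkn = stream.next)
    tokens << tkn
    raise "stream did not terminate after #{limit} tokens" if tokens.size > limit
  end
  tokens
end

stream = ArrayStream.new([
  DemoToken.new("Hello", 3, 8, 1),
  DemoToken.new("world", 10, 15, 1)
])

drain_tokens(stream).each do |tkn|
  puts "%5d |%4d |%5d   | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
end
```

In irb you would pass the stream produced by your own analyzer instead of
ArrayStream. If nothing is printed there, the problem is in how the
analyzer wires up the tokenizer rather than in the tokens themselves.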

Cheers,
Jens

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database