Custom Tokenizer not working

[Unfortunately I received my messages as a batched digest; hence I'm
forced to respond in a new thread. I've asked the administrator to
change my configuration so that I receive each message on this list
individually. Sorry for any inconvenience.]

Thanks for the response below. Here is XMLAnalyzer (currently I'm not
using the stop or lower-case filter):

class XMLAnalyzer < Ferret::Analysis::Analyzer
  def initialize(synonym_engine = nil,
                 stop_words = FULL_ENGLISH_STOP_WORDS,
                 lower = true)
    @synonym_engine = synonym_engine
    @lower = lower
    @stop_words = stop_words
  end

  def token_stream(field, str)
    # ts = XMLTokenizer.new(str)
    ts = StandardTokenizer.new(str)
    # test_token_stream(ts)
    return ts
  end
end
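For reference, the chain that the commented-out filters would add can be
sketched in plain Ruby. This is an illustrative toy, not Ferret's actual
API: every class name here is made up, and each stage just follows the
same contract a Ferret token stream does (respond to next(), return the
next token or nil).

```ruby
# Illustrative pipeline: tokenizer -> lower-case filter -> stop filter.
# Each stage responds to next(), returning the next token or nil.
class ArrayTokenStream
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def next
    @tokens.shift
  end
end

class LowerCaseFilterSketch
  def initialize(stream)
    @stream = stream
  end

  def next
    t = @stream.next
    t && t.downcase
  end
end

class StopFilterSketch
  def initialize(stream, stop_words)
    @stream = stream
    @stop_words = stop_words
  end

  def next
    # Pull tokens until one survives the stop list (or the stream ends).
    while (t = @stream.next)
      return t unless @stop_words.include?(t)
    end
    nil
  end
end

stream = StopFilterSketch.new(
  LowerCaseFilterSketch.new(ArrayTokenStream.new(%w[The Quick Brown Fox])),
  %w[the a an and or]
)
filtered = []
while (t = stream.next)
  filtered << t
end
# filtered == ["quick", "brown", "fox"]
```

The point of the sketch: a stop filter can only match tokens after they
have been lower-cased, so the order of wrapping matters.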

I just tried running ferret-browser by pointing to an index created with
StandardTokenizer and got the error below in Firefox. Is there any
configuration that is necessary? Presumably the defaults should work.

John

Internal Server Error No such file or directory -
/usr/local/lib/site_ruby/1.8/ferret/browser/views/error/index.rhtml

WEBrick/1.3.1 (Ruby/1.8.6/2007-06-07) at 127.0.0.1:3301

Hi!

On Wed, Apr 23, 2008 at 12:18:12PM -0400, S D wrote:
[…]

> My current problem: a custom tokenizer I've written in Ruby does not
> properly create an index (or at least searches on the index don't
> work). Using test_token_stream() I have verified that my tokenizer
> properly creates the token_stream; certainly each Token's attributes
> are set properly. Nevertheless, simple searches return zero results.

Could you have a look at your index with the ferret_browser utility? It
lets you check what exactly has been indexed, which may lead you to the
root of your problem.

What does your analyzer, where you use the Tokenizer, look like? And is
your next() method below being called and working correctly when you
test-drive your analyzer, e.g. in irb?

Cheers,
Jens
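The same test drive can be scripted outside irb by pumping next() until
it returns nil. A plain-Ruby sketch, with a stub stream standing in for
the real tokenizer (both names here are hypothetical):

```ruby
# Stub standing in for a tokenizer; follows the token-stream contract:
# each call to next() yields one token, then nil when exhausted.
class StubTokenStream
  def initialize(words)
    @words = words.dup
  end

  def next
    @words.shift
  end
end

# Drain anything that responds to next(), collecting its tokens.
def drain(stream)
  tokens = []
  while (t = stream.next)
    tokens << t
  end
  tokens
end

tokens = drain(StubTokenStream.new(%w[skip beyond xml tags]))
# tokens == ["skip", "beyond", "xml", "tags"]
```

If drain() on the real tokenizer returns the expected tokens but
searches still come up empty, the problem is more likely in how the
analyzer is wired into the index than in the tokenizer itself.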

> The essence of my tokenizer is to skip beyond XML tags in a file and
> break up and return text components as tokens. I use this approach as
> opposed to an Hpricot approach because I need to keep track of the
> location of the text with respect to XML tags, since after a search
> for a phrase I'll want to extract the nearby XML tags as they contain
> important context. My tokenizer (XMLTokenizer) contains the obligatory
> initialize, next and text methods (shown below) as well as a lot of
> parsing methods that are called at the top level by the method
> XMLTokenizer.get_next_token, which is the primary action within next.
> I didn't add the details of get_next_token as I'm assuming that if
> each token produced by get_next_token has the proper attributes then
> it shouldn't be the cause of the problem. What more should I be
> looking for? I've been looking for a custom tokenizer written in Ruby
> to model after; any suggestions?
>
> def initialize(xmlText)
>   @xmlText = xmlText.gsub(/[;,!]/, ' ')
>   @currPtr = 0
>   @currWordStart = nil
>   @currTextStart = 0
>   @nextTagStart = 0
>   @startOfTextRegion = 0
>
>   @currTextStart =
>     XMLTokenizer.skip_beyond_current_tag(@currPtr, @xmlText)
>   @nextTagStart =
>     XMLTokenizer.skip_beyond_current_text(@currTextStart, @xmlText)
>   @currPtr = @currTextStart
>   @startOfTextRegion = 1
> end
>
> def next
>   tkn = get_next_token
>   if tkn != nil
>     puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
>   end
>   return tkn
> end
>
> def text=(text)
>   initialize(text)
>   @xmlText
> end
>
> Below is text from a previous, related message that shows that
> StopFilter is not working:
>
> > I've written a tokenizer/analyzer that parses a file extracting
> > tokens, and operate this analyzer/tokenizer on ASCII data consisting
> > of XML files (the tokenizer skips over XML elements but maintains
> > relative positioning). I've written many unit tests to check the
> > produced token stream and was confident that the tokenizer was
> > working properly. Then I noticed two problems:
> >
> > 1. StopFilter (using English stop words) does not properly filter
> >    the token stream output from my tokenizer. If I explicitly pass
> >    an array of stop words to the stop filter it still doesn't work.
> >    If I simply switch my tokenizer to a StandardTokenizer the stop
> >    words are appropriately filtered (of course the XML tags are
> >    treated differently).
> >
> > 2. When I try a simple search no results come up. I can see that my
> >    tokenizer is adding files to the index but a simple search (using
> >    Ferret::Index::Index.search_each) produces no results.
>
> Any suggestions are appreciated.
>
> John

> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
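The tag-skipping idea described above can be sketched in a few lines of
plain Ruby. This is a toy stand-in, not the actual XMLTokenizer (all
names are made up): it skips past <...> tags and emits word tokens with
their absolute offsets, which is what later makes it possible to recover
the XML tags near a search hit.

```ruby
# Toy token with absolute character offsets into the source string.
SketchToken = Struct.new(:text, :start, :end)

class TagSkippingTokenizer
  def initialize(xml_text)
    @text = xml_text
    @pos = 0
  end

  # Returns the next word token outside any tag, or nil when done.
  def next_token
    while @pos < @text.length
      if @text[@pos] == '<'
        # Skip beyond the current tag.
        close = @text.index('>', @pos)
        return nil unless close
        @pos = close + 1
      elsif @text[@pos] =~ /\w/
        # Collect a word and record where it sits in the raw input.
        word_end = @pos
        word_end += 1 while word_end < @text.length && @text[word_end] =~ /\w/
        tok = SketchToken.new(@text[@pos...word_end], @pos, word_end)
        @pos = word_end
        return tok
      else
        @pos += 1 # whitespace/punctuation between tags and words
      end
    end
    nil
  end
end

tz = TagSkippingTokenizer.new("<p>Hello <b>XML</b> world</p>")
tokens = []
while (t = tz.next_token)
  tokens << t
end
tokens.map(&:text) # => ["Hello", "XML", "world"]
```

Because the offsets are into the raw XML (tags included), a later lookup
can scan backwards from a token's start to find its enclosing tag.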

Hi!

On Wed, Apr 23, 2008 at 01:59:32PM -0400, S D wrote:

> [unfortunately I received my messages as a batched digest… hence, I'm
> forced to respond in a new thread. I've requested the administrator
> to change my config to receive each message on this list. Sorry for
> any inconvenience]
>
> Thanks for the response below. Here is XMLAnalyzer (currently I'm not
> using the stop or lower case filter):
>
> class XMLAnalyzer < Ferret::Analysis::Analyzer

Could you try whether not inheriting from Ferret's Analyzer changes
anything? At least I usually don't do that in my analyzers.

> […]
>
> I just tried running ferret-browser by pointing to an index created
> with StandardTokenizer and got the error below in Firefox. Is there
> any configuration that is necessary? Presumably the defaults should
> work.
> […]
>
> Internal Server Error No such file or directory -
> /usr/local/lib/site_ruby/1.8/ferret/browser/views/error/index.rhtml
>
> WEBrick/1.3.1 (Ruby/1.8.6/2007-06-07) at 127.0.0.1:3301

This works just fine here (Ferret 0.11.6 / Ubuntu); I just tried it
out. The location in the error message looks a bit strange to me,
though. How did you install Ferret?

Cheers,
Jens


Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
[email protected] | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold