Forum: Ferret testing tokenizers

Onur Turgay (Guest)
on 2006-04-12 14:19
(Received via mailing list)
Hi,
Is there a way to test tokenizers? I mean, I want to give one an input
stream and see the tokens it outputs.

Also, is there a way to see an indexed document's index tokens? That is,
which words in the document are used to index it?

Thanks in advance
Onur
David Balmain (Guest)
on 2006-04-18 03:59
(Received via mailing list)
Hey Onur, just got back from a trip around Japan. You've probably
already worked out the answer to this question, but here is how I test
tokenizers:

    require 'ferret'
    $stdin.each do |line|
      stk = Ferret::Analysis::StandardTokenizer.new(line)
      while tk = stk.next()
        puts "    <#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
      end
    end

And I run it like this:

    ruby -r rubygems tz_tester.rb < file_to_tokenize.txt

You can just change the tokenizer to whatever tokenizer you want to
test.
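
To make the swap explicit, the loop above can be wrapped in a helper that takes the tokenizer class as an argument. This is only a rough sketch: `SimpleWordTokenizer`, `Token` and `tokens_for` are hypothetical names, not part of Ferret. The stand-in class just mimics the interface the script above relies on (`new(line)`, `next`, and tokens with `text`, `start_offset` and `end_offset`) so the example runs without the gem installed.

```ruby
require 'strscan'

# Hypothetical stand-in, NOT part of Ferret. It only mimics the interface
# used by the script above: next, text, start_offset, end_offset.
Token = Struct.new(:text, :start_offset, :end_offset)

class SimpleWordTokenizer
  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  # Returns the next token, or nil once the input is exhausted.
  def next
    return nil unless @scanner.scan_until(/\w+/)
    text = @scanner.matched
    # StringScanner#pos is a byte position, so these are byte offsets.
    Token.new(text, @scanner.pos - text.bytesize, @scanner.pos)
  end
end

# Collects every token that the given tokenizer class produces for one line.
# Any class with the same new(line)/next interface can be passed in.
def tokens_for(tokenizer_class, line)
  stream = tokenizer_class.new(line)
  tokens = []
  while (tk = stream.next)
    tokens << tk
  end
  tokens
end

tokens_for(SimpleWordTokenizer, "Hello, Ferret world").each do |tk|
  puts "    <#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
end
```

Assuming a Ferret tokenizer exposes the same interface as in the script above, passing `Ferret::Analysis::StandardTokenizer` in place of the stand-in should behave the same way, and the returned array can be checked in a unit test instead of being read off the terminal.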

Hope that helps,
Dave