Testing tokenizers

Onur_T · April 12, 2006, 2:19pm

Hi,
is there a way to test tokenizers? I mean, I want to give one an input
stream and see the output tokens.

AND is there a way to see an indexed document’s index tokens? Which
words in the document are used to index this document?

Thanks in advance
Onur

Onur_T · April 18, 2006, 3:59am

Hey Onur, just got back from a trip around Japan. You’ve probably
already worked out the answer to this question, but here is how I test
tokenizers:

require 'ferret'
# Tokenize each line of standard input and print every token with its offsets.
$stdin.each do |line|
  stk = Ferret::Analysis::StandardTokenizer.new(line)
  while tk = stk.next()
    puts "    <#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
  end
end

And I run it like this:

ruby -r rubygems tz_tester.rb < file_to_tokenize.txt

You can just change the tokenizer to whatever tokenizer you want to
test.
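
To make swapping tokenizers a one-word change, the loop can be pulled into a
small helper that takes the tokenizer class as an argument. The following is
only a rough sketch: dump_tokens, StandInToken and SimpleWhitespaceTokenizer
are made-up names for illustration, not part of Ferret. The stand-in
tokenizer exists only so the example runs without Ferret installed; it copies
the new(string) / next interface and the text, start_offset and end_offset
accessors used in the script above.

```ruby
require 'strscan'

# Hypothetical stand-in token with the three fields the test loop reads.
StandInToken = Struct.new(:text, :start_offset, :end_offset)

# Hypothetical stand-in tokenizer (not a Ferret class): splits on whitespace
# and exposes the same new(string) / next interface as the loop above expects.
class SimpleWhitespaceTokenizer
  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  # Returns the next token, or nil once the input is exhausted.
  # Offsets are byte positions as reported by StringScanner#pos.
  def next
    @scanner.skip(/\s+/)
    return nil if @scanner.eos?
    start = @scanner.pos
    text = @scanner.scan(/\S+/)
    StandInToken.new(text, start, @scanner.pos)
  end
end

# Runs any tokenizer class with that interface over one line of text and
# returns one formatted string per token.
def dump_tokens(tokenizer_class, line)
  stk = tokenizer_class.new(line)
  out = []
  while (tk = stk.next)
    out << "<#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
  end
  out
end

puts dump_tokens(SimpleWhitespaceTokenizer, "hello big world")
# prints:
# <hello> from 0 to 5
# <big> from 6 to 9
# <world> from 10 to 15
```

Passing Ferret::Analysis::StandardTokenizer in place of the stand-in class
should behave the same way, assuming its tokens expose the accessors used in
the script above.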

Hope that helps,
Dave