Testing tokenizers

Hi,
Is there a way to test tokenizers? I mean, I want to give one an input
stream and see the tokens it outputs.

Also, is there a way to see the index tokens of an indexed document?
That is, which words in the document are used to index it?

Thanks in advance
Onur

Hey Onur, just got back from a trip around Japan. You’ve probably
already worked out the answer to this question but here is how I test
tokenizers:

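require 'ferret'

# Tokenize each line read from standard input and print every token
# together with its start and end offsets.
$stdin.each do |line|
  stk = Ferret::Analysis::StandardTokenizer.new(line)
  while (tk = stk.next)
    puts "    <#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
  end
end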

And I run it like this:

ruby -r rubygems tz_tester.rb < file_to_tokenize.txt

You can just change the tokenizer to whatever tokenizer you want to
test.
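
To try the loop without Ferret installed, here is a minimal sketch that runs it against a stand-in tokenizer written in plain Ruby. The names StandInToken, WhitespaceStandIn and dump_tokens are hypothetical and not part of Ferret. The sketch only assumes that a tokenizer responds to next and returns nil when the input is exhausted, as in the script above.

```ruby
require 'strscan'

# Hypothetical stand-in token; NOT a Ferret class. It only mirrors the
# three accessors used by the tester script above.
StandInToken = Struct.new(:text, :start_offset, :end_offset)

# Hypothetical stand-in tokenizer that splits on whitespace.
# Offsets are byte positions within the input string.
class WhitespaceStandIn
  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  # Returns the next token, or nil once the input is exhausted.
  def next
    @scanner.skip(/\s+/)
    return nil if @scanner.eos?
    start = @scanner.pos
    text = @scanner.scan(/\S+/)
    StandInToken.new(text, start, @scanner.pos)
  end
end

# Drains any object that responds to #next and returns one formatted
# line per token.
def dump_tokens(tokenizer)
  lines = []
  while (tk = tokenizer.next)
    lines << "<#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
  end
  lines
end

puts dump_tokens(WhitespaceStandIn.new("quick brown fox"))
# prints:
# <quick> from 0 to 5
# <brown> from 6 to 11
# <fox> from 12 to 15
```

Because dump_tokens only relies on next, passing it a different tokenizer object is the only change needed to test another one, provided its tokens expose the same accessors.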

Hope that helps,
Dave

