Testing tokenizers

Is there a way to test tokenizers? I mean, I want to feed in an input
stream and see the tokens that come out.

Also, is there a way to see the index tokens of an indexed document,
i.e. which words in the document are used to index it?

Thanks in advance

Hey Onur, just got back from a trip around Japan. You’ve probably
already worked out the answer to this question, but here is how I test
tokenizers:

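require 'ferret'

# Tokenize each line read from stdin and print every token with its offsets.
$stdin.each do |line|
  stk = Ferret::Analysis::StandardTokenizer.new(line)
  while tk = stk.next()
    puts "    <#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
  end
end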


And I run it like this:

ruby -r rubygems tz_tester.rb < file_to_tokenize.txt

You can just change the tokenizer to whatever tokenizer you want to
test.
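
To make swapping tokenizers easier, the printing loop can be pulled into a helper that accepts any object with a `next` method. This is only a rough sketch: `dump_tokens`, `DemoToken` and `DemoWhitespaceTokenizer` are hypothetical names and not part of Ferret. The stand-in tokenizer is there so the example runs without the gem, and the token accessors (`text`, `start_offset`, and `end_offset` by analogy) are assumed to match the snippet above; they may differ between Ferret versions.

```ruby
require 'strscan'

# Hypothetical stand-in token; accessor names are assumed to mirror
# the tokens used in the snippet above.
DemoToken = Struct.new(:text, :start_offset, :end_offset)

# Hypothetical stand-in tokenizer (NOT part of Ferret). It splits on
# whitespace and follows the same protocol: `next` returns a token,
# or nil once the input is exhausted. Offsets are byte positions.
class DemoWhitespaceTokenizer
  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  def next
    @scanner.skip(/\s+/)          # skip leading whitespace, if any
    start = @scanner.pos
    text = @scanner.scan(/\S+/)   # nil when nothing is left
    text && DemoToken.new(text, start, @scanner.pos)
  end
end

# Drains any tokenizer-like object and returns one formatted line per token.
def dump_tokens(tokenizer)
  lines = []
  while (tk = tokenizer.next)
    lines << "    <#{tk.text}> from #{tk.start_offset} to #{tk.end_offset}"
  end
  lines
end

puts dump_tokens(DemoWhitespaceTokenizer.new("Hello tokenizer world"))
```

With Ferret installed, the idea would be to pass a real tokenizer instance built from the line instead of the stand-in, so only the argument to `dump_tokens` changes when you try a different tokenizer.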

Hope that helps,
