I’ve written a tokenizer/analyzer that parses a file, extracting tokens, and I
operate this analyzer/tokenizer on ASCII data consisting of XML files (the
tokenizer skips over XML elements but maintains relative positioning). I’ve
written many unit tests to check the produced token stream and was
confident that the tokenizer was working properly. Then I noticed two
problems:
- StopFilter (using English stop words) does not properly filter the
token stream output from my tokenizer. If I explicitly pass an array of stop
words to the stop filter it still doesn’t work. If I simply switch my
tokenizer to a StandardTokenizer the stop words are appropriately filtered
(of course the XML tags are treated differently).
- When I try a simple search no results come up. I can see that my
tokenizer is adding files to the index but a simple search (using
Ferret::Index::Index.search_each) produces no results.
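To make "skips over XML elements but maintains relative positioning" concrete, here is a minimal plain-Ruby sketch of that idea. It is not my actual XMLTokenizer and does not use Ferret at all: SketchToken and sketch_xml_tokens are hypothetical names, tags are assumed to be anything between '<' and '>', and tokens are assumed to be runs of word characters in ASCII input.

```ruby
require 'strscan'

# Hypothetical token record for this sketch; Ferret's Token class is not used.
SketchToken = Struct.new(:text, :start, :end)

# Returns word tokens from str, skipping anything between '<' and '>'.
# Offsets refer to positions in the original string, tags included.
# Assumes ASCII input, so byte offsets and character offsets coincide.
def sketch_xml_tokens(str)
  scanner = StringScanner.new(str)
  tokens = []
  until scanner.eos?
    if scanner.skip(/<[^>]*>/)
      next # tag skipped; the scanner position still advances past it
    elsif (word = scanner.scan(/\w+/))
      tokens << SketchToken.new(word.downcase, scanner.pos - word.length, scanner.pos)
    else
      scanner.getch # whitespace, punctuation, or a stray '<'
    end
  end
  tokens
end

tokens = sketch_xml_tokens("<doc><p>The quick fox</p></doc>")
tokens.each do |t|
  puts "%5d |%4d | %s" % [t.start, t.end, t.text]
end
```

The start and end values count the skipped tags, so each token can still be located in the original file.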
I’m now trying to track down the above problem, which seems to have led to
another (though possibly related) problem for which I am seeking an answer.
Below is the token_stream() method of my analyzer (XMLAnalyzer). Note that
I’ve commented out my custom tokenizer (XMLTokenizer) so that the
StandardTokenizer is being used within my custom analyzer.
def token_stream(field, str)
  # ts = XMLTokenizer.new(str)
  ts = StandardTokenizer.new(str)
  # test_token_stream(ts)
end
In the above I’ve commented out the test_token_stream() method taken from
Balmain’s Ferret book (O’Reilly, pg 68) that simply prints out the tokens
contained within a stream; i.e.:
def test_token_stream(token_stream)
  puts "\033[32mStart | End | PosInc | Text\033[m"
  while tkn = token_stream.next
    puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
  end
end
If I keep test_token_stream() commented out then the indexing and search
work fine (using StandardTokenizer). However, if I do not comment out
test_token_stream() then creating the index appears to work fine but a
search produces no results. I haven’t been able to track this down but
thought it might be related to the problems I was having with my custom tokenizer.
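One thing that may be worth ruling out: if the stream returned by token_stream() can only be read once, then printing it inside the analyzer would leave nothing for the indexer to consume. I have not confirmed that this is what happens in Ferret here. The sketch below is plain Ruby with hypothetical names (OnePassStream, drain_and_print, collect_tokens), not Ferret's API; it only models a stream whose next returns each token once and then nil.

```ruby
# Hypothetical stand-in for a token stream: next returns each token once,
# then nil. This is NOT Ferret's class, only a model of single-pass behaviour.
class OnePassStream
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def next
    @tokens.shift
  end
end

# Debug printer in the style of test_token_stream: reads until next returns nil.
def drain_and_print(stream)
  while (tkn = stream.next)
    puts tkn
  end
end

# Stand-in for an indexer: collects whatever tokens are still available.
def collect_tokens(stream)
  collected = []
  while (tkn = stream.next)
    collected << tkn
  end
  collected
end

untouched = OnePassStream.new(%w[quick brown fox])
printed   = OnePassStream.new(%w[quick brown fox])
drain_and_print(printed) # the debug print consumes every token

tokens_without_debug = collect_tokens(untouched)
tokens_after_debug   = collect_tokens(printed)
puts "without debug print: #{tokens_without_debug.length} tokens"
puts "after debug print:   #{tokens_after_debug.length} tokens"
```

If that is the cause, the index would still be written without errors but would contain no terms, which would match a search that returns nothing.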
Note that I create my index with the Ferret::Index::Index class:
index = Index::Index.new(:analyzer => XMLAnalyzer.new(),
                         :create_if_missing => true)
and I perform searches using Ferret::Search::Searcher.
Any thoughts would be appreciated.