First, thanks to Jens K. for pointing out a stupid error on my part in
the use of test_token_stream().
My current problem: a custom tokenizer I’ve written in Ruby does not
properly create an index (or at least searches on the index don’t work).
Using test_token_stream() I have verified that my tokenizer properly
builds the token stream; certainly each Token’s attributes are set
properly. Nevertheless, simple searches return zero results.
The essence of my tokenizer is to skip beyond XML tags in a file and pick
up and return the text components as tokens. I use this approach, as opposed
to an Hpricot approach, because I need to keep track of the location of the
text with respect to the XML tags: after a search for a phrase I’ll want to
extract the nearby XML tags, as they contain important context. My
tokenizer (XMLTokenizer) contains the obligatory initialize, next and text=
methods (shown below) as well as a lot of parsing methods that are called at
the top level by the method XMLTokenizer.get_next_token, which does the
primary work within next. I didn’t add the details of get_next_token as I’m
assuming that if each token produced by get_next_token has the proper
attributes then it shouldn’t be the cause of the problem. What more should I
be looking at? I’ve been looking for a custom tokenizer written in Ruby to
model mine after, but haven’t found one.
def initialize(xmlText)
  @xmlText = xmlText.gsub(/[;,!]/, ' ')
  @currPtr = 0
  @currWordStart = nil
  @currTextStart = 0
  @nextTagStart = 0
  @startOfTextRegion = 0
  @currTextStart = \
    XMLTokenizer.skip_beyond_current_tag(@currPtr, @xmlText)
  @nextTagStart = \
    XMLTokenizer.skip_beyond_current_text(@currTextStart, @xmlText)
  @currPtr = @currTextStart
  @startOfTextRegion = 1
end

def next
  tkn = get_next_token
  if tkn != nil
    puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
  end
  tkn
end
def text=(text)
  initialize(text)
  @xmlText
end
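To make clearer what I mean by “skip beyond XML tags and return text components as tokens, keeping offsets”, here is a stripped-down, self-contained sketch of the idea in plain Ruby. The Token struct is just a stand-in for Ferret::Analysis::Token (same start/end/pos_inc attributes), and the scanning loop is far simpler than my real get_next_token:

```ruby
# Stand-in for Ferret::Analysis::Token: term text, offsets, position increment.
Token = Struct.new(:text, :start, :end, :pos_inc)

# Return one Token per word of character data, skipping over <...> tags
# entirely while keeping offsets relative to the ORIGINAL string, so the
# surrounding XML context can be recovered after a search.
def tokenize_xml_text(xml)
  tokens = []
  i = 0
  while i < xml.length
    if xml[i] == '<'                          # skip beyond the current tag
      close = xml.index('>', i)
      break if close.nil?                     # unterminated tag: stop scanning
      i = close + 1
    elsif xml[i] =~ /\s/                      # skip whitespace between words
      i += 1
    else                                      # a run of word characters
      start = i
      i += 1 while i < xml.length && xml[i] !~ /[\s<]/
      tokens << Token.new(xml[start...i], start, i, 1)
    end
  end
  tokens
end
```

For example, tokenize_xml_text("<p>Hello <b>world</b></p>") yields "Hello" (offsets 3..8) and "world" (offsets 12..17), with the tags still recoverable from the original string at those positions.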
Below is text from a previous, related message describing what
is not working:
> I’ve written a tokenizer/analyzer that parses a file, extracting tokens. I
> operate this analyzer/tokenizer on ASCII data consisting of XML; the
> tokenizer skips over XML elements but maintains relative offsets. I’ve
> written many unit tests to check the produced token stream and was
> confident that the tokenizer was working properly. Then I noticed:
>
> 1. StopFilter (using English stop words) does not properly filter the
> token stream output from my tokenizer. If I explicitly pass an array of stop
> words to the stop filter it still doesn’t work. If I simply switch my
> tokenizer to a StandardTokenizer the stop words are filtered correctly
> (of course the XML tags are treated differently).
>
> 2. When I try a simple search no results come up. I can see that my
> tokenizer is adding files to the index but a simple search (using
> Ferret::Index::Index.search_each) produces no results.
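One thing I’ve started to suspect while debugging both symptoms is case: the English stop list is all lowercase, and I believe StandardAnalyzer lowercases tokens before its StopFilter runs, whereas my tokenizer emits tokens as-is. A tiny plain-Ruby illustration of the mismatch (the stop list and filters here are hand-rolled stand-ins, not Ferret’s actual classes):

```ruby
# Hand-rolled stand-ins for a stop filter and a lower-case filter, to show
# why both stop-word removal and query matching are case-sensitive at the
# term level.
STOP_WORDS = %w[the a an and of]           # stop lists are lowercase

def stop_filter(tokens)
  tokens.reject { |t| STOP_WORDS.include?(t) }
end

def lower_case_filter(tokens)
  tokens.map(&:downcase)
end

raw = %w[The Quick Fox]

# Without lowercasing first, "The" is NOT recognized as a stop word...
stop_filter(raw)                           # => ["The", "Quick", "Fox"]

# ...and a lowercased query term cannot match the mixed-case indexed term.
raw.include?("quick")                      # => false

# Lowercasing before the stop filter fixes both problems.
stop_filter(lower_case_filter(raw))        # => ["quick", "fox"]
```

If that is the cause, it would explain why StandardTokenizer-based analysis filters stop words fine while my token stream does not, and why queries return zero results even though documents are being indexed.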
Any suggestions are appreciated.