Extending/Modifying QueryParser


#1

Hi,

I’ve implemented synonym searching in my Rails application, and I have
an idea I’d like to implement but can’t figure out how. I’d like to
give the end user the choice of whether to search for a word’s
synonyms or not, preferably by extending the query language to parse
a construct similar to ‘%word1’ and then turning that word into an OR
list (i.e., word1|word2|word3|…).

Currently, the query parser constantly calls SynonymTokenFilter to
get synonyms for each token. Is there a way I can go about achieving
this functionality?

Here’s an overview of what I’ve done so far:

My model classes in my rails app use acts_as_ferret with a call that
looks like:

acts_as_ferret(
  :fields => [:body],
  :store_class_name => true,
  :ferret => {
    :or_default => false,
    :analyzer => SynonymAnalyzer.new(WordnetSynonymEngine.new, [])
  }
)

I created a SynonymAnalyzer and SynonymTokenFilter:

class SynonymAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis

  def initialize(synonym_engine, stop_words = FULL_ENGLISH_STOP_WORDS, lower = true)
    @synonym_engine = synonym_engine
    @lower = lower
    @stop_words = stop_words
  end

  def token_stream(field, str)
    ts = StandardTokenizer.new(str)
    ts = LowerCaseFilter.new(ts) if @lower
    ts = StopFilter.new(ts, @stop_words)
    ts = SynonymTokenFilter.new(ts, @synonym_engine)
  end
end

class SynonymTokenFilter < Ferret::Analysis::TokenStream
  include Ferret::Analysis

  def initialize(token_stream, synonym_engine)
    @token_stream = token_stream
    @synonym_stack = []
    @synonym_engine = synonym_engine
  end

  def text=(text)
    @token_stream.text = text
  end

  def next
    # Emit any synonyms queued for the previous token first.
    return @synonym_stack.pop if @synonym_stack.size > 0

    if token = @token_stream.next
      add_synonyms_to_stack(token)
    end

    return token
  end

  private

  def add_synonyms_to_stack(token)
    synonyms = @synonym_engine.get_synonyms(token.text)

    return if synonyms.nil?

    # A position increment of 0 puts each synonym at the same
    # position as the original token.
    synonyms.each do |s|
      @synonym_stack.push(Token.new(s, token.start, token.end, 0))
    end
  end
end

Finally, a WordnetSynonymEngine that queries the wordnet index I created:

class WordnetSynonymEngine
  include Ferret::Search

  def initialize(index_name = "wordnet")
    @searcher = Searcher.new("#{RAILS_ROOT}/index/#{ENV['RAILS_ENV']}/#{index_name}")
  end

  def get_synonyms(word)
    @searcher.search_each(TermQuery.new(:word, word)) do |doc_id, score|
      return @searcher[doc_id][:syn]
    end

    return nil
  end
end

It works great, except that I’d really like the ability to run tokens
through the SynonymTokenFilter only when they are prefixed with an
unescaped % sign.

Also, if anyone is interested, I can post the code for turning the
WordNet Prolog database into a Ferret index (mostly a port of the
Java Lucene program that does the same thing to Ruby and Ferret).

Thanks,
Curtis


#2

On Fri, Jul 06, 2007 at 11:18:09PM -0400, Mitchell Curtis Hatter wrote:

get synonyms for each token. Is there a way I can go about achieving
this functionality?

You have to extend Ferret’s Query Parser to achieve this. If you don’t
want to mess around with the grammar stuff the parser code is generated
from, you could also preprocess user queries to modify them accordingly
before giving them to the QueryParser. That can get complicated, too ;)

At the moment you’re doing the synonym expansion twice, once at indexing
time and once when queries are parsed. Because the synonyms are already
inserted into the index at indexing time, adding synonyms to queries is
not really needed any more.

So you don’t really want to specify your SynonymAnalyzer for aaf as the
analyzer to use for indexing and searching (aaf doesn’t support
different analyzers for indexing/searching because in general it’s a
good idea to use the same analyzer in both cases).

If you used plain Ferret and wanted synonyms everywhere or in a specific
field, but for ALL queries, you could use your Analyzer at indexing
time, but not for query parsing. In your case, using your WordnetEngine
in a customized QueryParser or a custom query preprocessor would be the
better way.

Here’s an overview of what I’ve done so far:

[…]

That’s really cool stuff, would you mind posting this to Ferret’s Wiki
so other people can more easily find it? If you included the
WordnetSynonymEngine that would be even better :)

Cheers,
Jens


Jens Krämer


#3

You have to extend Ferret’s Query Parser to achieve this. If you don’t
want to mess around with the grammar stuff the parser code is generated
from, you could also preprocess user queries to modify them accordingly
before giving them to the QueryParser. Can get complicated, too ;)

I do not enjoy writing parsers, and am not especially good at it. I
think first I’ll check out the grammar for the parser and see if I
can modify that. Perhaps creating a SynonymQuery class?

I did consider preprocessing user queries and then just grouping the
resulting OR’d query in parens: ‘rabbit %{ferret}’ would become
‘rabbit (ferret|“black-footed ferret”|etc|etc)’. I’m sure there are
situations where that would not work well, but it’s an option.
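
For what it’s worth, that preprocessing idea can be sketched in plain
Ruby. This is only a minimal sketch under a few assumptions, not Ferret
API: SynonymQueryPreprocessor and HashSynonymEngine are hypothetical
names, the engine is only assumed to respond to get_synonyms(word) the
way WordnetSynonymEngine does, and the | separator simply follows the
notation used in this thread.

```ruby
# Hypothetical helper, not part of Ferret or acts_as_ferret.
# Rewrites %word and %{a phrase} into a parenthesised list of alternatives
# before the query string is handed to the query parser.
class SynonymQueryPreprocessor
  # An unescaped % followed by either {some text} or a single word.
  MARKER = /(?<!\\)%(?:\{([^}]+)\}|(\w+))/

  def initialize(synonym_engine, separator = '|')
    @synonym_engine = synonym_engine
    @separator = separator
  end

  def expand(query)
    query.gsub(MARKER) do
      term = $1 || $2
      # Array() copes with nil, a single string, or a list of synonyms.
      synonyms = Array(@synonym_engine.get_synonyms(term))
      alternatives = ([term] + synonyms).uniq.map { |t| quote(t) }
      alternatives.size > 1 ? "(#{alternatives.join(@separator)})" : alternatives.first
    end
  end

  private

  # Multi-word alternatives are wrapped in double quotes so they stay together.
  def quote(term)
    term.match?(/\s/) ? %("#{term}") : term
  end
end

# Stand-in for WordnetSynonymEngine so the sketch runs without an index.
class HashSynonymEngine
  def initialize(table)
    @table = table
  end

  def get_synonyms(word)
    @table[word]
  end
end

engine = HashSynonymEngine.new('ferret' => ['black-footed ferret', 'polecat'])
preprocessor = SynonymQueryPreprocessor.new(engine)

puts preprocessor.expand('rabbit %{ferret}')
# prints: rabbit (ferret|"black-footed ferret"|polecat)
puts preprocessor.expand('rabbit \%ferret')
# prints: rabbit \%ferret   (escaped marker is left alone)
```

With something like this, the analyzer passed to acts_as_ferret would
no longer need the SynonymTokenFilter at query time; the expanded
string is what gets searched. Quoted phrases or field prefixes that
contain a % are not handled here, which is probably one of the
situations where plain text substitution falls short.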

So you don’t really want to specify your SynonymAnalyzer for aaf as the
analyzer to use for indexing and searching (aaf doesn’t support
different analyzers for indexing/searching because in general it’s a
good idea to use the same analyzer in both cases).

Thanks, I was looking at aaf wondering how I could specify a
different analyzer to use for searches. I didn’t find anything that
would really let me get hold of the QueryParser to change the
analyzer used. Glad I wasn’t just missing it.

If you used plain Ferret and wanted synonyms everywhere or in a specific
field, but for ALL queries, you could use your Analyzer at indexing
time, but not for query parsing. In your case, using your WordnetEngine
in a customized QueryParser or a custom query preprocessor would be the
better way.

Since this isn’t for anything but fun right now (at work I’m stuck
using Oracle’s full text engine, which has its own set of problems),
I’ll first try modifying the QueryParser grammar to account for a new
query type. My C is not very good, so hopefully I won’t have to do
much, but I like that solution better than having to write a
preprocessor for queries.

That’s really cool stuff, would you mind posting this to Ferret’s Wiki
so other people can more easily find it? If you included the
WordnetSynonymEngine that would be even better :)

Cheers,
Jens

Thanks, I’ve posted it to the Ferret wiki. It’s quite long, but I hope
that’s not a problem. I included the WordnetSynonymEngine and created
a YAMLSynonymEngine just to show how the engine can be pluggable.
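
To give a rough idea of what “pluggable” means here: any object that
responds to get_synonyms(word) can be handed to the SynonymAnalyzer.
The following is only an illustrative sketch of a YAML-backed engine,
not the version posted to the wiki; the YAML layout (a word mapped to a
list of synonyms) and the from_file helper are assumptions made for the
example.

```ruby
require 'yaml'

# Illustrative sketch only. Assumes a YAML document that maps each word
# to a list of synonyms, e.g.
#
#   ferret:
#     - polecat
#     - weasel
class YAMLSynonymEngine
  # Hypothetical convenience constructor that reads the YAML from disk.
  def self.from_file(path)
    new(File.read(path))
  end

  def initialize(yaml_source)
    # safe_load returns nil for an empty document, hence the fallback.
    @synonyms = YAML.safe_load(yaml_source) || {}
  end

  # Same contract as WordnetSynonymEngine#get_synonyms:
  # a list of synonyms, or nil when the word is unknown.
  def get_synonyms(word)
    @synonyms[word]
  end
end

engine = YAMLSynonymEngine.new("ferret:\n  - polecat\n  - weasel\n")
p engine.get_synonyms('ferret')
p engine.get_synonyms('rabbit')
```

Because the interface is a single method, swapping engines only changes
the argument passed to SynonymAnalyzer.new.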

Thanks for the tips; I’ll see what I can accomplish.
Curtis