Forum: Ferret Extending/Modifying QueryParser

Mitchell Curtis Hatter (Guest)
on 2007-07-07 05:29
(Received via mailing list)
Hi,

I've implemented synonym searching in my Rails application, and there's
one more feature I'd like to add but can't figure out how to build. I'd
like to let the end user choose whether or not to search for the
synonyms of a word, preferably by extending the query language to parse
a construct such as '%word1' and turn that word into an OR list
(i.e., word1|word2|word3|...).

Currently, the query parser calls SynonymTokenFilter for every token to
get its synonyms, whether the user wants them or not. Is there a way I
can make this optional per word?

Here's an overview of what I've done so far:

The model classes in my Rails app use acts_as_ferret with a call that
looks like:

acts_as_ferret(
     :fields => [:body],
     :store_class_name => true,
     :ferret => {
         :or_default => false,
         :analyzer => SynonymAnalyzer.new(WordnetSynonymEngine.new, [])
     }
)


I created a SynonymAnalyzer and SynonymTokenFilter:

class SynonymAnalyzer < Ferret::Analysis::Analyzer
   include Ferret::Analysis

   def initialize(synonym_engine, stop_words = FULL_ENGLISH_STOP_WORDS, lower = true)
     @synonym_engine = synonym_engine
     @lower = lower
     @stop_words = stop_words
   end

   def token_stream(field, str)
     ts = StandardTokenizer.new(str)
     ts = LowerCaseFilter.new(ts) if @lower
     ts = StopFilter.new(ts, @stop_words)
     ts = SynonymTokenFilter.new(ts, @synonym_engine)
   end
end

class SynonymTokenFilter < Ferret::Analysis::TokenStream
   include Ferret::Analysis

   def initialize(token_stream, synonym_engine)
     @token_stream = token_stream
     @synonym_stack = []
     @synonym_engine = synonym_engine
   end

   def text=(text)
     @token_stream.text = text
   end

   # Emit the queued synonyms of the previous token before pulling the
   # next token from the wrapped stream.
   def next
     return @synonym_stack.pop if @synonym_stack.size > 0

     token = @token_stream.next
     add_synonyms_to_stack(token) if token

     return token
   end

   private
   def add_synonyms_to_stack(token)
     synonyms = @synonym_engine.get_synonyms(token.text)

     return if synonyms.nil?

     synonyms.each do |s|
       @synonym_stack.push(
         Token.new(s, token.start, token.end, 0))
     end
   end
end

Finally, a WordnetSynonymEngine that queries the wordnet index I created:

class WordnetSynonymEngine
   include Ferret::Search

   def initialize(index_name = "wordnet")
     @searcher = Searcher.new("#{RAILS_ROOT}/index/#{ENV['RAILS_ENV']}/#{index_name}")
   end

   def get_synonyms(word)
     @searcher.search_each(TermQuery.new(:word, word)) do |doc_id, score|
       return @searcher[doc_id][:syn]
     end

     return nil
   end
end


It works great, except that I'd really like the ability to run tokens
through the SynonymTokenFilter only when they are prefixed with an
unescaped % sign.

Also, if anyone is interested, I can post the code for turning the
wordnet prolog database into a Ferret index (mostly a port of the Java
Lucene program that does the same thing to Ruby and Ferret).

Thanks,
Curtis
Jens Kraemer (Guest)
on 2007-07-10 10:15
(Received via mailing list)
On Fri, Jul 06, 2007 at 11:18:09PM -0400, Mitchell Curtis Hatter wrote:
> get synonyms for each token. Is there a way I can go about achieving
> this functionality?

You have to extend Ferret's Query Parser to achieve this. If you don't
want to mess around with the grammar stuff the parser code is generated
from, you could also preprocess user queries to modify them accordingly
before giving them to the QueryParser. Can get complicated, too ;-)

Atm you're doing the synonym stuff twice, once at indexing time and once
when Queries are parsed. Because of the insertion of synonyms in the
index at indexing time, adding synonyms to Queries is not really needed
any more.
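
To make the point concrete, here is a toy inverted index in plain Ruby.
It is not Ferret code: ToyIndex and the synonym table are made up for
illustration, and a real index also stores positions and scores. It only
shows that once synonyms are written into the index, a plain query term
finds the document without any expansion at query time.

```ruby
# Toy inverted index (hypothetical, not part of Ferret): maps each term
# to the ids of the documents that contain it.
class ToyIndex
  def initialize(synonyms = {})
    @synonyms = synonyms
    @postings = Hash.new { |hash, term| hash[term] = [] }
  end

  # Index-time expansion: every token is stored together with its synonyms.
  def add(doc_id, text)
    text.downcase.scan(/\w+/).each do |token|
      ([token] + @synonyms.fetch(token, [])).each do |term|
        @postings[term] << doc_id unless @postings[term].include?(doc_id)
      end
    end
  end

  # Plain term lookup; the query term is NOT expanded.
  def search(term)
    @postings.fetch(term.downcase, [])
  end
end

index = ToyIndex.new({ "ferret" => ["polecat"] })
index.add(1, "My ferret sleeps all day")
index.add(2, "A rabbit runs")
p index.search("polecat") # document 1 matches although it never says "polecat"
```

Expanding the query as well would only add terms that the index already
covers.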

So you don't really want to specify your SynonymAnalyzer for aaf as the
analyzer to use for indexing and searching (aaf doesn't support
different analyzers for indexing/searching bec. in general it's a good
idea to use the same analyzer in both cases).

If you used plain Ferret and wanted Synonyms everywhere or in a specific
field, but for ALL queries, you could use your Analyzer at indexing time,
but not for Query parsing. In your case, using your WordnetEngine in a
customized QueryParser or a custom query preprocessor would be the
better way.

> Here's an overview of what I've done so far:

[..]

That's really cool stuff, would you mind posting this to Ferret's Wiki
so other people can more easily find it? If you included the
WordnetSynonymEngine that would be even better :-)

Cheers,
Jens
Mitchell Curtis Hatter (Guest)
on 2007-07-10 20:13
(Received via mailing list)
> You have to extend Ferret's Query Parser to achieve this. If you don't
> want to mess around with the grammar stuff the parser code is generated
> from, you could also preprocess user queries to modify them accordingly
> before giving them to the QueryParser. Can get complicated, too ;-)

I do not enjoy writing parsers, and am not especially good at it. I
think first I'll check out the grammar for the parser and see if I
can modify that. Perhaps creating a SynonymQuery class?

I did consider preprocessing user queries and then just grouping the
resulting OR'd query in parentheses: 'rabbit %{ferret}' would become
'rabbit (ferret|"black-footed ferret"|etc|etc)'. I'm sure there are
situations where that would not work well, but it's an option.
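
A minimal sketch of that preprocessing idea, in plain Ruby so it runs
without Ferret. Everything here is an assumption for illustration:
HashSynonymEngine is a stand-in with the same get_synonyms interface as
the WordnetSynonymEngine, SynonymQueryPreprocessor is a made-up name, and
the '|' separator is simply the notation used in this thread (swap in
whatever your query parser accepts). A % preceded by an odd number of
backslashes is treated as escaped and left alone.

```ruby
# Hypothetical stand-in for WordnetSynonymEngine, backed by a Hash.
class HashSynonymEngine
  def initialize(table)
    @table = table
  end

  # Returns an Array of synonyms, or nil when the word is unknown.
  def get_synonyms(word)
    @table[word.downcase]
  end
end

# Hypothetical preprocessor: rewrites %word and %{word} into an OR group
# before the string is handed to the query parser.
class SynonymQueryPreprocessor
  # Optional run of backslashes, then % followed by {text} or a bare word.
  MARKER = /(\\*)%(?:\{([^}]+)\}|(\w+))/

  def initialize(engine, separator = "|")
    @engine = engine
    @separator = separator
  end

  def expand(query)
    query.gsub(MARKER) do
      m = Regexp.last_match
      slashes = m[1]
      # An odd number of backslashes means the % itself is escaped.
      next m[0] if slashes.length.odd?

      word = m[2] || m[3]
      synonyms = @engine.get_synonyms(word) || []
      terms = ([word] + synonyms).uniq.map { |term| quote(term) }
      expansion = terms.size > 1 ? "(#{terms.join(@separator)})" : terms.first
      slashes + expansion
    end
  end

  private

  # Multi-word synonyms are double-quoted so they stay a phrase.
  def quote(term)
    term =~ /\s/ ? %("#{term}") : term
  end
end

engine = HashSynonymEngine.new({ "ferret" => ["black-footed ferret", "polecat"] })
puts SynonymQueryPreprocessor.new(engine).expand('rabbit %{ferret}')
# prints: rabbit (ferret|"black-footed ferret"|polecat)
```

A regex pass like this knows nothing about the rest of the query syntax,
so a marker inside a quoted phrase or a field prefix would still be
rewritten; that is one of the situations where it would not work well.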

> So you don't really want to specify your SynonymAnalyzer for aaf as the
> analyzer to use for indexing and searching (aaf doesn't support
> different analyzers for indexing/searching bec. in general it's a good
> idea to use the same analyzer in both cases).

Thanks, I was looking at aaf wondering how I could specify a
different analyzer to use for searches. I didn't find anything that
would really let me get hold of the QueryParser to change the
analyzer used. Glad I wasn't just missing it.

> If you used plain Ferret and wanted Synonyms everywhere or in a specific
> field, but for ALL queries, you could use your Analyzer at indexing time,
> but not for Query parsing. In your case, using your WordnetEngine in a
> customized QueryParser or a custom query preprocessor would be the
> better way.

Since this isn't for anything but fun right now (at work I'm stuck
using Oracle's full text engine, which has its own set of problems),
I'll first try modifying the QueryParser grammar to account for a new
query type. My C is not very good, so hopefully I won't have to do much,
but I like that solution better than having to write a preprocessor
for queries.

> That's really cool stuff, would you mind posting this to Ferret's Wiki
> so other people can more easily find it? If you included the
> WordnetSynonymEngine that would be even better :-)
>
> Cheers,
> Jens

Thanks, I've posted it to the Ferret wiki. It's quite long, but I hope
that's not a problem. I included the WordnetSynonymEngine and created
a YAMLSynonymEngine just to show how it can be pluggable.
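
For readers who don't have the wiki page handy, a pluggable engine only
needs a get_synonyms(word) method that returns an Array or nil. The
sketch below is not the version posted to the wiki; the class name
SimpleYamlSynonymEngine and the YAML layout (word mapped to a list of
synonyms) are assumptions made for the example.

```ruby
require "yaml"

# Hypothetical YAML-backed engine with the same get_synonyms interface
# that SynonymTokenFilter calls on the WordnetSynonymEngine.
class SimpleYamlSynonymEngine
  # yaml_source is a YAML document mapping each word to its synonyms.
  def initialize(yaml_source)
    data = YAML.safe_load(yaml_source) || {}
    @synonyms = {}
    data.each do |word, syns|
      @synonyms[word.to_s.downcase] = Array(syns).map(&:to_s)
    end
  end

  # Returns an Array of synonyms, or nil when the word is not listed.
  def get_synonyms(word)
    @synonyms[word.to_s.downcase]
  end
end

engine = SimpleYamlSynonymEngine.new(<<~YAML)
  ferret:
    - polecat
    - black-footed ferret
  quick: fast
YAML
p engine.get_synonyms("ferret")
```

In the analyzer it would be passed in place of the wordnet engine, e.g.
SynonymAnalyzer.new(SimpleYamlSynonymEngine.new(File.read(path)), []),
where path points at your own synonym file.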

Thanks for the tips. I'll see what I can accomplish.
Curtis