A few questions: Tweaking StemFilter, indexes,

phillyloco · January 21, 2007, 6:11pm

Hello all,

I am new to the list, but I have been using ferret for a little bit
already. I would first like to thank Dave for all his work on ferret.

I had a few questions that I haven’t been able to figure out after
messing around with ferret and going through the documentation.

StemFilter ------

I am trying to improve the quality of my searches in the context of my
application's content. I have created an analyzer using the
following:

    StemFilter.new(
      StopFilter.new(
        LowerCaseFilter.new(StandardTokenizer.new(text)),
        @stop_words))

This has been pretty good so far. However, I would really like a
search for “plumber” to match “plumbing”, perhaps at a lower score than it
would match “plumbers”. The problem is that plumber(s) is stemmed to
“plumber” while plumbing is stemmed to “plumb”, so they don’t match. Is
there any way to tweak the filter to make these matches? I
would like related nouns and verbs to match each other (ideally with a
lower score than different conjugations of the same verb would get). Another
example would be “driving” and “driver”.

Worst case scenario, I could probably do some preprocessing to the
search queries to expand “plumber” or “driving” to a query that
includes both stems (for example expand the query for plumber to
“plumber plumb”)
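A minimal sketch of that preprocessing step, in plain Ruby so it can be tried without an index. The RELATED_STEMS table, the 0.5 boost and the expand_query helper are all made up for illustration, and the output assumes the query parser accepts `OR` and `^boost` in query strings, which is worth checking against your Ferret version:

```ruby
# Hypothetical lookup table: query term => related stems to search as well.
# In a real application this would be built from your own vocabulary.
RELATED_STEMS = {
  "plumber" => ["plumb"],
  "driver"  => ["drive"],
}.freeze

# Assumed down-weighting for the related stems (tune to taste).
RELATED_BOOST = 0.5

# Rewrites each known term into "(term OR related^boost ...)" and leaves
# every other term untouched.
def expand_query(query)
  query.split.map do |term|
    related = RELATED_STEMS.fetch(term.downcase, [])
    next term if related.empty?

    alternatives = related.map { |stem| "#{stem}^#{RELATED_BOOST}" }
    "(#{([term] + alternatives).join(' OR ')})"
  end.join(" ")
end

puts expand_query("cheap plumber")
```

The lookup here is on the raw query term, so “plumbers” would not be expanded unless the term is stemmed before the lookup. Quoted phrases are also not handled; splitting on whitespace would break them apart.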

Indexes ------

I was wondering how exactly indexes are implemented under the hood and
if there is a way to give hints to ferret as to how our queries will
be formed in order to optimize performance. Maybe I’m thinking of
ferret too much as a database, but I am not too familiar with what’s
under ferret’s hood.

The reason I ask is that for the project I am working on, I have huge
amounts of text to search, but each item also has a location
associated with it (longitude & latitude), and each query only needs
to search the text located in a specific area (point and radius).
I can add range parameters to the query and that will work, but is
that optimal? Hopefully I am making sense.
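For reference, here is a rough sketch of how the point-and-radius restriction can be turned into two range clauses. Everything in it is an assumption for illustration: the field names `lat` and `lon`, the fixed-width encoding, and the `field:[low high]` range syntax. The encoding is there on the assumption that ranges compare terms as strings, in which case coordinates have to be indexed as zero-padded, non-negative numbers for the ordering to come out right.

```ruby
KM_PER_DEGREE = 111.32 # rough length of one degree of latitude, in km

# Smallest lat/lon box that contains the circle. Approximate: it ignores
# the poles and the 180-degree meridian.
def bounding_box(lat, lon, radius_km)
  dlat = radius_km / KM_PER_DEGREE
  dlon = radius_km / (KM_PER_DEGREE * Math.cos(lat * Math::PI / 180.0))
  { lat: [lat - dlat, lat + dlat], lon: [lon - dlon, lon + dlon] }
end

# Shift the coordinate to be non-negative and zero-pad it, so that
# string order matches numeric order (5 decimal places kept).
def encode_coord(degrees, offset)
  format("%09d", ((degrees + offset) * 100_000).round)
end

# Builds the range part of the query; the same encoding must be used
# for the lat/lon fields at index time.
def geo_clause(lat, lon, radius_km)
  box = bounding_box(lat, lon, radius_km)
  lat_lo, lat_hi = box[:lat].map { |d| encode_coord(d, 90) }
  lon_lo, lon_hi = box[:lon].map { |d| encode_coord(d, 180) }
  "lat:[#{lat_lo} #{lat_hi}] AND lon:[#{lon_lo} #{lon_hi}]"
end

puts geo_clause(40.0, -75.0, 10)
```

The box is a superset of the circle, so hits near the corners still need an exact distance check after the search if the radius has to be strict.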

Donations ------

I was wondering if there is a page that lists the total amount of
donations so far?

Thanks,
-carl

phillyloco · January 22, 2007, 1:38am

Hi,

You could use a FuzzyQuery, which matches words that resemble the
search term to some degree and gives them a lower score.
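To get a feel for what a fuzzy match would and would not catch here, the sketch below computes an edit-distance similarity in plain Ruby. It assumes a Lucene-style score of 1 - distance / (length of the shorter word); Ferret's exact formula and default threshold may differ, so treat the numbers as illustrative only.

```ruby
# Classic Levenshtein edit distance, computed row by row.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed similarity measure: 1 - distance / length of the shorter word.
def similarity(a, b)
  shorter = [a.length, b.length].min
  return 0.0 if shorter.zero?

  1.0 - levenshtein(a, b).to_f / shorter
end

[%w[plumber plumb], %w[driver drive], %w[plumber plumper]].each do |a, b|
  puts format("%-8s %-8s %.2f", a, b, similarity(a, b))
end
```

Note that the comparison is purely on spelling: under this measure “plumper” is closer to “plumber” than “plumb” is, so a fuzzy query can pull in unrelated words along with the related ones.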

phillyloco · January 22, 2007, 6:13pm

Excerpts from Carl L.'s message of Sun Jan 21 09:09:59 -0800 2007:

> Worst case scenario, I could probably do some preprocessing to the
> search queries to expand “plumber” or “driving” to a query that
> includes both stems (for example expand the query for plumber to
> “plumber plumb”)

You can either do query expansion or you can modify the stemmer. Query
expansion is probably a little easier to experiment with because you
don’t have to worry about reindexing, but it does come with a
search-time cost which may or may not be negligible. (And it gets a
little tricky with phrasal queries.)
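For comparison, the "modify the stemmer" route can be as small as an override table applied to the stemmer's output. This is only a sketch in plain Ruby: STEM_OVERRIDES and normalize_stems are hypothetical names, the “driver” entry is an assumed example, and wiring the step into Ferret's analyzer chain is left out.

```ruby
# Hypothetical overrides applied after the regular stemmer has run:
# agent nouns are collapsed onto the stem of the related verb.
STEM_OVERRIDES = {
  "plumber" => "plumb",
  "driver"  => "drive",
}.freeze

# Takes already-stemmed tokens and returns them with overrides applied.
# The same step has to run at index time and at query time.
def normalize_stems(stems)
  stems.map { |stem| STEM_OVERRIDES.fetch(stem, stem) }
end

p normalize_stems(%w[cheap plumber])
```

The trade-off is that “plumber” and “plumbing” become the same term in the index, so they can no longer be scored differently, and any change to the table means reindexing.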

> I can add ranged parameters to the query and that will work, but is
> that optimal? Hopefully I am making sense.

I don’t know for sure whether Ferret is sophisticated enough to optimize
retrieval based on multiple ranges, but it may very well be. In any
case, I think you’re doing the right thing.