Search multiple models

frocco · April 26, 2006, 12:41pm

Hello,

Let’s say you have a few models like Post, Article, Wiki, Comment, and
you want to use ferret to search all of them at once. How would I set up
the latest acts_as_ferret to accomplish this? And what would be fastest
for searches? 1 index for all models, or have an index per model?

Thank you

frocco · May 3, 2006, 6:56pm

Hi,

and sorry for the late reply.

On Wed, Apr 26, 2006 at 12:41:10PM +0200, Frank Rosquin wrote:

Hello,

Let’s say you have a few models like Post, Article, Wiki, Comment, and
you want to use ferret to search all of them at once. How would I set up
the latest acts_as_ferret to accomplish this? And what would be fastest
for searches? 1 index for all models, or have an index per model?

Which would be fastest depends on the type of your queries. If most of
your queries search all models at once, a single index should be faster.

If you tend to query mainly a single model and queries across all models
are the exception, the index-per-model approach should be better suited.

However the difference won’t matter until you get to really big indexes.

If you go the multiple index route (declaring acts_as_ferret in each of
the models you want to search), you can use the

multi_search(query, additional_models = [], options = {})

method on any of these model classes, giving the list of all other model
classes to search through as the second parameter. The options hash is
the same as for find_by_contents. You have to add the
:store_class_name => true option to your acts_as_ferret calls. That
turns on class name storage in the indexes and lets multi_search know
which class to query for a given hit.
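
To illustrate why the class name has to travel with each hit, here is a minimal plain-Ruby sketch. It does not use acts_as_ferret and is not how multi_search is implemented; the Hit struct and merge_hits helper are hypothetical names made up for this example. It only shows that once hits from several per-model indexes are merged, the id alone no longer tells you which model a hit belongs to.

```ruby
# Hypothetical stand-in for a search hit; not an acts_as_ferret class.
Hit = Struct.new(:class_name, :id, :score)

# Merge the hit lists of several per-model indexes into one list,
# best score first.
def merge_hits(*hit_lists)
  hit_lists.flatten(1).sort_by { |hit| -hit.score }
end

post_hits    = [Hit.new("Post", 1, 0.9), Hit.new("Post", 7, 0.4)]
comment_hits = [Hit.new("Comment", 1, 0.7)]

merged = merge_hits(post_hits, comment_hits)

# Post 1 and Comment 1 share an id; only class_name tells them apart.
merged.each { |hit| puts "#{hit.class_name} #{hit.id} (#{hit.score})" }
```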

For the single index route, using Rails single table inheritance is the
easiest approach. Just call acts_as_ferret once in your base class, and
use find_by_contents as usual. This is known to work, I use this with
Typo’s Content base class.

If this is no option for you, you can configure each model class to use
the same index directory. This approach should work but hasn’t got much
(if any) testing so far.
One problem here is that we use the id column as a key in ferret indexes,
too, so the id has to be unique across the models you want to search.
In addition, you would be on your own for querying the index; I don’t
think any of the existing search methods will work out of the box in
this scenario. The :store_class_name option to acts_as_ferret should
be useful in this context, too.
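
The id collision described above can be sketched in plain Ruby. This is a toy model only: a Hash stands in for the shared index, and index_key is a hypothetical helper, not part of acts_as_ferret or Ferret. It shows one way a key could be made unique across models, by combining the class name with the id.

```ruby
# Hypothetical helper: build a key that stays unique when several
# models share one index, by combining class name and id.
def index_key(class_name, id)
  "#{class_name}:#{id}"
end

# Toy "shared index": key => document. A real Ferret index is not a Hash.
shared_index = {}

records = [["Post", 1, "ferret tips"], ["Comment", 1, "nice post"]]
records.each do |class_name, id, text|
  shared_index[index_key(class_name, id)] = {
    :class_name => class_name,
    :id         => id,
    :content    => text
  }
end

# Keyed by id alone, the Comment would have overwritten the Post.
puts shared_index.keys.inspect
```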

Patches regarding these issues would be very welcome - my
hacking time is quite constrained atm…

After all, I’d either suggest the STI approach or, if that doesn’t fit,
the multi-index route.

hope this helps,

Jens

–
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

frocco · October 23, 2006, 12:41am

Hey guys, any idea how to use those options with multi_search?

I tried it on find_by_contents and it works fine. However, for
multi_search I do:

@results = User.multi_search(parse(@query),[Book],{:offset=>0,:limit=>5})

or

@results = User.multi_search(parse(@query),[Book],:offset=>0,:limit=>5)

and neither works, although I get no error either. What’s wrong?

frocco · December 27, 2006, 6:05pm

Hi all,

just started to play with (acts_as_)ferret a couple of hours ago, when I
learned that ferret supports fuzzy search.

I have not yet found an answer to the problem I need to solve:

I have a few models with one to many relations to Clients: Addresses,
Contacts, Phone numbers, etc.

i.e. a client may have many addresses and so on.

I need to match a “flat” (each attribute only once) client record
against all the models’ attributes mentioned above and get a list of
clients with descending probability of being a duplicate.

Is this possible? Which options should I use to save memory and
performance?

Thanks in advance! - Bernd

frocco · January 10, 2007, 10:02am

On Wed, Dec 27, 2006 at 06:05:46PM +0100, Martin Bernd S. wrote:

i.e. a client may have many addresses and so on.

I need to match a “flat” (each attribute only once) client record
against all the models’ attributes mentioned above and get a list of
clients with descending probability of being a duplicate.

Is this possible?

As a first try I’d build a single Ferret document for each client,
containing all his contacts, addresses and phone numbers. For better
results you could keep all addresses in one field, phone numbers in
another, and contact names in a third field.

Then take each record you suspect of being a duplicate and build a query
from it, distributing the data to different fields in the same way.
Running that query against the index should give you a list of possible
duplicate records sorted by relevance.
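
The approach described above can be sketched in plain Ruby as follows. Everything here is an assumption made for illustration: the field names (:addresses, :phones, :contacts), the helper names, and the choice to join the clauses with OR. The sketch only builds a document hash and a query string; it does not touch a Ferret index, and it assumes the tokens are plain words that need no query escaping.

```ruby
# Collapse a client's one-to-many data into a single document,
# with one field per kind of data. Field names are assumptions.
def build_document(client_id, addresses, phones, contacts)
  {
    :id        => client_id,
    :addresses => addresses.join(" "),
    :phones    => phones.join(" "),
    :contacts  => contacts.join(" ")
  }
end

# Build a query string from a flat candidate record, using the same
# fields as the document. Empty fields contribute no clauses.
def build_query(record)
  clauses = record.flat_map do |field, value|
    value.to_s.split.map { |token| "#{field}:#{token}" }
  end
  clauses.join(" OR ")
end

doc = build_document(42, ["Main Street 5", "Oak Road 9"], ["5550100"], ["Ann Lee"])
candidate = { :addresses => "Main Street 5", :phones => "5550100", :contacts => "" }
query = build_query(candidate)

puts doc.inspect
puts query
```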

Which options should I use to save memory and
performance?

There seems to be no need to store the field contents themselves in the
index, so this should be turned off with :store => :no when the index is
created. Otherwise I’d first make it work and then look if further
optimization is needed at all - Ferret is really fast.

cheers,
Jens

frocco · October 23, 2006, 11:42am

On Mon, Oct 23, 2006 at 12:41:08AM +0200, Eric G. wrote:

@results = User.multi_search(parse(@query),[Book],:offset=>0,:limit=>5)

and neither works, however I get no error either. What’s wrong?

That’s not implemented yet, but there’s a patch in Trac I plan to
integrate into the next release of aaf.

http://projects.jkraemer.net/acts_as_ferret/ticket/60
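
Until :offset and :limit are supported, one possible stopgap is to slice the hits in Ruby. The sketch below is plain Ruby with a hypothetical page helper; it assumes the hits are available as an Array, which this thread does not confirm for multi_search, and it can only page through whatever hits the search actually returned.

```ruby
# Hypothetical helper: apply :offset and :limit to an Array of hits.
# Returns an empty Array when the offset lies beyond the last hit.
def page(hits, options = {})
  offset = options.fetch(:offset, 0)
  limit  = options.fetch(:limit, hits.size)
  hits[offset, limit] || []
end

all_hits = (1..12).to_a  # stand-in for search results
second_page = page(all_hits, :offset => 5, :limit => 5)
puts second_page.inspect
```

Slicing after the fact still fetches every hit first, so it costs more than paging inside the search itself.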

Jens

frocco · January 16, 2007, 12:40pm

On Tue, Jan 16, 2007 at 11:49:12AM +0100, Martin Bernd S. wrote:

Hi Jens,

[…]

The main issues we have are the well-known locking problem and the
scores.

Making sure you only have one process writing to the index, i.e. via
an indexer running in backgroundrb, should solve these issues.

The scores leave us with the problem that - while the order seems to be
correct - we don’t know where to cut the line to display results and
what a relevant match is. For a dozen attributes I’ve seen scores from
0.something to 9.something, with a result just below 9 not even
looking similar while just above 9 seems to be a “99 percent” match.

The calculation of scores is quite complex. To get an idea of what happens
in there you can use Ferret’s explain method (in
Ferret::Search::Searcher).

If someone would tell me - in case this is possible at all - how to
normalize the scores I’d be very happy.

No idea if this is possible. Maybe you can find some information about
this in the context of Lucene (i.e. in Erik Hatcher’s fine Lucene book
or on the Lucene mailing list).

Another thing which I didn’t understand yet is what actually happens if
I do a multi token fuzzy search; currently I’m splitting the string up
in multiple tokens and build one query “attribute:token1~ AND
attribute:token2~ AND …”. Maybe not really what I should do to get
correct scores.

I don’t know if there is another way to express this with Ferret.

cheers,
Jens

frocco · January 16, 2007, 11:49am

Hi Jens,

thanks for the answer. (Because of time constraints) I solved the
problem in a different way, i.e. providing each model a client_id method
and then summing up the individual fuzzy search results for each
attribute.

I guess this is neither elegant nor performant and I’m not happy with the
resulting scores. But we can live with it for now.

The main issues we have are the well-known locking problem and the
scores.

The scores leave us with the problem that - while the order seems to be
correct - we don’t know where to cut the line to display results and
what a relevant match is. For a dozen attributes I’ve seen scores from
0.something to 9.something, with a result just below 9 not even
looking similar while just above 9 seems to be a “99 percent” match.

If someone would tell me - in case this is possible at all - how to
normalize the scores I’d be very happy.

Another thing which I didn’t understand yet is what actually happens if
I do a multi token fuzzy search; currently I’m splitting the string up
in multiple tokens and build one query “attribute:token1~ AND
attribute:token2~ AND …”. Maybe not really what I should do to get
correct scores.
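
The query construction described here can be sketched in plain Ruby. The helper name fuzzy_and_query is made up for this example, and the sketch only builds the query string; it assumes whitespace-separated tokens that need no escaping.

```ruby
# Split the input on whitespace and require a fuzzy match (~) on every
# token within the same attribute.
def fuzzy_and_query(attribute, text)
  text.split.map { |token| "#{attribute}:#{token}~" }.join(" AND ")
end

fuzzy_query = fuzzy_and_query("name", "Martin Bernd")
puts fuzzy_query
```

Because the clauses are joined with AND, a record in which even one token has no fuzzy match is excluded entirely rather than merely scored lower.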

Anyways, thanks for your work and for answering my post.

frocco · January 16, 2007, 12:50pm

Again thanks for the answers.

I did read the score formula, but my maths knowledge is almost gone now.
I’ll look at the docs again when I have more spare time. With the most
relevant word I should be able to scale the scores to a percentage.
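
One simple way to get percentages is to scale every score against the best hit of the same query. The sketch below is plain Ruby with a hypothetical normalize_scores helper; it is not a Ferret feature. Note the limitation: the top hit always comes out at 100, even when it is a poor match, so this gives a relative ranking and not an absolute measure of similarity.

```ruby
# Scale raw scores to percentages of the best score in the list.
# An empty list, or a list whose best score is not positive, maps to zeros.
def normalize_scores(scores)
  top = scores.max
  return scores.map { 0.0 } if top.nil? || top <= 0
  scores.map { |score| (score / top.to_f * 100).round(1) }
end

percentages = normalize_scores([8.0, 4.0, 2.0])
puts percentages.inspect
```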

We have multiple servers running, so we definitely have concurrency
problems. I just haven’t played with stuff like backgroundrb yet, so I need
to investigate how to implement a single-writer solution. But thanks
for the hint.
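
The single-writer idea can be illustrated with a minimal plain-Ruby sketch using a Queue and one worker thread. The SingleWriter class is a hypothetical name for this example and is not part of backgroundrb or acts_as_ferret. It only serializes writes within one process; with several servers, the writer would have to live in a separate process that all servers send their jobs to, as suggested earlier in the thread.

```ruby
require "thread"

# Funnels all write jobs through one worker thread, so the block passed
# to new is never run by two threads at once.
class SingleWriter
  def initialize(&write)
    @queue  = Queue.new
    @worker = Thread.new do
      # Only this thread ever calls the write block.
      while (job = @queue.pop) != :stop
        write.call(job)
      end
    end
  end

  # Safe to call from any thread.
  def enqueue(job)
    @queue << job
  end

  # Process everything queued so far, then stop the worker.
  def shutdown
    @queue << :stop
    @worker.join
  end
end

written = []
writer = SingleWriter.new { |doc| written << doc }  # stand-in for an index write

producers = 3.times.map do |n|
  Thread.new { writer.enqueue("doc-#{n}") }
end
producers.each(&:join)
writer.shutdown

puts written.sort.inspect
```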