Forum: Ferret Search multiple models

Frank Rosquin (cnf)
on 2006-04-26 12:41
Hello,

Let's say you have a few models like Post, Article, Wiki, and Comment, and
you want to use Ferret to search all of them at once. How would I set up
the latest acts_as_ferret to accomplish this? And which would be fastest
for searches: one index for all models, or an index per model?

Thank you
Jens Kraemer (Guest)
on 2006-05-03 18:56
(Received via mailing list)
Hi,

and sorry for the late reply.

On Wed, Apr 26, 2006 at 12:41:10PM +0200, Frank Rosquin wrote:
> Hello,
>
> Let's say you have a few models like Post, Article, Wiki, Comment, and
> you want to use ferret to search all of them at once. How would I set up
> the latest acts_as_ferret to accomplish this? And what would be fastest
> for searches? 1 index for all models, or have an index per model?

Which would be fastest depends on the type of your queries. If most of
your queries search all models at once, a single index should be faster.

If you tend to query mainly a single model and queries across all models
are the exception, the index-per-model approach should be better suited.

However, the difference won't matter until you get to really big indexes.



If you go the multiple index route (declaring acts_as_ferret in each of
the models you want to search), you can use the

multi_search(query, additional_models = [], options = {})

method on any of these model classes, giving the list of all other model
classes to search through as the second parameter. The options hash is
the same as for find_by_contents. You have to add the
:store_class_name => true option to your acts_as_ferret calls. That
turns on class name storage in the indexes and lets multi_search know
which class to query for a given hit.
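
To illustrate why the class name has to be stored: hits merged from several indexes carry only an id unless the class name is kept alongside it. The following is a toy sketch in plain Ruby, not acts_as_ferret's actual code; `Hit`, `RECORDS` and `resolve_hits` are made-up names used only for illustration.

```ruby
# Toy model of a merged multi-index result. Hit, RECORDS and resolve_hits
# are hypothetical names, not part of acts_as_ferret.
Hit = Struct.new(:class_name, :id, :score)

# Stand-ins for the database tables behind each model.
RECORDS = {
  "Post"    => { 1 => "Post #1", 2 => "Post #2" },
  "Article" => { 1 => "Article #1" }
}

# Sort merged hits by descending score and look each one up in the table
# named by its stored class name. Without class_name, id 1 is ambiguous.
def resolve_hits(hits)
  hits.sort_by { |h| -h.score }
      .map { |h| RECORDS.fetch(h.class_name).fetch(h.id) }
end

hits = [Hit.new("Post", 1, 0.4), Hit.new("Article", 1, 0.9), Hit.new("Post", 2, 0.7)]
p resolve_hits(hits)  # ["Article #1", "Post #2", "Post #1"]
```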



For the single index route, using Rails single table inheritance is the
easiest approach. Just call acts_as_ferret once in your base class, and
use find_by_contents as usual. This is known to work, I use this with
Typo's Content base class.

If this is not an option for you, you can configure each model class to use
the same index directory. This approach should work but has had little
(if any) testing so far.
One problem here is that we use the id column as a key in Ferret indexes,
too, so the id has to be unique across the models you want to search.
In addition, you would be on your own for querying the index; I don't
think any of the existing search methods will work out of the box in
this scenario. The :store_class_name option to acts_as_ferret should
be useful in this context, too.

Patches regarding these issues would be very welcome - my
hacking time is quite constrained atm...


All in all, I'd suggest either the STI approach or, if that doesn't fit,
the multi-index route.


hope this helps,

Jens


--
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer@webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66
Eric G. (gotskill10)
on 2006-10-23 00:41
Hey guys, any idea how to use those options with multi_search?

I tried them on find_by_contents and they work fine. However, for
multi_search I do:

@results = User.multi_search(parse(@query), [Book], {:offset => 0, :limit => 5})

or

@results = User.multi_search(parse(@query), [Book], :offset => 0, :limit => 5)

and neither works, though I get no error either. What's wrong?
Jens Kraemer (Guest)
on 2006-10-23 11:42
(Received via mailing list)
On Mon, Oct 23, 2006 at 12:41:08AM +0200, Eric Gross wrote:
> @results =  User.multi_search(parse(@query),[Book],:offset=>0,:limit=>5)
>
> and neither works, however I get no error either. What's wrong?

That's not implemented yet, but there's a patch in Trac that I plan to
integrate into the next release of acts_as_ferret (aaf).

http://projects.jkraemer.net/acts_as_ferret/ticket/60


Jens


Martin Bernd Schmeil (thebernd)
on 2006-12-27 18:05
Hi all,

just started to play with (acts_as_)ferret a couple of hours ago, when I
learned that ferret supports fuzzy search.

I have not yet found an answer to the problem I need to solve:

I have a few models with one to many relations to Clients: Addresses,
Contacts, Phone numbers, etc.

That is, a client may have many addresses and so on.

I need to match a "flat" (each attribute only once) client record
against all the attributes of the models mentioned above and get a list of
clients in descending order of their probability of being a duplicate.

Is this possible? Which options should I use to save memory and
performance?

Thanks in advance! - Bernd
Jens Kraemer (Guest)
on 2007-01-10 10:02
(Received via mailing list)
On Wed, Dec 27, 2006 at 06:05:46PM +0100, Martin Bernd Schmeil wrote:
> i.e. a client may have many addresses and so on.
>
> I need to match a "flat" (each attribute only once) client record
> against all the models attributes mentioned above and get a list of
> clients with descending probability of being a duplicate.
>
> Is this possible?

As a first try I'd build a single Ferret document for each client,
containing all his contacts, addresses and phone numbers. For better
results you could keep all addresses in one field, phone numbers in
another, and contact names in a third field.

Then take each record you suspect of being a duplicate and build a query
from it, distributing the data across the fields in the same way.
Running that query against the index should give you a list of possible
duplicate records sorted by relevance.
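
A rough sketch of that idea in plain Ruby, without touching an index: one helper flattens a client into a three-field document, another builds a query string from a candidate record using the same fields. The input hash layout, the field names (`:addresses`, `:phones`, `:contacts`) and both helper names are assumptions, and it further assumes that `OR` is accepted by the query parser in the same way `AND` is in the queries quoted in this thread.

```ruby
# Flatten one client and its associated rows into a single document with
# three fields. The input layout and field names are assumptions.
def client_document(client)
  {
    :id        => client[:id],
    :addresses => client[:addresses].join(" "),
    :phones    => client[:phones].join(" "),
    :contacts  => client[:contacts].join(" ")
  }
end

# Build a query string from a candidate record, distributing its values
# over the same fields. Terms are OR-ed so partial matches still score.
def duplicate_query(candidate)
  terms = [:addresses, :phones, :contacts].flat_map do |field|
    candidate.fetch(field, "").to_s.scan(/[[:alnum:]]+/).map { |t| "#{field}:#{t}" }
  end
  terms.join(" OR ")
end

client = { :id => 7, :addresses => ["12 Main St", "PO Box 9"],
           :phones => ["5550100"], :contacts => ["Ann Lee"] }
puts client_document(client)[:addresses]
# 12 Main St PO Box 9
puts duplicate_query({ :addresses => "12 Main Street", :contacts => "Ann Lee" })
# addresses:12 OR addresses:Main OR addresses:Street OR contacts:Ann OR contacts:Lee
```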

> Which options should I use to save memory and
> performance?

There seems to be no need to store the field contents themselves in the
index, so this should be turned off with :store => :no when the index is
created. Otherwise I'd first make it work and then look if further
optimization is needed at all - Ferret is *really* fast.

cheers,
Jens


Martin Bernd Schmeil (thebernd)
on 2007-01-16 11:49
Hi Jens,

thanks for the answer. (Because of time constraints) I solved the
problem in a different way, i.e. providing each model a client_id method
and then summing up the individual fuzzy search results for each
attribute.

I guess this is neither elegant nor performant, and I'm not happy with the
resulting scores. But we can live with it for now.

The main issues we have are the well-known locking problem and the
scores.

The scores leave us with the problem that - while the order seems to be
correct - we don't know where to draw the line for displaying results or
what counts as a relevant match. For a dozen attributes I've seen scores from
0.something to 9.something, with a result just below 9 not even
looking similar while one just above 9 seems to be a "99 percent" match.

If someone would tell me - in case this is possible at all - how to
normalize the scores I'd be very happy.

Another thing I don't understand yet is what actually happens when
I do a multi-token fuzzy search. Currently I'm splitting the string
into multiple tokens and building one query, "attribute:token1~ AND
attribute:token2~ AND ...". Maybe that's not what I should do to get
correct scores.
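
For reference, the query construction described here can be sketched in a few lines of plain Ruby. `fuzzy_and_query` is a hypothetical helper name, and the tokenizer (runs of letters and digits) is an assumption about how the string gets split.

```ruby
# Split a string into tokens and build "field:token~ AND field:token~ ...".
# fuzzy_and_query is a hypothetical helper, not part of Ferret.
def fuzzy_and_query(field, text)
  text.scan(/[[:alnum:]]+/).map { |token| "#{field}:#{token}~" }.join(" AND ")
end

puts fuzzy_and_query(:name, "Martin Schmeil")
# name:Martin~ AND name:Schmeil~
```

Because every clause is joined with AND, a record must fuzzily match each token to be returned at all.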

Anyways, thanks for your work and for answering my post.
Jens Kraemer (Guest)
on 2007-01-16 12:40
(Received via mailing list)
On Tue, Jan 16, 2007 at 11:49:12AM +0100, Martin Bernd Schmeil wrote:
> Hi Jens,
>
[..]
> The main issues we have is the well known locking problem and the
> scores.

Making sure you only have one process writing to the index, i.e. via
an indexer running in backgroundrb, should solve these issues.
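
The single-writer pattern itself can be shown in-process with Ruby's standard Queue: many producers enqueue updates, exactly one thread applies them. This is only an analogue; with several servers the writer would have to be a separate process such as the backgroundrb indexer mentioned above. `IndexWriterQueue` is a hypothetical name, and the block passed to `new` stands in for the real index write call.

```ruby
require "thread"

# Many producers enqueue index updates; a single thread applies them.
# IndexWriterQueue is a hypothetical name used for illustration.
class IndexWriterQueue
  def initialize(&apply)
    @queue  = Queue.new
    @writer = Thread.new do
      # Pop jobs until the :stop sentinel arrives.
      while (job = @queue.pop) != :stop
        apply.call(job)
      end
    end
  end

  def enqueue(job)
    @queue << job
  end

  # Let the writer drain the remaining jobs, then wait for it to finish.
  def shutdown
    @queue << :stop
    @writer.join
  end
end

written = []
writer = IndexWriterQueue.new { |doc| written << doc }
producers = 3.times.map { |i| Thread.new { writer.enqueue("doc-#{i}") } }
producers.each(&:join)
writer.shutdown
puts written.sort.inspect  # ["doc-0", "doc-1", "doc-2"]
```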

> The scores leave us with the problem that - while the order seems to be
> correct - we don't know where to cut the line to display results and
> what a relevant match is. For a dozen attributes I've seen scores from
> 0.something to 9.something, with a result close below 9 not even
> looking similar while just above 9 seems to be a "99 percent" match.

The calculation of scores is quite complex. To get an idea of what happens
in there, you can use Ferret's explain method (in
Ferret::Search::Searcher).

> If someone would tell me - in case this is possible at all - how to
> normalize the scores I'd be very happy.

No idea if this is possible. Maybe you can find some information about
this in the context of Lucene (e.g. in Erik Hatcher's fine Lucene book
or on the Lucene mailing list).

> Another thing which I didn't understand yet is what actually happens if
> I do a multi token fuzzy search; currently I'm splitting the string up
> in multiple tokens and build one query "attribute:token1~ AND
> attribute:token2~ AND ...". Maybe not really what I should do to get
> correct scores.

I don't know if there is another way to express this with Ferret.


cheers,
Jens

Martin Bernd Schmeil (thebernd)
on 2007-01-16 12:50
Again thanks for the answers.

I did read the score formula, but my maths knowledge is almost gone now.
I'll look at the docs again when I have more spare time. With the most
relevant word I should be able to scale the scores to a percentage.
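
One simple reading of "scale the scores to a percentage" is to divide every score by the best score of the same result set. The sketch below does only that; `relative_percentages` is a hypothetical helper, and the result is a relative measure, so it does not by itself settle where to cut off irrelevant hits.

```ruby
# Scale raw scores relative to the best hit so the top result is 100.
# This says how a hit compares to the best hit of the same query, not
# how likely it is to be a true match.
def relative_percentages(scores)
  top = scores.max
  return scores.map { 0.0 } if top.nil? || top <= 0
  scores.map { |s| (s / top.to_f * 100).round(1) }
end

p relative_percentages([8.0, 6.0, 1.0])  # [100.0, 75.0, 12.5]
```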

We have multiple servers running, so we definitely have concurrency
problems. I just haven't played with stuff like backgroundrb yet, so I need
to investigate how to implement a single-writer solution. But thanks
for the hint.