Strange wildcard problem

Hi,
Apologies for reposting this for those who read this via ruby-forum,
but it didn’t make it to the list before, and the list seems more
active…
I’m using ferret (via acts_as_ferret) in a somewhat unorthodox
manner and am having a strange wildcard problem. Before anyone wonders
why we’re doing things this way, the answer is basically that it lets
us precompute what would be expensive database queries and store the
results in a simple way (ferret index) prior to pushing the static
data to our production server.
Basically, I’ve got two (for the sake of simplicity) models, both of
which are indexed on a similar (but separate) non-model field.
However, one of those two models does not seem to get the proper
number of results for a wildcard search:
First of all, there’s a non-indexed model called ProductTuple that has
a supplier_id, a product_category_id, a product_material_id, and some
other id fields that aren’t really important here. Thus, a ProductTuple
has foreign key relationships to Suppliers, ProductCategories, and
ProductMaterials, but for ferret purposes just think of those foreign
keys as what they are: ids (i.e. integers).
The first model, Supplier, is ferret-indexed on several fields, such
as the supplier name and supplier country, as well as the
‘ferret_product_tuples’ non-model field. ferret_product_tuples simply
takes all the product tuples for a supplier and concatenates their
product_category_id, product_material_id, etc. with delimiters.
So, for a product tuple with product_category_id 82,
product_material_id 88, and undefined product_technique_id, the
resulting part of the ferret_product_tuple string would look like
x00082_00088_00000x (where we use 00000 to indicate null). The x
characters are used as anchors, essentially, as a given supplier’s
ferret_product_tuple string might look like ‘x00082_00088_00000x
x00000_00081_00013x’.
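
The encoding described above can be sketched in plain Ruby. This is only an illustration: the helper names encode_tuple and ferret_product_tuple are hypothetical, and the sketch assumes exactly three id positions per tuple (category, material, technique), each zero-padded to five digits.

```ruby
# Hypothetical helper: encode one product tuple as an anchored term.
# A nil id is written as 00000, following the convention in the post.
def encode_tuple(category_id, material_id, technique_id)
  parts = [category_id, material_id, technique_id].map { |id| format('%05d', id || 0) }
  "x#{parts.join('_')}x"
end

# Hypothetical helper: all tuples of one supplier, separated by spaces.
def ferret_product_tuple(tuples)
  tuples.map { |tuple| encode_tuple(*tuple) }.join(' ')
end

tuples = [[82, 88, nil], [nil, 81, 13]]
puts ferret_product_tuple(tuples)
# prints: x00082_00088_00000x x00000_00081_00013x
```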
Now, the ferret query that gets constructed when we do the relevant
queries simply looks like:
‘ferret_product_tuple:x00082_??????x’
and this would, in the above instance, match that supplier.
Everything I’ve described works perfectly, EXCEPT…
we also index product_categories on this same string. So product
category #82 would have a bunch of ferret_product_tuple strings that
start out x00082 and have various things in the other positions.
Here’s what’s strange… a product_category query for
‘ferret_product_tuple:x?????????x’ should return ALL product
categories, right? Yet it only returns six. A product category query
for ‘ferret_product_tuple:x???00081???x’ should return all the
product categories that share product_tuples with product_material
#81, but in fact returns only a small number of categories. Yet making
the wildcard match MORE restrictive by substituting
‘ferret_product_tuple:x00082_00081???x’ into that query yields
product_category #82, which is erroneously not included in the 6
results for ‘ferret_product_tuple:x???00081???x’.
So, have I stumbled upon a bug in the wildcard handling? My initial
thought was that the different analyzer I was using for the
product_category index was the culprit, but I changed that analyzer
out to no effect, so I’ve ruled that out.
Any ideas? Thanks!

Hi!

Wildcard queries have a built-in upper limit on the number of terms
they search for, which by default is set to 512 (according to
http://ferret.davebalmain.com/api/classes/Ferret/Search/WildcardQuery.html).

So when you query for asdf*, Ferret expands this to all terms in your
index starting with asdf, but stops after collecting 512 terms. It then
retrieves all documents containing those 512 terms. Documents that match
your query only through a term that wasn’t collected in the first step
are therefore missed.
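
To make the effect concrete, here is a toy model in plain Ruby. It is not Ferret's implementation: the index is a made-up hash from term to document ids, the cap is 3 instead of 512, the patterns are written in an assumed x?????_?????_?????x form, and taking the first matching terms in sorted order is a simplifying assumption about which terms survive the cap.

```ruby
# Toy model only; this does not use or reproduce Ferret's code.
# Convert a wildcard pattern (? = one character, * = any run) to a Regexp.
def wildcard_to_regexp(pattern)
  body = pattern.chars.map do |ch|
    case ch
    when '?' then '.'
    when '*' then '.*'
    else Regexp.escape(ch)
    end
  end.join
  Regexp.new("\\A#{body}\\z")
end

# Expand the pattern against the term dictionary, keep at most max_terms
# terms, then return the ids of documents containing any kept term.
def capped_wildcard_search(index, pattern, max_terms)
  regexp = wildcard_to_regexp(pattern)
  kept = index.keys.sort.select { |term| term =~ regexp }.first(max_terms)
  kept.flat_map { |term| index[term] }.uniq.sort
end

# term => ids of the documents that contain it (made-up data)
index = {
  'x00001_00081_00000x' => [1],
  'x00002_00081_00000x' => [2],
  'x00003_00081_00000x' => [3],
  'x00082_00081_00000x' => [82]
}

p capped_wildcard_search(index, 'x?????_00081_?????x', 3) # => [1, 2, 3]
p capped_wildcard_search(index, 'x00082_00081_?????x', 3) # => [82]
```

The broad pattern matches four terms but only three are kept, so document 82 is dropped; the narrower pattern expands to a single term and finds it. That mirrors the symptom in the original post, where the more restrictive query returned a category the broader one missed.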

Of course you can set the max_term count to a higher value, but in the
long run this isn’t really a solution. If I understand you correctly,
your tuple field right now has a single term for each document, and that
term is different for each document. Splitting up your tuple values into
several different terms could help to reduce the number of terms a
wildcard query has to fetch.

Cheers,
Jens

On Mon, Nov 05, 2007 at 04:11:53PM -0500, Noah M. Daniels wrote:

Any ideas? Thanks!


Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk


Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
[email protected] | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

Interesting, thanks. Actually I can’t split the tuple values up – the
requirement is to see those terms occur together in the same tuple, not
just for the same document (there is a difference in this case). So,
I’ll try expanding the max_term count to see if that helps; otherwise
I’ll have to rethink the solution.
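
The difference between "same tuple" and "same document" can be shown with a small plain-Ruby sketch. The data and the split term names (cat82, mat81) are made up for illustration, and only two id positions are used to keep it short.

```ruby
# Made-up data: one supplier with two product tuples, [category_id, material_id].
tuples = [[82, 88], [0, 81]]

# Split indexing (hypothetical term names): every id becomes its own term.
split_terms = tuples.flat_map { |cat, mat| ["cat#{cat}", "mat#{mat}"] }

# Whole-tuple indexing: each tuple stays a single anchored term.
tuple_terms = tuples.map { |cat, mat| format('x%05d_%05dx', cat, mat) }

# Question: does this supplier have category 82 AND material 81 in ONE tuple?
split_match = %w[cat82 mat81].all? { |term| split_terms.include?(term) }
tuple_match = tuple_terms.include?('x00082_00081x')

puts split_match # true  (false positive: the ids come from different tuples)
puts tuple_match # false (correct: no single tuple pairs 82 with 81)
```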

On Tue, Nov 06, 2007 at 11:25:56AM -0500, Noah M. Daniels wrote:

Where can I change this option so it’ll work for a remote server with AAF?

Placing it at the end of acts_as_ferret’s init.rb should work.

Cheers,
Jens


On Nov 6, 2007, at 9:00 AM, Noah Daniels wrote:

Jens, many thanks; upping the max_terms (max_clauses seems to be the
same thing) solved the problem beautifully. However, now I’m trying to
get this working with a remote ferret server (using acts_as_ferret)
and not having any luck. Particularly, I can’t figure out where to set
max_terms (or Ferret::Search::MultiTermQuery.default_max_terms= ) such
that the remote ferret server will pick it up – including in the
start script for the remote ferret server. Where can I change this
option so it’ll work for a remote server with AAF?

thanks!

On Nov 6, 2007, at 11:35 AM, Jens K. wrote:

Placing it at the end of acts_as_ferret’s init.rb should work.

Unfortunately, it doesn’t seem to. For a local index, I can just put
this anywhere in code (even in a controller, or in the console) and I
start getting correct results from my query:

Ferret::Search::MultiTermQuery.default_max_terms = 5000

but on my staging server, where a drb ferret server is used, putting
that line in the init.rb doesn’t seem to do anything – in fact, even
putting it into the initialize method of the LocalIndex class doesn’t
help! Any ideas?

thanks!

Just a ‘ping’, since I still haven’t been able to solve this without
doing what I don’t want to do (putting this into my local copy of
ferret itself).

Setting this in init.rb in the acts_as_ferret plugin does nothing.
Does anyone have a suggestion for where it would work?

thanks!