Partition results based on field

Hello all

I’m using Ferret for a site wide search where I have several kinds of
(similar) objects in a central index (using a “type” field containing
the class name). This works great, and I can search all objects with one
query.

What I’d like to do now is to limit the results so that there will be a
maximum of 10 (or 5 or whatever) results for each type… I can’t figure
out how to do this, so I thought maybe someone brighter than me has done
this before or knows how to do it? :)

Trent S.

On 6/23/06, Trent S. [email protected] wrote:

Hi Trent,

The way to do this is to search for more than you need and then
go through each search result, counting the types in a hash and
only adding a doc if its type count is under the threshold. If you
fail to retrieve enough results, search again and repeat until
you get the required number of results. For those of you who know the
Lucene API, this is where a Hits class comes in handy. It’ll be coming
in a future version. For now I’ll show you the easiest way: do a single
search with :num_docs set to max_doc (the code below uses index.size),
thereby getting all search results in one go:

def get_results(search_str, max_type = 5, num_required = 10)
    type_counter = Hash.new(0)  # hits accepted so far, per type
    results = []
    # Ask for every hit in one go so that a single search is enough.
    index.search_each(search_str, :num_docs => index.size) do |doc_id, score|
        doc = index[doc_id]
        # Only keep the doc while its type is still under the per-type cap.
        if type_counter[doc[:type]] < max_type
            results << doc
            type_counter[doc[:type]] += 1
        end
        break if results.size >= num_required
    end
    return results
end
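
The "search again and repeat" variant described above could look roughly like the sketch below. It is only an illustration, not Ferret code: fetch_page is a hypothetical callable standing in for whatever returns the ranked hits from offset to offset + page_size, and the names capped_results, max_per_type and page_size are made up for this example.

```ruby
# Hypothetical sketch: fetch_page.call(offset, limit) is assumed to return an
# array of docs (hashes with a :type key) in rank order. It is a stand-in,
# not part of the Ferret API.
def capped_results(fetch_page, max_per_type: 5, num_required: 10, page_size: 20)
  type_counter = Hash.new(0)  # hits accepted so far, per type
  results = []
  offset = 0
  loop do
    page = fetch_page.call(offset, page_size)
    page.each do |doc|
      # Skip docs whose type has already reached its cap.
      next if type_counter[doc[:type]] >= max_per_type
      results << doc
      type_counter[doc[:type]] += 1
      return results if results.size >= num_required
    end
    # A short page means there are no more hits, so stop searching.
    return results if page.size < page_size
    offset += page_size
  end
end

# Toy data standing in for ranked search hits.
hits = (1..30).map { |i| { id: i, type: i <= 12 ? "Article" : "Comment" } }
fetch = ->(offset, limit) { hits[offset, limit] || [] }

top = capped_results(fetch, max_per_type: 5, num_required: 10, page_size: 4)
puts top.map { |doc| "#{doc[:type]} #{doc[:id]}" }.join(", ")
```

Compared with fetching everything at once, this only pulls as many pages as it takes to fill the result list, at the cost of running more than one search.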

Hope that helps,
Dave

Hi,

I suspected I’d have to do something like this. Thanks for putting me on
the right path. Are there any concerns about scalability/speed when the
index grows larger regarding searching the whole index like this?

T

On 6/27/06, Trent S. [email protected] wrote:

As long as you’re using the C-backed version of Ferret, the index
would have to grow very large before speed becomes a concern in this
case. Note that Ferret has to go through every single search
result anyway to check its score, no matter what you have :num_docs set
to. The only thing you use more of with a high value of
:num_docs is memory (approximately 12 bytes per hit).
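
To put that figure in perspective, here is a back-of-the-envelope calculation. It takes the approximate 12 bytes per hit at face value, and hit_memory_bytes is just a helper name made up for this example:

```ruby
# Rough estimate only: 12 bytes per hit is the approximate figure quoted above.
def hit_memory_bytes(num_hits, bytes_per_hit = 12)
  num_hits * bytes_per_hit
end

bytes = hit_memory_bytes(1_000_000)
puts "#{bytes} bytes, about #{(bytes / (1024.0 * 1024)).round(1)} MiB"
```

So even a query that matches a million documents would need on the order of 12 MB for the hit list.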

Cheers,
Dave