Grouping results

I have a general question about using a Ferret/Lucene index for
grouping results. I am not sure how much of the heavy lifting the
index can do for me, so I would appreciate any input. I am using
Ferret to index some objects that have the following properties:

url, image_url, price, tags (space separated tags), created_at

I would like to search the index for any documents that match a
specific tag. The results will then be processed as follows:

Each URL must be unique in the results. If there are duplicates, I
would like to merge the results using some fuzzy merge criteria.
Ideally, this merge would take the most common occurrence of each of
the properties and apply them to the final single result.

My current thought on how to implement this is to run a standard
search sorted by URL, then manually apply the merge logic to each
set of documents that share a URL.
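
For illustration, the merge step could look roughly like the Ruby
sketch below. It assumes the hits have already been loaded from the
index into plain hashes and arrive sorted by URL; the helper names
most_common and merge_sorted_hits are hypothetical and not part of
Ferret. Ties are broken in favour of the value seen first, which is
only one possible reading of "fuzzy merge criteria".

```ruby
# Fields merged by majority vote (taken from the property list above).
MERGED_FIELDS = [:image_url, :price, :tags, :created_at]

# Return the value that occurs most often; ties go to the value seen first.
def most_common(values)
  counts = Hash.new(0)
  values.each { |v| counts[v] += 1 }
  best = counts.values.max
  values.find { |v| counts[v] == best }
end

# Collapse runs of adjacent hits that share a URL into one result each.
# Relies on the input being sorted by URL, as the search would return it.
def merge_sorted_hits(hits)
  hits.slice_when { |a, b| a[:url] != b[:url] }.map do |group|
    merged = { url: group.first[:url] }
    MERGED_FIELDS.each do |field|
      merged[field] = most_common(group.map { |h| h[field] })
    end
    merged
  end
end
```

Because only adjacent hits are compared, the results can be processed
in one pass without holding a lookup table of every URL seen so far.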

Does this sound reasonable?

Thanks,
Tom

On 1/27/06, Tom D. wrote:

> Each URL must be unique in the results. If there are duplicates, I
> would like to merge the results using some fuzzy merge criteria.
> Ideally, this merge would take the most common occurrence of each of
> the properties and apply them to the final single result.
>
> My current thought on how to implement this is to run a standard
> search sorted by URL, then manually apply the merge logic to each
> set of documents that share a URL.
>
> Does this sound reasonable?

Hi Tom,

That sounds like the way I’d probably do it. I don’t know if this will
help but did you know that documents can contain multiple fields with
the same name? So effectively you could store a unique document for
each URL and store an array of image_urls, prices and tags in that
document.
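
As a rough sketch of that idea, the per-URL document could be
assembled in plain Ruby before it is handed to the index. The helper
name build_url_documents is hypothetical, and how array values are
passed to Ferret for a multi-valued field is not shown here; check
that against the documentation for your Ferret version.

```ruby
# Hypothetical helper: fold raw records into one document hash per URL,
# keeping every observed value in an array instead of picking a winner.
def build_url_documents(records)
  docs = {}
  records.each do |r|
    doc = docs[r[:url]] ||= { url: r[:url], image_url: [], price: [], tags: [] }
    doc[:image_url] << r[:image_url]
    doc[:price] << r[:price]
    # Tags arrive as one space-separated string, so split them first.
    doc[:tags].concat(r[:tags].split)
  end
  docs.values
end
```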

Hope that helps,
Dave

Thanks Dave. Actually, I did not know that. That may be a useful
feature. The only problem I foresee is how to remove the values that
came from one source object out of those arrays when that object is
deleted. I will give it some more thought, but it is nice to have
options.
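
One possible way around the deletion problem, sketched under the
assumption that the source records are also kept outside the index:
remember which record contributed which values, and rebuild the
per-URL document from whatever remains after a delete. The class
UrlAggregate and its methods are hypothetical, and replacing the
stored document in the index afterwards is left to the caller.

```ruby
# Hypothetical store: tracks the source records behind each URL so the
# merged document can be rebuilt after one source is removed.
class UrlAggregate
  def initialize
    # url => { source_id => record }
    @sources = Hash.new { |hash, url| hash[url] = {} }
  end

  def add(source_id, record)
    @sources[record[:url]][source_id] = record
  end

  # Remove one source. Returns the rebuilt document, or nil when the
  # URL has no sources left (the indexed document should then be dropped).
  def delete(url, source_id)
    return nil unless @sources.key?(url)
    @sources[url].delete(source_id)
    if @sources[url].empty?
      @sources.delete(url)
      return nil
    end
    document_for(url)
  end

  # Build the multi-valued document for a URL from its remaining sources.
  def document_for(url)
    return nil unless @sources.key?(url)
    records = @sources[url].values
    {
      url: url,
      image_url: records.map { |r| r[:image_url] },
      price: records.map { |r| r[:price] },
      tags: records.flat_map { |r| r[:tags].split }
    }
  end
end
```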

Thanks again,
Tom