Comments inline…

On 7/17/06, Sam G. [email protected] wrote:

I for one think this custom filter would be an awesome addition.
Geospatial and local search is a hot area and it would be cool if
ferret facilitated this type of query easily.

Agreed. Though it does seem like an abuse of the search engine. The
search engine’s goal is to retrieve as few documents as possible to
satisfy the query, as far as I can tell anyways. David is right, and
performing a calculation on every document makes less and less sense
the more I think about it.

into a SQL IN query? Afraid I have no idea how efficient that would be
either…

This is exactly how I’m doing it now, but the problem is that the data
I’m using is so spread out location-wise that sometimes I only get
40-50 good hits for every 1,000 entries returned from ferret. So when
I need to return 100 results that are within a certain distance, I
find myself going back to ferret a few times per query to retrieve
more results, which is inefficient. I could just pull more results out
of ferret in the first place, but most of the time 1,000 is more than
enough to get 100 good results. Testing will let me find the optimal
number to pull from ferret, but I figured that if I could put the
distance calculation into ferret itself, then I could ask for 100
results and get 100 results every time.
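For reference, the "go back for more" loop described above could be sketched roughly like this. It is only an illustration, not the actual code in use: `collect_matching` and `fetch_page` are hypothetical names, and `fetch_page` stands in for whatever call pulls the next slice of ranked hits out of ferret.

```ruby
# Keep pulling pages of ranked hits until enough of them pass the
# location test, or until the index runs out of hits.
#
# fetch_page is a hypothetical callable: fetch_page.call(offset, limit)
# returns an array with the next slice of hits (empty when exhausted).
# The block passed to collect_matching decides whether a hit is "good",
# e.g. whether it lies within the wanted distance.
def collect_matching(fetch_page, wanted, batch_size = 1000)
  kept = []
  offset = 0
  while kept.size < wanted
    page = fetch_page.call(offset, batch_size)
    break if page.empty?                      # nothing left to fetch
    kept.concat(page.select { |hit| yield(hit) })
    offset += page.size
  end
  kept.first(wanted)                          # trim any surplus from the last page
end
```

Each extra pass through the loop is one more round trip to the index, which is the cost being complained about here.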

Anyone in here have a best practice?

I would like to know if anyone else has tackled this as well, and has
some tips.

```
end
index.search_each(query, :filter_proc => within_radius) {|d, s| ...}
```

Does this sound like a good idea? If so I could add it to a future
version of Ferret. Please let me know if you can think of a better way
to do this.
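Since the start of the quoted snippet is cut off, here is a rough, standalone sketch of what a `within_radius` proc might look like. Everything in it is an assumption for illustration: the proc is assumed to be called with a document id and a score (any further arguments are ignored), `locations` is a hypothetical hash mapping document ids to `[lat, lng]` pairs, and the distance is a plain haversine calculation.

```ruby
# Great-circle distance in kilometres between two lat/lng points
# (haversine formula, mean Earth radius of 6371 km).
def haversine_km(lat1, lng1, lat2, lng2)
  to_rad = Math::PI / 180.0
  dlat = (lat2 - lat1) * to_rad
  dlng = (lng2 - lng1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlng / 2)**2
  2 * 6371.0 * Math.asin(Math.sqrt(a))
end

# Builds a proc that could be handed to the proposed :filter_proc option.
# ASSUMPTION: the proc receives (doc_id, score, ...); locations is a
# hypothetical lookup of doc_id => [lat, lng].
def build_within_radius(locations, center_lat, center_lng, radius_km)
  lambda do |doc_id, _score, *_ignored|
    lat, lng = locations[doc_id]
    # Documents without a known location are filtered out.
    !lat.nil? && haversine_km(center_lat, center_lng, lat, lng) <= radius_km
  end
end
```

The resulting proc would then be passed as `:filter_proc => within_radius`, as in the snippet above.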

This is how I’m doing it now. I guess adding the filter_proc would
clean up my code a bit, and simplify the paging etc. My question is
how you’d handle the problem I mentioned earlier: how to determine
how many documents to retrieve before the filter_proc is evaluated in
order to eventually return the desired number of documents. I don’t
know enough about the internals of ferret to know if I’m bringing up a
valid point, but I’m guessing that if I only request the top 5
documents for a query, it doesn’t retrieve every single document that
satisfies the query and then take the top 5 from that list. Maybe it
does, though. As I said, I don’t know enough about the internals of
ferret, though I’d like to…

So if the problem that I bring up is legitimate, then the difficulty
would be in coming up with some sort of heuristic based on how many
documents are expected to satisfy the filter_proc. If only 10% of the
documents satisfy the filter_proc, then to get the top 5 documents
matching a query, we’d want to retrieve the top 50 documents
internally, then pass them through the filter_proc, and hopefully we’d
be left with at least 5 to return. For my specific application, I’m in
a better position to estimate this hit percentage, and so I’m in a
better position to do the filtering. I don’t know whether doing this
in ferret would be efficient or even feasible.
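To make the arithmetic concrete, the heuristic amounts to dividing the number of results wanted by the expected hit rate. A minimal sketch, where `internal_fetch_size` and the optional `safety` multiplier are made-up names for illustration and not anything ferret provides:

```ruby
# How many documents to pull before filtering, given the fraction of
# documents expected to pass the filter. safety > 1.0 over-fetches a
# little to allow for the hit rate being an estimate.
def internal_fetch_size(wanted, hit_rate, safety = 1.0)
  unless hit_rate > 0 && hit_rate <= 1
    raise ArgumentError, "hit_rate must be greater than 0 and at most 1"
  end
  (wanted * safety / hit_rate).ceil
end

internal_fetch_size(5, 0.1)   # the 10% example: fetch 50 to keep 5
```

The hit rate is only an expectation, so a given query can still come up short, and some fallback (such as fetching a second, larger batch) would still be needed.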

Anyways, let me know what your thoughts are on this. The filter_proc
idea is a good one, as long as it can be implemented efficiently.
Otherwise I’ll just keep using my two-phase method: retrieve the
documents from ferret, and then do the location filtering in SQL.

–

Cheers,

Jordan F.

[email protected]