Comments inline…
On 7/17/06, Sam G. [email protected] wrote:
I for one think this custom filter would be an awesome addition.
Geospatial and local search is a hot area and it would be cool if
ferret facilitated this type of query easily.
Agreed. Though it does seem like an abuse of the search engine. The
search engine’s goal is to retrieve as few documents as possible to
satisfy the query, as far as I can tell anyways. David is right, and
performing a calculation on every document makes less and less sense
the more I think about it.
into a sql IN query? Afraid I have no idea how efficient that would be
either…
This is exactly how I’m doing it now, but the problem is that the data
I’m using is so spread out location-wise that sometimes I only get
40-50 good hits for every 1,000 entries returned from ferret. So I
find myself going back to ferret to retrieve more results a few times
for each query, when I need to return 100 results that are within a
certain distance, which is inefficient. I could
just pull more results out of ferret in the first place, but most of
the time 1,000 is more than enough to get 100 good results. Testing
will let me find the optimal number to pull from ferret, but I
figured that if I could put the distance calculation into ferret
itself, then I could ask for 100 results and get 100 results every
time.
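For reference, the retry loop I describe above looks roughly like this. It is only a sketch: collect_nearby, fetch_page and nearby are made-up names. fetch_page stands in for whatever call pulls a page of results out of ferret at a given offset and limit, and nearby stands in for the distance check, so neither is real ferret API.

```ruby
# Two-phase paging sketch (hypothetical helper, not part of ferret).
# fetch_page: callable taking (offset, limit), returns an array of candidates.
# nearby:     callable taking one candidate, returns true if it is in range.
def collect_nearby(wanted, page_size, fetch_page, nearby)
  hits = []
  offset = 0
  loop do
    page = fetch_page.call(offset, page_size)
    break if page.empty?                      # nothing left in the index
    hits.concat(page.select { |doc| nearby.call(doc) })
    break if hits.size >= wanted              # enough good results
    offset += page.size                       # go back for the next page
  end
  hits.first(wanted)
end
```

Every extra pass through the loop is another round trip to the index, which is the cost I would like to avoid.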
Anyone in here have a best practice?
I would like to know if anyone else has tackled this, and whether
they have any tips.
end
index.search_each(query, :filter_proc => within_radius) {|d, s| ...}
Does this sound like a good idea? If so I could add it to a future
version of Ferret. Please let me know if you can think of a better way
to do this.
This is how I’m doing it now. I guess adding the filter_proc would
clean up my code a bit, and simplify the paging etc. My question would
be how you’d handle the problem that I mentioned earlier, that is, how
to determine how many documents to retrieve before the filter_proc is
evaluated in order to eventually return the desired number of
documents. I don’t know enough about the internals of ferret to know
if I’m bringing up a valid point, but I’m guessing that if I only
request the top 5 documents for a query, it doesn’t retrieve every
single document that satisfies the query and then take the top 5 from
that list. Maybe it does; as I said, I don’t know enough about
the internals of ferret, though I’d like to…
So if the problem that I bring up is legitimate, then the challenge
would be coming up with some sort of heuristic based on how many
documents are expected to satisfy the filter_proc. If only 10% of the
documents satisfy the filter_proc, then to get the top 5 documents
matching a query, we’d want to retrieve the top 50 documents
internally, then pass them through the filter_proc, and hopefully we’d
be left with at least 5 to return. For my specific application, I’m in
a better position to determine this hit percentage, and so I’m in a
better position to do the filtering. I don’t know whether doing this
in ferret would be efficient or even feasible.
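To make the arithmetic concrete, here is a rough sketch of the heuristic. overfetch_limit is a made-up helper, and the hit rate is passed as two integers (good hits seen out of a sample size) purely so the division rounds up cleanly; nothing here is ferret API.

```ruby
# Estimate how many documents to request so that about `wanted` survive
# the filter, given that `good` out of `sampled` documents passed before.
# Hypothetical helper for illustration only.
def overfetch_limit(wanted, good, sampled)
  raise ArgumentError, "good must be between 1 and sampled" unless good >= 1 && good <= sampled
  (wanted * sampled + good - 1) / good   # integer ceiling of wanted * sampled / good
end

overfetch_limit(5, 1, 10)   # 10% pass rate, want 5 => 50
```

The ratio is only an average, so a given query can still come up short and would need a second fetch.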
Anyways, let me know what your thoughts are on this. The filter_proc
idea is a good one, as long as it can be implemented efficiently.
Otherwise I’ll just keep using my two-phase method: retrieve the
documents from ferret, and then do the location filtering in SQL.
--
Cheers,
Jordan F.
[email protected]