Re: adding a custom filter to the query

staypufd · July 17, 2006, 11:27am

I for one think this custom filter would be an awesome addition.
Geospatial and local search is a hot area and it would be cool if
ferret facilitated this type of query easily.

Would it be a significant performance hit if ferret has to cycle
through every document for this search? It might be fine over a couple
of hundred or a few thousand documents, but what about hundreds of
thousands?

Just tossing around the idea but…
This particular search (distance) can be done quite efficiently with
sql. Is it at all feasible that you could ‘outsource’ the query to
sql? Obviously sql could return the ids simply enough, but I guess
then you’d need to go through each document anyway… To return a
bitset, would the database need to know about the ferret document
order?

Or how about the reverse, use ferret to create a list of ids to pass
into a sql IN query? Afraid I have no idea how efficient that would be
either…

Anyone in here have a best practice?

staypufd · July 17, 2006, 3:14pm

Comments inline…

On 7/17/06, Sam G. [email protected] wrote:

I for one think this custom filter would be an awesome addition.
Geospatial and local search is a hot area and it would be cool if
ferret facilitated this type of query easily.

Agreed. Though it does seem like an abuse of the search engine. The
search engine’s goal is to retrieve as few documents as possible to
satisfy the query, as far as I can tell anyways. David is right, and
performing a calculation on every document makes less and less sense
the more I think about it.

into a sql IN query? Afraid I have no idea how efficient that would be
either…

This is exactly how I’m doing it now, but the problem is that the data
I’m using is so spread out location-wise that sometimes I only get
40-50 good hits for every 1,000 entries returned from ferret. And so I
find myself going back to ferret to retrieve more results a few times
for each query, when I need to return 100 results that are within a
certain distance. This is obviously inefficient. Obviously I could
just pull more results out of ferret in the first place, but most of
the time 1,000 is more than enough to get 100 good results. Obviously
testing will let me find the optimal number to pull from ferret, but I
figured that if I could put the distance calculation into ferret
itself, then I could ask for 100 results, and get 100 results every
time.
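For anyone following along, here is a rough sketch of the two-phase loop described above: pull a batch of ranked ids, filter them by distance, and go back for more until enough survive. Everything in it is an assumption for illustration. `fake_search` stands in for the Ferret call and the block stands in for the SQL distance check; neither is a real Ferret or database API.

```ruby
# Hypothetical stand-in for a Ferret search: returns up to `limit` ids
# starting at `offset`, already ordered by relevance.
def fake_search(ranked_ids, offset, limit)
  ranked_ids[offset, limit] || []
end

# Keep pulling batches until `wanted` ids pass the filter or the
# search runs dry. The block stands in for the SQL distance check.
def collect_nearby(ranked_ids, wanted, batch_size, &nearby)
  results = []
  offset = 0
  loop do
    batch = fake_search(ranked_ids, offset, batch_size)
    break if batch.empty?
    results.concat(batch.select(&nearby))
    break if results.size >= wanted
    offset += batch_size
  end
  results.first(wanted)
end

ranked_ids = (1..10_000).to_a
# Pretend 1 id in 20 is within range (roughly 50 good hits per 1,000).
hits = collect_nearby(ranked_ids, 100, 1_000) { |id| (id % 20).zero? }
puts hits.size
```

With this made-up hit rate the loop needs two round trips to fill 100 results, which is the inefficiency being described.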

Anyone in here have a best practice?

I would like to know if anyone else has tackled this, and has some
tips.

end

index.search_each(query, :filter_proc => within_radius) {|d, s| ...}
Does this sound like a good idea? If so I could add it to a future
version of Ferret. Please let me know if you can think of a better way
to do this.

This is how I’m doing it now. I guess adding the filter_proc would
clean up my code a bit, and simplify the paging etc. My question would
be how you’d handle the problem that I mentioned earlier, that is how
to determine how many documents to retrieve before the filter_proc is
evaluated in order to eventually return the desired number of
documents. I don’t know enough about the internals of ferret to know
if I’m bringing up a valid point, but I’m guessing that if I only
request the top 5 documents for a query, it doesn’t retrieve every
single document that satisfies the query and then take the top 5 from
that list. Maybe it does though, as I said, I don’t know enough about
the internals of ferret, though I’d like to…

So if the problem that I bring up is legitimate, then the problem
would be in coming up with some sort of heuristic based on how many
documents are expected to satisfy the filter_proc. If only 10% of the
documents satisfy the filter_proc, then to get the top 5 documents
matching a query, we’d want to retrieve the top 50 documents
internally, then pass them through the filter_proc, and hopefully we’d
be left with at least 5 to return. For my specific application, I’m in
a better position to determine this hit percentage, and so I’m in a
better position to do the filtering. I don’t know whether doing this
in ferret would be efficient or even feasible.
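The arithmetic behind that heuristic can be written down in a couple of lines. This is only a sketch; `fetch_size` and `pass_percent` are names made up here, and the pass rate itself would have to be measured per application.

```ruby
# Estimate how many raw hits to request so that about `wanted` of them
# survive a filter that passes roughly `pass_percent` percent of documents.
def fetch_size(wanted, pass_percent)
  (wanted * 100.0 / pass_percent).ceil
end

puts fetch_size(5, 10)     # the 10% example from the paragraph above
puts fetch_size(100, 4.5)  # 100 results at 45 good hits per 1,000
```

It is only an estimate: a particular query can still come up short, so a caller would need a fallback such as fetching another batch.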

Anyways, let me know what your thoughts are on this. The filter_proc
idea is a good one, as long as it can be implemented efficiently.
Otherwise I’ll just keep using my two phase method, retrieve the
documents from ferret, and then do the location filtering in SQL.

–
Cheers,
Jordan F.
[email protected]

staypufd · July 17, 2006, 3:46pm

On 7/17/06, Jordan F. [email protected] wrote:

end
be how you’d handle the problem that I mentioned earlier, that is how
to determine how many documents to retrieve before the filter_proc is
evaluated in order to eventually return the desired number of
documents. I don’t know enough about the internals of ferret to know
if I’m bringing up a valid point, but I’m guessing that if I only
request the top 5 documents for a query, it doesn’t retrieve every
single document that satisfies the query and then take the top 5 from
that list. Maybe it does though, as I said, I don’t know enough about
the internals of ferret, though I’d like to…

Ferret actually has to check the score of every single document in the
index that matches the query. It keeps a priority queue of as many
documents as it needs to return the result set. So if :num_docs is 50,
and :first_doc is 200 Ferret will need to keep a priority queue of 250
documents.
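As a rough illustration of that bookkeeping (not Ferret's actual C code), the sketch below scans every matching document but only ever retains the best first_doc + num_docs of them, then slices off the requested page. The `top_page` helper and the sample data are assumptions made up for this example.

```ruby
# Return one page of results from a list of [doc_id, score] pairs.
# Every pair is visited, but only the best (first_doc + num_docs)
# entries are retained while scanning.
def top_page(scored_docs, first_doc, num_docs)
  keep = first_doc + num_docs
  queue = [] # kept sorted, best score first
  scored_docs.each do |doc_id, score|
    # Position of the first retained entry with a lower score.
    idx = queue.bsearch_index { |(_, s)| s < score } || queue.size
    next if idx >= keep # not good enough to be retained
    queue.insert(idx, [doc_id, score])
    queue.pop if queue.size > keep
  end
  queue[first_doc, num_docs] || []
end

# 1,000 matching documents whose score equals their id, in random order.
docs = (1..1000).map { |i| [i, i] }.shuffle(random: Random.new(1))
page = top_page(docs, 200, 50)
puts page.size
puts page.first.inspect
```

With :first_doc at 200 and :num_docs at 50, the retained list never grows past 250 entries no matter how many documents match.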

So if the problem that I bring up is legitimate, then the problem
would be in coming up with some sort of heuristic based on how many
documents are expected to satisfy the filter_proc. If only 10% of the
documents satisfy the filter_proc, then to get the top 5 documents
matching a query, we’d want to retrieve the top 50 documents
internally, then pass them through the filter_proc, and hopefully we’d
be left with at least 5 to return. For my specific application, I’m in
a better position to determine this hit percentage, and so I’m in a
better position to do the filtering. I don’t know whether doing this
in ferret would be efficient or even feasible.

You wouldn’t need to request more documents than you need using the
:filter_proc idea. You’d just specify :num_docs as usual and you’d get
:num_docs back. So if you want 50 documents you’d get 50 documents (or
fewer if not enough documents matched the query and distance constraint).

Anyways, let me know what your thoughts are on this. The filter_proc
idea is a good one, as long as it can be implemented efficiently.
Otherwise I’ll just keep using my two phase method, retrieve the
documents from ferret, and then do the location filtering in SQL.

The proc would just be called once for every matching document in the
result set, not every document. It shouldn’t be too expensive at all
and probably a lot more efficient than filtering using the SQL method.
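To make the idea concrete, here is a sketch of a distance proc applied to each matching document before the top results are chosen, so asking for N results yields N whenever enough documents pass. It does not call Ferret at all: `filtered_top`, the sample documents, and the 50 km radius are assumptions for illustration, and the :filter_proc option is only a proposal at this point in the thread.

```ruby
EARTH_RADIUS_KM = 6371.0

# Great-circle distance between two points given in degrees (haversine).
def haversine_km(lat1, lon1, lat2, lon2)
  rad = Math::PI / 180
  dlat = (lat2 - lat1) * rad
  dlon = (lon2 - lon1) * rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * rad) * Math.cos(lat2 * rad) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a))
end

# Hypothetical matching documents with a relevance score and a location.
docs = [
  { id: 1, score: 0.4, lat: 49.15, lon: -123.10 },
  { id: 2, score: 0.9, lat: 43.65, lon: -79.38 },
  { id: 3, score: 0.7, lat: 49.28, lon: -123.12 },
  { id: 4, score: 0.8, lat: 47.61, lon: -122.33 },
  { id: 5, score: 0.6, lat: 49.10, lon: -122.80 }
]

# Call the filter once per matching document, then rank the survivors.
def filtered_top(docs, num_docs, filter)
  docs.select { |d| filter.call(d) }.max_by(num_docs) { |d| d[:score] }
end

center_lat, center_lon = 49.25, -123.10
within_radius = lambda do |doc|
  haversine_km(center_lat, center_lon, doc[:lat], doc[:lon]) <= 50
end

ids = filtered_top(docs, 2, within_radius).map { |d| d[:id] }
puts ids.inspect
```

The two best-scoring documents overall are far away, so they are dropped before ranking rather than after, which is why no over-fetching is needed.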

Cheers,
Dave

staypufd · July 17, 2006, 8:08pm

This is a “me too” post. I would love to replace the query filter we
use on tourb.us with this.

gary

staypufd · July 17, 2006, 5:29pm

On 7/17/06, David B. [email protected] wrote:

Ferret actually has to check the score of every single document in the
index that matches the query. It keeps a priority queue of as many
documents as it needs to return the result set. So if :num_docs is 50,
and :first_doc is 200 Ferret will need to keep a priority queue of 250
documents.

The proc would just be called once for every matching document in the
result set, not every document. It shouldn’t be too expensive at all
and probably a lot more efficient than filtering using the SQL method.

If that’s the case, then I think the filter_proc idea would be
fantastic, and I’d love to see it make its way into a future version.

–
Cheers,
Jordan F.
[email protected]

staypufd · July 18, 2006, 8:52pm

On 7/17/06, Gary E. [email protected] wrote:

This is a “me too” post. I would love to replace the query filter we
use on tourb.us with this.

gary

Maybe Gary, or someone else can help me, but I’ve put the query filter
problem aside, and I’m trying to do this by finding locations within a
bounding box using Range queries on the longitude and latitude.
Unfortunately I’m running into some problems, since I’m comparing
numeric values that can be positive or negative, and as far as I can
tell, Ferret (actually I’ve only been able to find information about
Lucene, but I’m assuming it’s the same) does the comparisons
lexicographically, and not numerically.
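A small plain-Ruby sketch of why signed values break under string comparison, with no Ferret involved. The `pad` helper and the sample numbers are made up for illustration.

```ruby
# Zero-pad to a fixed width; the sign counts toward the width,
# so -5 becomes "-005" and 12 becomes "0012".
def pad(n)
  format("%04d", n)
end

numbers = [-49, -5, 3, 12]
encoded = numbers.map { |n| pad(n) }

# String order puts "-005" before "-049", the reverse of numeric order.
sorted_strings = encoded.sort
puts sorted_strings.inspect

# A string range from -49 to 12 therefore misses -5.
lo = pad(-49)
hi = pad(12)
in_range = lo <= pad(-5) && pad(-5) <= hi
puts in_range
```

Positive values compare correctly once padded; it is only the ordering among negatives that comes out reversed.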

So I’ve tried to replicate the encoding described on the
SearchNumericalFields page for Apache Lucene (Java), but I’m
encountering some strange behaviour that is throwing me off.

So I index a bunch of documents, and see the following line of output:
Adding field latitude_string with value ‘004915010’ to index
So that is the encoded version of 49.1501.
Now if I do the following query, I should get this record back:

Person.ferret_index.search_each("latitude_string:['000000000' '099999999']")
=> 0

But I don’t, and I can verify that lexicographically, ruby sees
‘004915010’ as lying between ‘000000000’ and ‘099999999’:

'000000000' <= '004915010' and '004915010' <= '099999999'
=> true

But the query returns no results. I’ve tried a few more, as follows:

Person.ferret_index.search_each("latitude_string:(> '000000000')") do end
=> 7
Person.ferret_index.search_each("latitude_string:(< '099900000')") do end
=> 0

And so clearly it is not seeing that ‘004915010’ < ‘099999999’. If I
remove the quotes, it works properly, but the problem is then with the
negative values.

Person.ferret_index.search_each("latitude_string:(> -00000000)") do end
=> 0
Person.ferret_index.search_each("latitude_string:(> '-00000000')") do end
=> 7

So the quotes affect things, but then what if I need to search between
a negative value and a positive value.

Person.ferret_index.search_each("latitude_string:(< 099999999)") do end
=> 7
Person.ferret_index.search_each("longitude_string:(> '-00000000')") do end
=> 7
Person.ferret_index.search_each("latitude_string:['-00000000' 099999999]") do end
=> 0

For now should I just not be using range queries at all, and just
quote negative values? I’d have to do more testing to see if it’s
accurate, but it seems to be the only way that works…maybe I could
make all values positive by adding a constant to them all?

Any ideas why this is occurring? Am I doing this completely backwards,
is there an easier way to do the numeric comparisons? I’m very sorry
if this is an issue that has been discussed before, but I did look
through the archives and didn’t find anything…

–
Cheers,
Jordan F.
[email protected]

staypufd · July 18, 2006, 9:59pm

On 7/18/06, Jean-Etienne D. [email protected] wrote:

Jordan,

Why not use NumberTools::long_to_s to convert your numeric values
(indexing & search)?

Jean-Etienne

Well, because I am a fool, and did not notice this class that seems to
be exactly what I need.

–
Cheers,
Jordan F.
[email protected]

staypufd · July 18, 2006, 10:28pm

On 7/18/06, Jordan F. [email protected] wrote:

Well, because I am a fool, and did not notice this class that seems to
be exactly what I need.

–
Cheers,
Jordan F.
[email protected]

Actually, I spoke too soon. It appears that this class has the same
problem with negative numbers. For example:

Person.search("latitude_string:[00000000000000 0000000000nesr]").length
=> 7
Person.search("latitude_string:[-1y2p0ij321x6p 0000000000nesr]").length
=> 0

I’ve expanded my range, so shouldn’t the number of results be at least
what it was with all 0’s? I’ve tried with quotes too, and it doesn’t
help. Again though, if I do the following (note the quotes):

Person.search("latitude_string:(> '-1y2p0ij321x6p' AND < 0000000000nesr)").length
=> 7

It works…

So what I’ve done, is because I’m only working with longitudes and
latitudes, which are guaranteed to lie between -500 and 500, I’m just
adding 500 to them, to make them all positive, and then I can use the
range queries…and I wrote my own little number to string thing,
since I’m working with small values. But nevertheless I thank you for
your help.
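A guess at what such a "little number to string thing" might look like, as a sketch only: shift by 500 so nothing is negative, scale to a fixed number of decimal places, and zero-pad so string order matches numeric order. The constants, the `encode_coord` name, and the four-decimal precision are assumptions, not the code actually used here.

```ruby
OFFSET = 500    # shifts the -500..500 range to 0..1000
SCALE  = 10_000 # keep four decimal places
WIDTH  = 8      # (500 + 500) * 10_000 = 10_000_000 has 8 digits

# Encode a coordinate as a fixed-width, non-negative digit string.
def encode_coord(value)
  format("%0#{WIDTH}d", ((value + OFFSET) * SCALE).round)
end

coords = [49.1501, -123.1, 0.0, -0.5]
encoded = coords.map { |c| encode_coord(c) }

# String order now agrees with numeric order, negatives included.
puts encoded.sort == coords.sort.map { |c| encode_coord(c) }

# Building a range query for a bounding box, using the same field
# name and range syntax that appear earlier in the thread.
lo = encode_coord(-1.0)
hi = encode_coord(50.0)
query = "latitude_string:[#{lo} #{hi}]"
puts query
```

The same encoding has to be applied both when indexing and when building the query bounds, otherwise the comparison is meaningless.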

–
Cheers,
Jordan F.
[email protected]

staypufd · July 18, 2006, 9:37pm

Jordan,

Why not use NumberTools::long_to_s to convert your numeric values
(indexing & search)?

Jean-Etienne

staypufd · July 19, 2006, 3:41am

On 7/19/06, Jordan F. [email protected] wrote:

On 7/17/06, Gary E. [email protected] wrote:

So I index a bunch of documents, and see the following line of output:
Adding field latitude_string with value '004915010' to index
So that is the encoded version of 49.1501.
Now if I do the following query, I should get this record back:

>> Person.ferret_index.search_each("latitude_string:['000000000' '099999999']")
=> 0

irb(main):008:0> index.search("latitude:[000000000 099999999]").size
=> 1
irb(main):009:0> index.search("latitude:['000000000' '099999999']").size
=> 0

The quotes are getting tokenized with the terms, so the problem is that
"'099999999'" <= '004915010' (the quote character sorts before any digit).

Perhaps you already worked that out.
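The effect can be reproduced in plain Ruby without Ferret, assuming the quote characters really do end up inside the range bounds as described. The variable names are made up for illustration.

```ruby
term = "004915010"            # the indexed value
lower = "'000000000'"         # bounds with the quotes left in
upper = "'099999999'"

# The apostrophe (0x27) sorts before the digit "0" (0x30), so both
# quoted bounds compare lower than any all-digit term.
above_lower = lower <= term
below_upper = term <= upper
puts above_lower
puts below_upper
```

This matches the earlier results in the thread: the quoted "greater than" query matched every document, while the quoted "less than" query and the quoted range matched none.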

Dave

staypufd · July 19, 2006, 3:48am

On 7/19/06, Jordan F. [email protected] wrote:

help. Again though, if I do the following (note the quotes):
range queries…and I wrote my own little number to string thing,
since I’m working with small values. But nevertheless I thank you for
your help.

This seems like the best solution at the moment. I’d forgotten about
NumTools. It’s probably one of the first modules I ever wrote in Ruby.
Anyway, it looks like it might need an upgrade. I’ll try and fix it so
that it can handle negative numbers. In C this would be a no-brainer
but Ruby’s BigNums make it a little difficult. I might put the
challenge to the Ruby mailing list.

Cheers,
Dave