GeoQuery with acts_as_ferret involved

So, I’m working on a search engine of sorts that restricts results to
your local area. I can successfully return all entries within 15 miles
of a particular point, and I can successfully return all entries that
match a search query, but I’m having trouble combining the two together
and doing pagination on them.

Basically, for the range query, you do a SQL query that returns all
results within +/- 1 latitude/longitude from the point in question, and
then you do some spherical trig on each of those results to get only the
entries within X number of miles of the point.

And for the search query, so far, I’ve been using acts_as_ferret’s
find_by_contents method. But now I need to figure out how to take an
array of results from the range query, and only do the find_by_contents
magic on just the entries in that Array. So far, everything method I’ve
thought of looks like it’s going to have performance problems.

Any suggestions?

On Wed, Jan 31, 2007 at 03:21:31PM +0100, Bob A. wrote:

So, I’m working on a search engine of sorts that restricts results to
your local area. I can successfully return all entries within 15 miles
of a particular point, and I can successfully return all entries that
match a search query, but I’m having trouble combining the two together
and doing pagination on them.

Basically, for the range query, you do a SQL query that returns all
results within +/- 1 latitude/longitude from the point in question, and
then you do some spherical trig on each of those results to get only the
entries within X number of miles of the point.

wouldn’t it be possible to query the ferret index for all points withing
the
+/- 1 long/lat range? then you could combine the user’s search terms
with that
query, and afterwards do the calculations to filter out those points
outside
the x miles radius.

And for the search query, so far, I’ve been using acts_as_ferret’s
find_by_contents method. But now I need to figure out how to take an
array of results from the range query, and only do the find_by_contents
magic on just the entries in that Array. So far, everything method I’ve
thought of looks like it’s going to have performance problems.

In any case I’d first try what works, and then look at the performance
of different approaches with a realistical amount of data.

That aside, my first guess is that narrowing down the result set with
Ferret
before doing any spatial calculations would be a good idea.

Jens


webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

On Jan 31, 2007, at 6:21 AM, Bob A. wrote:

And for the search query, so far, I’ve been using acts_as_ferret’s
find_by_contents method. But now I need to figure out how to take an
array of results from the range query, and only do the
find_by_contents
magic on just the entries in that Array. So far, everything method
I’ve
thought of looks like it’s going to have performance problems.

Any suggestions?

Bob- I don’t know what would be involved in using this method with
aaf, but I am using this method for a bounding box geo query. If you
need a more strict radial search you can use a custom filter with
0.10.x.

During index population I’m doing:

doc << Field.new(“latitude”, latitude.to_f + 1000,
Field::Store::NO, Field::Index::UNTOKENIZED)
doc << Field.new(“longitude”, longitude.to_f + 1000,
Field::Store::NO, Field::Index::UNTOKENIZED)

and during the query I’m doing:

query << "latitude:[#{box[:lat_min] + 1000} #{box[:lat_max] + 1000}]
AND "
query << "longitude:[#{box[:lon_min] + 1000} #{box[:lon_max] + 1000}]
AND "

I have helper methods outside of this scope to handle the min/max
that I’m searching in. I also have a complete (yet untested)
GeoFilter, but we aren’t using the .10.x Ferret yet so I have no idea
if it actually works.

Michael-

I could go on for hours about this, but lemme see if I can summarize
what I
know.

Search engines aren’t really optimized to geographic queries (unless
they
have a built-in geographic index feature). You can make them work, but
it’s
gonna be a hack.

Several things to try:

  • What you described already, the lat / lon box idea. The issue with
    that
    is that it selects the lat first, the lon second. By doing so, it’s
    like
    taking a vertical or horizontal stripe of the country and sticking it in
    a
    temporary result set. That temporary result set can be HUGE, especially
    on
    the US East Coast. Which is why this is slow.
  • You could, instead, select all the zip codes or postal codes in the
    area,
    first, then do your precise calculations . That will be fast if and
    only if
    Ferret can handle that many query terms at once. Most search engines
    can’t,
    really, but still, this is usually a bit faster than the first solution.
    The hard part here is computing the set of zip codes you want in the
    first
    place.
  • You could limit individual queries to greater metro areas, first, then
    do
    your precise calculations. Two issues here: one, getting that data;
    two,
    those areas have borders, so getting coverage is difficult at best.

Faster, still, is doing the zip code coverage area solution in a
database.
The reason this’d be faster is that you can take your set of zip codes
covering your search area, and join them to the table in question.
Joins
are much faster than a list of search query terms.

Like I said, the true solution is geographic indexes. Unfortunately,
Ferret
doesn’t have them. Maybe Lucene does? It’s possible to fake geographic
indexes in a non-geographic engine, but it’s really nasty math; I’d only
recommend that if you need to bleed every last ounce of speed out of it.

Schnitz

Oh, and one bit of advice on lat/lon boxes -

The order the engine performs the lat & lon query - either lat then lon,
or
lon then lat - matters. Try swapping the terms in the queries. You
want
the engine to select the vertical stripe (lon between x and y) first in
most
of the US, because that has the least chance of picking up other major
metro
areas. If you’re really slick, though, you’ll switch it up depending on
the
area of the country, because that vertical stripe is murder on certain
parts
of the East Coast.

(This all assumes you’re in the US, but the principle is the same
wherever
you are - check your map!)

Schnitz

That’s a good plan. Beware the number of search terms, though.
Typically
search engines slow down pretty quick the more you add search terms onto
the
query. It creates a lot of merging work on the back-end of the query
execution.

Schnitz

wouldn’t it be possible to query the ferret index for all points withing
the
+/- 1 long/lat range? then you could combine the user’s search terms
with that
query, and afterwards do the calculations to filter out those points
outside
the x miles radius.

Hadn’t actually thought of having ferret index the lat/long points.
However, I’m pretty sure I don’t want to go down that route.

However, while I was waiting for an answer, I think I figured out how to
do this query in the fastest way possible. In this case, it would be
optimized specifically for my app, and for this one query, but that’s
fine by me because the app only has one thing you ever search against.

The way things are set up, you have a single location in the DB for each
subscriber. Each location can have multiple entries. The geoquery is
actually done against the locations, not the entries. So there’s a LOT
fewer of them. You might have 10-15 locations in a city, max, but each
location could have hundreds of entries. So you do the geoquery only
against the locations, and you get an array of ids for locations within
range of the point back. Now you just change the indexed entries to
also include the ids of the locations they’re associated with. When you
do the search, you include the list of ids that it can match against as
part of the query.

That should be possible, correct?

I haven’t sat down and worked out exactly what the ferret query syntax
would look like for that though.