Hi,
I’m looking for some validation of work I’ve done for a client. I’m
open to criticism (“mock me”? ;^), pointers to similar projects, and
alternatives.
When I looked around in about September 2007 for a good, scalable search
solution for Ruby on Rails, I found the choices lacking. Firstly, none of
the solutions seemed to offer a way to keep the reverse indices in
memory, spread across as many machines as I might like to store them on.
Secondly, many of the solutions seemed too general-purpose and
heavyweight for my client’s needs (which are basically to search for
items from the db, based on tags). But without addressing the first
concern, I felt that anything I implemented would not scale to the
customer’s needs and aspirations, and that for such an investment,
virtually unlimited scale would be mandatory.
Therefore I looked at memcached - well proven for caching on many
large-scale sites, but to my knowledge not used in search. Note that
memcached uses an approach wherein each client calculates which server
to use from the key itself, so no central (scale-limiting) controller is
required. Having chosen memcached, I next tried various memcached
connectors for RoR. I found them at the time (Oct 2007 or so) to be slow
and buggy; it didn’t take more than a couple of incidents of totally
corrupting the entire cache to turn my attention away from a Ruby
approach to using memcached. Meanwhile, I knew from prior experience
that the Python client for memcached was both fast and reliable. The
Python memcached client was routinely 3x faster in the tests I ran.
Python also seems to be quite fast at set operations.
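To make the idea concrete, here is a minimal sketch (not my production
code) of a tag index sharded the way a memcached client shards keys. The
class name ShardedTagIndex, the "tag:" key prefix, and the md5-modulo
server choice are all assumptions for illustration, and plain dicts stand
in for the memcached servers; real clients use their own hash functions.

```python
import hashlib


class ShardedTagIndex:
    """Toy reverse index: one dict per 'server' stands in for memcached."""

    def __init__(self, num_servers=3):
        self.servers = [{} for _ in range(num_servers)]

    def _server_for(self, key):
        # Every client derives the server from the key alone, so no
        # central controller is needed to locate an index entry.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]

    def add(self, item_id, tags):
        # Record the item under each of its tags (tag -> set of item ids).
        for tag in tags:
            key = "tag:" + tag
            self._server_for(key).setdefault(key, set()).add(item_id)

    def search(self, tags):
        # Fetch one id set per tag, then intersect them: items must
        # carry every requested tag to match.
        keys = ["tag:" + tag for tag in tags]
        id_sets = [self._server_for(k).get(k, set()) for k in keys]
        if not id_sets:
            return set()
        return set.intersection(*id_sets)


index = ShardedTagIndex(num_servers=3)
index.add(1, ["ruby", "rails"])
index.add(2, ["ruby", "python"])
index.add(3, ["python"])
print(sorted(index.search(["ruby", "python"])))
```

One caveat with this simple modulo scheme: changing the number of servers
moves most keys to a different server, so the index would need to be
rebuilt or restored after resizing.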
Getting to the punchline, I used Python and memcached, wrapped in
Twisted, to provide a RESTful web service API, which is called from RoR
to get ALL of the information needed to render search results. The API
has been extended to allow the Ruby code to “fire and forget” new
indexing info onto a deque (FIFO queue), which is processed by a
loosely coupled daemon - overhead to Ruby is about 20ms.
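The fire-and-forget part can be sketched roughly as follows. This is a
simplified illustration using only the standard library: a background
thread stands in for the separate daemon, there is no Twisted or HTTP
layer, and the names IndexQueue, submit, and close are made up for the
example.

```python
import threading
from collections import deque


class IndexQueue:
    """FIFO queue of indexing jobs drained by a background worker."""

    def __init__(self, handler):
        self._queue = deque()
        self._handler = handler
        self._cond = threading.Condition()
        self._closed = False
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, job):
        # The caller only pays for an append; indexing happens later.
        with self._cond:
            self._queue.append(job)
            self._cond.notify()

    def close(self):
        # Let the worker drain whatever is still queued, then stop it.
        with self._cond:
            self._closed = True
            self._cond.notify()
        self._worker.join()

    def _run(self):
        while True:
            with self._cond:
                while not self._queue and not self._closed:
                    self._cond.wait()
                if not self._queue:
                    return  # closed and fully drained
                job = self._queue.popleft()  # oldest job first (FIFO)
            self._handler(job)  # run outside the lock


processed = []
queue = IndexQueue(processed.append)
queue.submit({"item_id": 1, "tags": ["ruby", "rails"]})
queue.submit({"item_id": 2, "tags": ["python"]})
queue.close()
print(processed)
```

The point of the design is that the submitting side never waits for the
index update itself, only for the job to be accepted.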
Prior to this approach, the client was using MyISAM full text search.
Searches took about 10s for less common search terms (5000 uses), and
20+s for more common search terms (100k+ uses).
With the web service, the search results are routinely returned in 1-2
seconds, and the web service itself returns results to RoR within
100-200ms. Indexing is a challenge - the rank score needs to be updated
upon each viewing, but I’ve now gotten that to be almost real-time (5
minutes max). Plus I can re-index the entire database of 1M+ items in
about 8 hours. The index is backed up nightly in case of a memcached
server failure (we’re using 3). In addition to search, the search web
service is used for relatedness and for something like bookmarks.
So, is there anything out there that can touch these results and provide
for virtually unlimited scale (no central controller)?
Thanks in advance,
Marc
PS: Because of leaks in RMagick and its inferior performance compared to
the Python Imaging Library, I’m also considering a similar approach for
generating many different sizes of fairly large (10MB) images. A similar
fire and forget web service approach could be used to minimize the
impact on the RoR side. Early tests show a 10x speed improvement (even
without the fire and forget). Any thoughts there?