Multithreading / multiprocessing woes

I’ve been running some multithreaded tests on Ferret. Using a single
Ferret::Index::Index inside a DRb server, it definitely behaves for me
as if all readers are locked out of the index when writing is going on
in that index, not just optimization – at least when segment merging
happens, which is when the writes take the longest and you can
therefore least afford to lock out all reads. This is very easy to
notice when you add, say, your 100,000th document to the index, and
that one write takes over 5 seconds to complete because it triggers a
bunch of incremental segment-merging, and all queries to the index
stall in the meantime. Or when you add your millionth document, which
can stall all reads for over a minute. :frowning:
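For reference, the shape of this test — one shared index object behind a DRb server, hit by a writer and several searchers — can be sketched with stdlib DRb alone. A plain Array stands in for Ferret::Index::Index below so the sketch runs without the ferret gem; the real calls would be `@index << doc` and `@index.search(query)`:

```ruby
# Minimal sketch of the setup under test: one shared index object served
# over DRb (Ruby's stdlib RPC). A plain Array stands in for
# Ferret::Index::Index so this runs without the ferret gem.
require 'drb/drb'

class IndexService
  def initialize
    @docs = []                 # stand-in for Ferret::Index::Index.new(:path => ...)
  end

  def add(doc)
    @docs << doc               # in Ferret, this write can trigger a segment merge
  end

  def search(term)
    @docs.select { |d| d.include?(term) }
  end
end

DRb.start_service('druby://127.0.0.1:0', IndexService.new)
client = DRbObject.new_with_uri(DRb.uri)
client.add('hello world')
puts client.search('hello').size  # prints 1
```

In the actual test the `add` calls come from one client while other clients time their `search` calls; with a single Ferret index behind the server, the long merges show up as multi-second stalls in those timings.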

When I try to use an IndexReader in a separate process, things are
even worse. The IndexReader doesn’t see any updates to the index
since it was created. Not too surprising, but if I try creating a new
IndexReader for every query, and have the Index in the other writing
process turn on auto_flush, then the reading process crashes after a
few (generally fewer than 100) queries, in one of at least two
different ways selected apparently at random:

Failure Mode #1:

script/ferret_speedtest2_reader:30:in `initialize’: IO Error occured
at <except.c>:93 in xraise (IOError)
Error occured in index.c:901 - sis_find_segments_file
Error reading the segment infos. Store listing was

from script/ferret_speedtest2_reader:30:in `new'
from script/ferret_speedtest2_reader:30:in `run_test_query'

[Yes, there really are two blank lines after “Store listing was”.]

Failure Mode #2:
script/ferret_speedtest2_reader:30:in `initialize’: IO Error occured
at <except.c>:93 in xraise (IOError)
Error occured in fs_store.c:127 - fs_each
doing ‘each’ in
/Users/scott/dev/ruby/timetracker/tmp/ferret_speedtest_index:

from script/ferret_speedtest2_reader:30:in `new'
from script/ferret_speedtest2_reader:30:in `run_test_query'

Meanwhile, if I try eliminating this second failure mode by explicitly
calling close on the IndexReader
before I throw it away, the close immediately crashes with:

script/ferret_speedtest2_reader:45: [BUG] Bus Error
ruby 1.8.6 (2007-03-13) [i686-darwin8.8.5]

Abort trap

Given the combination of problems above, I’m at a loss to understand
how to use Ferret on a live website that requires reasonably fast
turnaround between a user submitting data and the user being able to
search over that data, unless either (1) the site only gets a few
thousand new index entries per day and the site can be taken down for
a few minutes daily to optimize the index, or (2) it’s OK for the
entire site to periodically stall on all queries for seconds or even
minutes whenever segment-merging happens to kick in.

Do all Ferret users just suck it up and live with one of these
limitations, or am I missing something and/or just getting “lucky”
with the errors above?

For reference, the system being used here is a Mac running Leopard,
although I doubt that matters…

Scott,

Do all Ferret users just suck it up and live with one of these
limitations, or am I missing something and/or just getting “lucky”
with the errors above?

The limitations you’re talking about are known and will be fixed
in the near future… the trick is to have one read-only and one
write-only index… This is currently being worked on. If you need
a fix right now, you’ll have to do it yourself, but you can take a look
at omdb’s code and how it’s done there:

http://bugs.omdb.org/browser/branches/2007.1/lib/omdb/ferret/lib/util.rb
(see the switch code)

If you don’t need a fix right now, I’m sure AAF will come up with
a solution for that in the near future (aka probably not this year).

On a side note… for the “too many open files” error, see:

http://ferret.davebalmain.com/api/classes/Ferret/Index/IndexWriter.html
(use_compound_file, you may have set this to false) or simply increase
the number of open files. On omdb we’re running with 32k :slight_smile:

[email protected] ~ $ ulimit -n
32768

Cheers
Ben

Hi Ben –

Thanks much for the quick and helpful reply! Unfortunately, the
solution you’re using on omdb looks suspect to me, for the same reason
that Alex Neth brought up a few days ago on this list: to my knowledge
there’s no guarantee that rsync will produce a coherent snapshot of
the source directory as it was at any one particular instant in time.
In fact, I don’t see how rsync could both always terminate in finite
time and provide such a guarantee, except on exotic filesystems that
provide, say, atomic snapshots with copy-on-write capabilities.
(Sigh…sometimes I miss the Google File System.) In which case you’d
have to disable your site during the rsync in order to prevent
corruption, which basically boils down to the “must take site offline
daily for a few minutes to deal with this problem” limitation. I’m
guessing the rsync is faster than an index optimization, so I guess
this might at least cut down on the amount of time the site has to be
down, but still…wah.

Am I a fool for wondering whether it might ultimately be less painful
to try an index server that runs Lucene under a JRuby process?

Scott,

we’re using two directories for Ferret, not one. One
index is the passive index. It is not used for searches,
but new indexing requests will be added to that index,
so let’s call it the indexing-index.

All mongrels will use the second directory; let’s call it the
searching-index. Both indexes are almost identical;
I’ll explain the differences.

All our indexing requests are queued. So whenever
you want to index something, it will be placed in the
queue and added to the indexing-index. After a
certain number of queue items have been added to the index,
we stop indexing. The queue will be halted.
New requests can be added, but nothing will be
added to the indexing-index.

Now we’re rsyncing the indexing-index to all machines.
Remember, searching is still done in the searching-index,
which is outdated, but we don’t mind that :slight_smile:

After the rsync is complete, we switch the two directories,
so the indexing-index becomes the searching-index and
vice versa. Actually we’re just switching symlinks, so
this will take almost no time. And even if one of the
mongrels still has a filehandle to the old index open,
nothing bad will happen; it is still using the outdated index,
but the next request will use the new index. After that,
the new indexing-index will be synced from the
searching-index. As the searching-index is read-only,
there is no risk of corrupting anything during the
sync.

Now we resume processing the queue, until we’ve
added the set number of queue entries, or the queue
is empty.

The downside is that the searching-index is outdated,
but by no more than a couple of minutes (about 2 minutes
on omdb). We haven’t had one corrupted index since.
There is no downtime whatsoever, and the rsync snapshot
will always be coherent.
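If it helps anyone, the symlink switch described above can be sketched in a few lines of plain Ruby. Directory and link names here are made up; the omdb util.rb linked earlier is the real thing:

```ruby
# Sketch of the switch: two real index directories, with `searching` and
# `indexing` symlinks swapped via rename(2), which replaces a link
# atomically - a mongrel resolving `searching` mid-switch still sees a
# complete, consistent directory.
require 'fileutils'
require 'tmpdir'

class IndexSwitcher
  def initialize(base)
    @base = base
    %w[index_a index_b].each { |d| FileUtils.mkdir_p(File.join(base, d)) }
    link('searching', 'index_a') unless File.symlink?(path('searching'))
    link('indexing',  'index_b') unless File.symlink?(path('indexing'))
  end

  # Swap the two symlinks: the indexing-index becomes the
  # searching-index and vice versa.
  def switch!
    old_search = File.readlink(path('searching'))
    old_index  = File.readlink(path('indexing'))
    link('searching.tmp', old_index)
    File.rename(path('searching.tmp'), path('searching'))  # atomic replace
    link('indexing.tmp', old_search)
    File.rename(path('indexing.tmp'), path('indexing'))
  end

  def path(name)
    File.join(@base, name)
  end

  private

  def link(name, target)
    FileUtils.ln_sf(target, path(name))
  end
end

Dir.mktmpdir do |dir|
  sw = IndexSwitcher.new(dir)
  sw.switch!
  puts File.readlink(sw.path('searching'))  # prints "index_b"
end
```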

Cheers
Ben

Ben –

Thanks for the detailed explanation! Yes, that does make sense. If I
understand it correctly, though, something won’t show up in a search
until at least one index switch happens after it’s been submitted,
which means we’re talking about a minute or so on average (not just
worst-case) from submission to search result, even if the switches are
being done constantly (given that each switch takes about two
minutes). For my site, I’m really hoping that most content will show
up within a second or so of its submission. That simply can’t happen
if I’m not updating the same index I’m doing searches with. I’d be OK
with the turnaround occasionally being a minute – say, while an
index optimization or particularly large segment merge happens. But
so far it looks to me like the choices with Ferret are either:

(1) The average time from submission to search result is on the
order of minutes. However, searches are always reasonably fast.
(Your approach.)

(2) The average time from submission to search result is less than a
second. However, the worst-case times can be minutes, and now all
searches stall over those minutes as well, which is Bad. If you
don’t get more than a few thousand submissions per day, you can at
least schedule these outages as nightly index optimizations, but
you’ll have the outages one way or another. (All “same index used for
reading + writing” approaches.)

I don’t think either of these choices is very good for the particular
site I have in mind (at least if I’m being optimistic enough about its
chances of “taking off” to worry about the possibility of many
thousands of submissions / day). Am I correct in my summarization of
the two choices with Ferret here, or have I missed something?

Anyhow, thanks again! If those two options are in fact what I have, I
think I’ll run some tests with Lucene/JRuby to see whether that
provides a third option as far as performance goes, and report back
what sort of issues come up. (My guess is that it’ll be moderately
painful to set up and that the average throughput will be worse than
Ferret’s, but that an average submission-to-search-result turnaround
time of a second or two will be achievable without the site
necessarily going completely down for minutes every now and then.
We’ll see.)

– Scott

On Nov 16, 2007, at 3:35 PM, Scott D. wrote:

Am I a fool for wondering whether it might ultimately be less painful
to try an index server that runs Lucene under a JRuby process?

Or, rather, an index server that runs Solr accessed with a pure Ruby,
solr-ruby, API (which works with MRI or JRuby)? :slight_smile:

Erik

Hmmm…I’d first heard of Solr only a couple of days ago, and I hadn’t
been aware of a Ruby API to it until you mentioned it.
Interesting…thanks!

Hi!

On Fri, Nov 16, 2007 at 02:56:26AM -0800, Scott D. wrote:

can stall all reads for over a minute. :frowning:

Don’t get me wrong, but how often do you think you’ll add your millionth
document to the index?

And even if you really do index a million documents per week - I
wouldn’t exactly call it bad performance if one or two search requests
per week take a minute to complete, while all others are completed in
less than a second…

Having that said, the problem with blocking searches might be possible
to solve by not using Ferret’s Index class for searching/indexing, but
using the lower level APIs (Searcher and IndexWriter) and doing manual
synchronization (inside one process). I didn’t feel the need to
implement this for aaf (yet ;-), since I think it’s already fast enough
to not be the bottleneck in most real world usage scenarios (say -
typical Rails apps using aaf for full text search).
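A rough sketch of the single-process pattern suggested here: one writer, manual synchronization, and a searcher that is reopened only when the index has actually changed. An Array stands in for the Ferret IndexWriter/Searcher pair so the sketch runs on its own; with the real classes you would poll the index version (Ferret readers expose one) in the same way.

```ruby
# Sketch of "one process, lower-level APIs, manual synchronization":
# writes are serialized by a Mutex, and searches reuse a cached
# "searcher" snapshot, reopening it only when the version has changed.
# The Array is a stand-in for the real IndexWriter/Searcher objects.
class SearchableIndex
  def initialize
    @write_lock = Mutex.new
    @docs       = []        # stand-in for the on-disk index
    @version    = 0         # real code: the index version reported by the reader
    @snapshot   = [[], 0]   # [searcher stand-in, version it was opened at]
  end

  def add(doc)
    @write_lock.synchronize do
      @docs << doc          # real code: @writer << doc (may take a while on a merge)
      @version += 1
    end
  end

  # Reopen the "searcher" only when the index changed; the actual
  # matching then happens outside the write lock, so a long merge in
  # another thread doesn't stall every query for its whole duration.
  def search(term)
    searcher, seen = @snapshot
    if seen != @version
      @write_lock.synchronize { @snapshot = [@docs.dup, @version] }
      searcher = @snapshot[0]
    end
    searcher.grep(/#{Regexp.escape(term)}/)
  end
end

idx = SearchableIndex.new
idx.add('segment merging stalls reads')
p idx.search('merging')
```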

When I try to use an IndexReader in a separate process, things are
even worse. The IndexReader doesn’t see any updates to the index
since it was created. Not too surprising, but if I try creating a new
IndexReader for every query, and have the Index in the other writing
process turn on auto_flush, then the reading process crashes after a
few (generally fewer than 100) queries, in one of at least two
different ways selected apparently at random:

[…]

Stick to the one-process-per-index rule to be on the safe side.

Given the combination of problems above, I’m at a loss to understand
how to use Ferret on a live website that requires reasonably fast
turnaround between a user submitting data and the user being able to
search over that data, unless either (1) the site only gets a few
thousand new index entries per day and the site can be taken down for
a few minutes daily to optimize the index, or (2) it’s OK for the
entire site to periodically stall on all queries for seconds or even
minutes whenever segment-merging happens to kick in.

I wouldn’t set the limit at a few thousand new documents per day, and
optimizing daily is only useful if you have lots of document
deletions per day.

Cheers,
Jens

PS: If you happen to benchmark Solr against aaf’s DRb server, be sure to
let us know your findings :slight_smile:


Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

Hi everyone!

This is a very interesting thread, because it raises the question as
to whether Ferret is something you would want to use in a production
environment - or not.

I’ve been using Ferret in two applications and my experiences were
quite disappointing. I chose Ferret because it’s fast and it’s got a
Ruby API. Everything else about it is just annoying and potentially
hazardous.

What worries me most is the fact that Ferret is effectively an
abandoned project. The original author, who is the sole owner of the
code, hasn’t been posting to this list for about six months. He hasn’t
introduced any improvements in about the same period of time and many
bugs still remain unfixed. New bugs can’t be submitted (let alone
patches) because the project Trac is offline.

There is no other component in my applications that behaves as badly
as Ferret. If you don’t treat it very carefully, it will throw
segfaults as if this were an established way of indicating an error
condition.

The ActsAsFerret plugin does treat Ferret quite carefully, and it’s
the only reason many people are able to use Ferret at all.
However, AAF is one approach, and for some applications it might not be
the right one. Especially if you want to put multiple models in one
index - it’s possible, but not really a flexible solution.

The most sensitive point of Ferret is concurrency and many people
actually use Ferret in distributed environments (which is usually a
Rails app that scales across several machines). AAF introduces a DRb
server to work around this problem, but with many concurrent read/
write requests, performance quickly degrades.

With the advent of JRuby, a myriad of Java-based solutions is now
accessible to Ruby developers, including many full-text indices. There
are very mature solutions readily available for production use and
many next-generation search engines currently in development.

For the next application that needs full text search, I’m most
definitely not going to use Ferret. I agree with Erik and give Solr a
shot.

I would like to encourage everyone who is already using another
full-text index for Ruby/Rails to share their experiences on this
list, because I have the feeling that many people would like to get
rid of Ferret for exactly the same reasons I’ve pointed out above.

    Andy

Andy,

You asked about other full text indexes for Ruby/Rails. I am using both
AAF/Ferret and Sphinx in my app.

I haven’t had any problems with Ferret or acts_as_ferret so far. I am
using the DRb server and it is being hit with 200-250,000 requests a day
from dozens of clients (Mongrel instances). My index isn’t huge - it is
about 600 MB.

I’m using Sphinx (http://www.sphinxsearch.com/) wherever I don’t need
realtime updates. A large portion of my site requires search indexes to
be always up-to-date, but in many places I can live with an index that
may be 5 minutes old. Sphinx trades realtime indexing for performance -
both search and indexing speed are blazingly fast. Sphinx comes with a
server component that speaks a simple protocol, and there are several
Rails plugins available.

Sphinx (and acts_as_sphinx or whatever plugin you choose) and
acts_as_ferret are very different animals, but I’m very pleased with the
combination.

Casey

Hi!

On Sun, Nov 18, 2007 at 10:24:34AM -0500, [email protected] wrote:

Andy,

You asked about other full text indexes for Ruby/Rails. I am using both
AAF/Ferret and Sphinx in my app.

I haven’t had any problems with Ferret or acts_as_ferret so far. I am
using the DRb server and it is being hit with 200-250,000 requests a day
from dozens of clients (Mongrel instances). My index isn’t huge - it is
about 600 MB.

ah, glad to see somebody for whom everything just works standing up
and telling the world :slight_smile:

On Sun, 18 Nov 2007, Andreas K. wrote:
[…]

What worries me most is the fact that Ferret is effectively an
abandoned project. The original author, who is the sole owner of the
code, hasn’t been posting to this list for about six months. He hasn’t
introduced any improvements in about the same period of time and many
bugs still remain unfixed. New bugs can’t be submitted (let alone
patches) because the project Trac is offline.

Trac is online again for days, and Ferret even got a new logo :slight_smile: I
wouldn’t call it abandoned, it’s just stabilizing.

There is no other component in my applications which behaves as badly
as Ferret. If you don’t treat it very carefully it will throw
segfaults as if this was an established way of indicating an error
condition.

The ActsAsFerret plugin does treat ferret quite carefully and it’s
the only reason why many people are able to use Ferret at all.
However, AAF is one approach and for some applications it might not be
the right one. Especially if you want to put multiple models in one
index - it’s possible, but not really a flexible solution.

Well, even if aaf doesn’t fit your needs, you might at least have a look
at it if you want to know how to treat your Ferret well :slight_smile: I admit it
isn’t always an easy library to deal with, but with a proper set of unit
tests it’s entirely possible and no headache at all. Imho.

The most sensitive point of Ferret is concurrency and many people
actually use Ferret in distributed environments (which is usually a
Rails app that scales across several machines). AAF introduces a DRb
server to work around this problem, but with many concurrent read/
write requests, performance quickly degrades.

AAF’s DRb server can handle some serious load as it is now, but for sure
there’s much room for improvement. However, I didn’t receive many
complaints from people actually having this problem in real-life
applications yet. Most of the time this is brought up as some kind of
‘what if’ problem. Somebody did a speed comparison of Solr and aaf/DRb a
while back, where aaf was at least as fast as Solr, with its
admittedly naive DRb server.

I’m not saying this was a representative benchmark or anything, but
those are the only numbers I know of…

So please, from now on, anybody who feels like calling aaf’s DRb slow,
please show us some numbers and the test process that led to
those numbers.

Ideally you’d also show us the numbers of any solution you’ve found to
be faster solving the same problem. Thanks.

With the advent of JRuby, a myriad of Java-based solutions is now
accessible to Ruby developers, including many full-text indices. There
are very mature solutions readily available for production use and
many next-generation search engines currently in development.

For sure. I’m excited by these possibilities as well.

Cheers,
Jens


Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

On 18.11.2007, at 18:51, Jens K. wrote:

Trac is online again for days, and Ferret even got a new logo :slight_smile: I
wouldn’t call it abandoned, it’s just stabilizing.

Yes, I noticed that. I should have checked before posting. However, a
project site that is frequently down for extended periods of time does
not exactly build trust :slight_smile:

AAF’s DRb server can handle some serious load as it is now, but for sure
there’s much room for improvement. However, I didn’t receive many
complaints from people actually having this problem in real-life
applications yet. Most of the time this is brought up as some kind of
‘what if’ problem.

My apologies for implying that AAF is part of the problem. It
certainly isn’t. I made the mistake of mixing up my concerns about
Ferret with comments on AAF. What I actually meant to say is that AAF
is one viable way to deal with some of Ferret’s shortcomings.

The fact that in the Rails community AAF is almost synonymous with
Ferret speaks for your plugin and I’m not in a position to question
that.

So please, from now on, anybody who feels like calling aaf’s DRb slow,
please show us some numbers and the test process that led to
those numbers.

Again, I wasn’t trying to blame AAF here.

To be more precise: Ferret is pretty damn fast. The problem is its
extremely sensitive API which exposes problems from the C
implementation to the Ruby developer. I don’t know of any way to catch
a segfault in Ruby, and even if I did, there’s little I can do about
it from Rubyland.

Without transactional index updates, such behavior is intolerable,
unless you can afford to rebuild your index several times a day. This
leaves us building another Ruby API on top of Ferret’s in order to
compensate for these imperfections.

I wrote a custom solution with a focus on reliability. But with all
the infrastructure built around Ferret (DRb server, transactions,
queuing), the overall indexing performance wasn’t that great anymore:
Remote indexing with 10 concurrent clients was 8-9 times slower than
local indexing.

Maybe AAF is faster, but since the implementations are different,
there’s no point in comparing them directly.

      Andy

On 18.11.2007, at 16:24, [email protected] wrote:

plugins available.

Thanks, Casey. I’ll take a look at Sphinx. Since I’m primarily
concerned about index consistency and don’t mind short delays either,
it sounds like a pretty good alternative.

      Cheers,
      Andy

On Nov 18, 2007, at 7:05 AM, Andreas K. wrote:

What worries me most is the fact that Ferret is effectively an
abandoned project. The original author, who is the sole owner of the
code, hasn’t been posting to this list for about six months. He hasn’t
introduced any improvements in about the same period of time and many
bugs still remain unfixed.

I have a large fraction of the expertise needed to maintain the C
part of the Ferret code base, FWIW. What I’m missing is significant
Ruby expertise, which I wouldn’t mind accumulating. :slight_smile:

If what’s needed is C-level bug fixing, I can probably help out.

New bugs can’t be submitted (let alone
patches) because the project Trac is offline.

I know it’s been down before, but <http://ferret.davebalmain.com/trac>
looks like it’s up to me, now. Also, I see a commit from Dave
bumping the version to 0.11.5 yesterday.

The C code base that I am currently working on, which has a
foundation designed by Dave and me to be shared by multiple host
languages, is going to wind up having Ruby bindings eventually. It
will either happen as part of the Lucy project, or independently.

In the meantime, perhaps I can contribute to Ferret in a caretaker/
troubleshooter role. Dave gave me commit access to the repository a
while ago, and I just verified that I still have it.

Marvin H.
Rectangular Research
http://www.rectangular.com/

On Nov 17, 2007, at 5:12 AM, Scott D. wrote:

Hmmm…I’d first heard of Solr only a couple of days ago, and I hadn’t
been aware of a Ruby API to it until you mentioned it.
Interesting…thanks!

I’ve honestly given fairly little of my time to Ferret, though I have
tinkered with it some and it is mighty fine!

Believe you me, I don’t want to steal any thunder from Ferret. And
I’ve not compared/contrasted them much myself. Truth be told, I’m
still a Java dude, and knowing that Lucene and Solr are in Java and
excel at what they are designed to do, and already gulping the Apache
Kool-Aid, I really dig Solr.

I’ve presented solr+ruby a couple of times now, once at RailsConf and
then again a few weeks ago at rubyconf.

RailsConf:
http://www.ehatchersolutions.com/~erik/SolrOnRails.pdf

rubyconf:
http://code4lib.org/files/solr-ruby.pdf

acts_as_solr as it exists today is sub-optimal compared to
acts_as_ferret. I’m quite admittedly not much into relational
databases so I have only tinkered in this area myself.

Erik

For the record, while Lucene is pretty well-behaved as far as I can
tell, DRb running under JRuby is not. When hit with multiple request
streams simultaneously, DRb under JRuby 1.0.2 very quickly falls over
and stops responding to all queries. DRb under JRuby 1.1b1 almost
works, but every now and then JRuby will freak out and for a few
requests things will fail in very strange ways. (Attempts to
construct Java objects will fail with exceptions such as “undefined
method `constructors' for nil:NilClass” or “undefined method
`java_class' for Class:Class”; sometimes looking up a class will
fail…)

On the plus side, I do get the impression that JRuby development is
pretty active, and I see some concurrency bugs listed as high-priority
for JRuby 1.1, some of which have already been patched in the trunk.
My guess is that JRuby+Lucene+DRb will be a fine choice in a few
months…it was actually pretty painless to set up, even with MRI Ruby
RoR clients talking to a JRuby indexing server. (I have a simple
metaprogramming hack that lets the client specify a sequence of code
to execute on the server side, where the specification looks almost
like normal Ruby code; this effectively lets me easily construct
gnarly Lucene query trees in MRI Ruby clients that know nothing about
Lucene or Java. I actually initially came up with this hack to work
around Ferret’s “query trees and filters don’t marshal” issue.)
JRuby’s not ready for serious use in scenarios with concurrency just
yet, though.
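For the curious, the “client specifies a sequence of code to execute on the server side” hack can take roughly this shape (all names below are hypothetical, not the actual code): ship a snippet of Ruby source over DRb and instance_eval it in a context object that knows how to build queries.

```ruby
# One plausible shape for the trick: the MRI client ships Ruby source
# over DRb, and the server instance_evals it inside a small DSL context.
# In the real server the context methods would build Lucene/Ferret query
# objects; here they just record what was asked for.
require 'drb/drb'

class QueryContext
  def term(field, value)
    { :term => [field, value] }
  end

  def all(*clauses)
    { :and => clauses }
  end
end

class QueryServer
  def run_query(src)
    spec = QueryContext.new.instance_eval(src)  # build the query tree
    spec                                        # real code: searcher.search(spec)
  end
end

DRb.start_service('druby://127.0.0.1:0', QueryServer.new)
client = DRbObject.new_with_uri(DRb.uri)
p client.run_query('all(term(:title, "ferret"), term(:body, "drb"))')
```

Since this evals client-supplied source on the server, it is only sane between trusted processes on a private network.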

Meanwhile, I’m hoping to avoid Solr because it seems (1) kind of
complicated for what I’d actually get out of it in my particular
application, (2) not particularly well-documented given its size, and
(3) likely to get in my way when I want to do anything low-level and
gnarly with Lucene.

I guess I’ll continue limping along with Ferret for the moment and
hope the concurrency issues get worked out soonish. Has anyone
actually decided specifically to make Ferret bulletproof in the face
of concurrency over the next few months, or is it probably just not
going to happen? If it doesn’t, I suspect Ferret will probably fall
by the wayside as more Ruby people jump ship for Lucene-based
solutions. Which would be a shame, because Ferret does hold a lot of
promise…indexing is hard, and Ferret is almost a great solution.
(Too bad the last 20% is usually 80% of the work…)

– Scott

On Nov 21, 2007, at 2:53 PM, Scott D. wrote:

My guess is that JRuby+Lucene+DRb will be a fine choice in a few
months…

Definitely not a bad choice. However I still implore you to give
Solr another chance. More on that…

Meanwhile, I’m hoping to avoid Solr because it seems (1) kind of
complicated for what I’d actually get out of it in my particular
application

How so? It’s a “search server” with the same goals that I imagine
you’d have for the JRuby+Lucene+DRb combination.

It’s not really complicated, especially with the solr-ruby library.
Add documents, delete them, query for them. Leverage highlighting
and more-like these features, dismax querying, etc.

, (2) not particularly well-documented given its size

Wow. Have you seen the Solr wiki? Apache Solr Wiki - Solr - Apache Software Foundation -
there are nooks and crannies documented on that wiki that go well
beyond what I’d consider good documentation.

By all means point me to areas that aren’t documented that you need
to know (off list) and I’ll get those taken care of.

(3) likely to get in my way when I want to do anything low-level and
gnarly with Lucene.

Maybe, but not much in your way. You’d have to wrap your low-level
mojo inside some Solr API perhaps, but not even that if we’re just
talking about custom analyzers or a similarity implementation.

Which would be a shame, because Ferret does hold a lot of
promise…

hear hear! I definitely extend major kudos to Dave and the other
Ferret contributors. Great stuff.

Erik

On Nov 21, 2007 12:24 PM, Erik H. [email protected]
wrote:

How so? It’s a “search server” with the same goals that I imagine
you’d have for the JRuby+Lucene+DRb combination.

It’s a bit more than I need right out of the gate, what with the
caching, replication, faceted search, etc. Of course, that might not
be a problem if it uses sensible configuration defaults I can safely
ignore to start with.

It’s not really complicated, especially with the solr-ruby library.
Add documents, delete them, query for them. Leverage highlighting
and more-like these features, dismax querying, etc.

My particular application does enough weird things that, for the most
part, I’d prefer unfettered access to the low-level Lucene APIs. (For
example, my application uses a lot of gnarly query trees involving
filters and ranges, and I’m not sure whether those are easily
transmitted through the Solr APIs. Then I have “run all of these
queries against each of the documents in this specific set and tell me
which document/query pairs match in one fell swoop” routines, in which
case it might be a good idea to copy the documents into a temporary
RAM index to run the queries against.)
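The batch-matching routine mentioned can be sketched like this; naive substring matching stands in for real query evaluation (in Ferret, creating an Index with no :path keeps it in a RAM directory, which is what makes the temporary-index trick cheap):

```ruby
# Sketch of "run all of these queries against this specific document
# set and report which doc/query pairs match": copy the documents into
# a throwaway in-memory index, then evaluate every query against it.
# String#include? stands in for the real searcher hit check.
def match_matrix(docs, queries)
  pairs = []
  docs.each_with_index do |doc, di|
    queries.each_with_index do |q, qi|
      pairs << [di, qi] if doc.include?(q)  # real code: temp index + searcher
    end
  end
  pairs
end

docs    = ['ruby ferret index', 'lucene via jruby']
queries = ['ferret', 'jruby']
p match_matrix(docs, queries)  # => [[0, 0], [1, 1]]
```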

, (2) not particularly well-documented given its size

Wow. Have you seen the Solr wiki? Apache Solr Wiki - Solr - Apache Software Foundation -
there are nooks and crannies documented on that wiki that go well
beyond what I’d consider good documentation.

By all means point me to areas that aren’t documented that you need
to know (off list) and I’ll get those taken care of.

Wikis are fine for looking up details when you already mostly know
what you’re doing, but they’re not nearly as useful when you’re in the
earlier stages trying to get the big “What does this system look like
and how does it work?” picture and evaluate initial plans of attack.
Ferret and Lucene both have entire books written about them that are
excellent for those purposes. (They’re not free-as-in-beer, but are
well worth the cost.) By comparison, Solr has a very simple “here is
how you get a straightforward app off the ground” tutorial that says
little about how Solr is actually organized, and then you’re basically
left staring at a Wiki page with a thousand bullet points and no clear
path to big-picture enlightenment. And given the choice between (1)
using a lower-level system that’s been very well-documented in a
well-organized explanatory fashion and (2) using a slightly
higher-level system I still haven’t acquired a mental “big picture”
for, I generally find (1) more productive.

This isn’t a criticism of Solr’s documentation nearly as much as a
hearty “Book-style documentation is useful, and, holy crap, Ferret and
Lucene actually HAVE IT. Woohoo!”, plus an added bonus testament to
my own laziness.

(3) likely to get in my way when I want to do anything low-level and
gnarly with Lucene.

Maybe, but not much in your way. You’d have to wrap your low-level
mojo inside some Solr API perhaps, but not even that if we’re just
talking about custom analyzers or a similarity implementation.

Yeah, my guess is that if I sit down and figure out how Solr is laid
out, adding APIs to do what I want won’t be too hard. Might still be
kind of tedious implementing all the necessary marshaling, though.

– Scott

Great. For my own curiosity, and maybe people here share some of it:

Is it possible to write your own custom analyzers for Solr? If so, how
easy is it? Can one do that in Ruby, or do I have to write it in Java?

I personally think that’s one of the greatest things about Ferret. So
far I haven’t bothered looking into Sphinx or Solr precisely because,
from a glance, I couldn’t find a way to customize anything in detail
like I can do with Ferret. I assume there is a way…

Thing is, reading through the Ferret booklet (the one from O’Reilly),
you get a glimpse of how easy it is to build custom solutions using
it. So whereas it’s kind of sad that the lead developer has been
distant from the project in the last few months (?), I have to say,
there’s hardly anything that matches how easy it is to work with.

Hi Guys,

I recently worked with a client that had performance issues trying
to get all their CPUs pumping in production. I advised and helped
them to get multiple processes running on different ports with
read/search-only access to an index.

They are still working on the write index - but theoretically the
indexes can be reloaded when an index directory is placed in the index
path.

Read more - I put some notes on my blog, http://kalvir.blogspot.com/, and
there’s a link to the patches we made to the Ferret code.

Kalv.