Benchmark - Thanks Dave for making this gnawer this FAST!

janpipek · May 12, 2006, 11:38pm

Hi List,

I took some time and ran some tests on the performance of
java-lucene, hyperestraier and ferret, as Dave encourages the ferret
community to do.

Quite interesting numbers. Ferret indeed deserves to be called a
high-performance port!!

It’s MyFirstBenchmark (
http://ferret.davebalmain.com/trac/wiki/MyFirstBenchmark ) so please
don’t be too cruel in criticizing the method. It’s just a hack and it’s
flawed - as every other benchmark is. But it provides some numbers, and
regardless of how flawed it is, one thing remains true: All of these
search engines are fast enough for most of us…

Regards
Jan

janpipek · May 16, 2006, 1:41am

On May 12, 2006, at 2:38 PM, Jan P. wrote:

Hi List,

I took some time and ran some tests on the performance of
java-lucene, hyperestraier and ferret, as Dave encourages the ferret
community to do.

Hello, Jan… On the benchmarking page you make this request.

“If you are an expert in one of these search-engines than provide
some information about the best optimizations.”

As the author of another Lucene port (KinoSearch, Perl/C), I know a
fair amount about Lucene. Better, I put together some benchmarks
comparing Lucene, KinoSearch and Plucene a little while ago
<http://www.rectangular.com/kinosearch/benchmarks.html>, and I solicited the
help of the Lucene developers list to help tune the Lucene
benchmarking app. By the end it performed around twice as well as my
initial version.

In order to max out Lucene’s indexing speed…

Don’t use the compound file format:

    indexWriter.setUseCompoundFile(false);

Set maxBufferedDocs to at least 100, and if you have the RAM, 1000:

    indexWriter.setMaxBufferedDocs(1000);

Give the JVM a generous heap and run it under -server:

    java -Xmx500M -server MyIndexer

Make sure that JVM startup time is not factored into the results
unless you intend it to be.

All this in addition to good stuff like warming up OS caches with dry
runs prior to test runs, ensuring that the machine is otherwise idle,
making sure that the analyzers are exactly equivalent (the fact that
the search results differ is a red flag – I’d use WhitespaceAnalyzer
instead of whatever you’re using), and other such steps to isolate
the variables you intend to measure. Then, perform multiple iterations.
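
To make the warm-up and multiple-iteration advice concrete, here is a minimal sketch of an in-process timing loop. It is not taken from either benchmark: the class name MiniBench, the measure and median helpers, and the whitespace-splitting workload are all hypothetical stand-ins, and a real run would replace the workload with the actual indexing or search call. Timing inside the process with System.nanoTime() keeps JVM startup out of the numbers, and the untimed warm-up runs give the OS cache and the JIT a chance to settle first.

```java
import java.util.Arrays;

/** Illustrative timing harness; names and workload are assumptions, not part of any benchmark. */
final class MiniBench {

    /** Runs the task untimed `warmups` times, then returns the nanoseconds taken by each timed run. */
    static long[] measure(Runnable task, int warmups, int iterations) {
        if (warmups < 0 || iterations < 1) {
            throw new IllegalArgumentException("need warmups >= 0 and iterations >= 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.run(); // dry runs: results are discarded
        }
        long[] nanos = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Median of the samples; less sensitive to a single slow outlier than the mean. */
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2;
    }

    public static void main(String[] args) {
        // Stand-in workload (hypothetical): split a synthetic document on whitespace.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append("token").append(i).append(' ');
        }
        String text = sb.toString();
        long[] sink = new long[1]; // accumulating a result keeps the work from being optimized away
        Runnable task = () -> sink[0] += text.split("\\s+").length;

        long[] samples = measure(task, 5, 20);
        System.out.printf("median: %.3f ms over %d runs (sink=%d)%n",
                median(samples) / 1e6, samples.length, sink[0]);
    }
}
```

Reporting the median rather than a single run is one reasonable choice here; printing the full sample array as well makes it easier to spot a run that was disturbed by other activity on the machine.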

It’s MyFirstBenchmark (
http://ferret.davebalmain.com/trac/wiki/MyFirstBenchmark ) so please
don’t be too cruel in criticizing the method.

It’s very difficult to run a good scientific experiment of any kind.
In fact my current results are flawed – I left out a call to optimize()
in the Lucene benchmark, so Lucene performs not quite so well as
the numbers on my page would indicate. But I’d rather err on that
side than on the side of giving the engine I’m attached to a leg up.

one thing remains true: All of these search
engines are fast enough for most of us…

Yes. Things are different than they were just a couple years ago.

Marvin H.
Rectangular Research
http://www.rectangular.com/

janpipek · May 16, 2006, 9:04am

Hi, Marvin,

thank you very much. I will take this advice into account when I’m
doing other tests. As a first step I’ll add a link to your post to the
ferret wiki to let people know…

Regards
Jan P.