Indexing Speed?

casper_the_ghost · May 2, 2006, 5:16pm

Hi all,

Have been looking at lucene and ferret.

Have noticed that ferret takes ~463 seconds to index 200Mb of docs,
whereas lucene takes ~60 seconds.

I’m using the standard “get you started” sort of code provided by both
libraries.

My ruby code is: (abridged)

require 'find'
require 'ferret'
include Ferret

@index = Index::Index.new(:path => inIndexPath)

def createIndex(inRepositoryPath)
  Find.find(inRepositoryPath) do |path|
    if FileTest.file?(path)
      File.open(path) do |file|
        @index.add_document(:file => path, :content => file.readlines)
      end
    end
  end
end

My Java code is basically a direct port.

Has anyone else noticed this difference in speed? Am I doing something
wrong? Is this speed normal?

Any advice gratefully received.
Thanks,
Steven

casper_the_ghost · May 3, 2006, 7:06pm

Hi Steven,

Are the indexes you get the same size? My guess is that the code isn’t
really equivalent. Ferret should be faster than Lucene. Try this:

include Ferret::Document

@index = Index::Index.new(:path => inIndexPath)

def createIndex(inRepositoryPath)
  Find.find(inRepositoryPath) do |path|
    if FileTest.file?(path)
      File.open(path) do |file|
        doc = Document.new()
        doc << Field.new(:file, path,
                         Field::Store::YES,
                         Field::Index::UNTOKENIZED)
        doc << Field.new(:content, file.readlines,
                         Field::Store::NO, Field::Index::TOKENIZED)
        @index << doc
      end
    end
  end
end

Let me know if this helps.

Cheers,
Dave

casper_the_ghost · May 5, 2006, 3:15pm

Hi Dave,

Thanks very much for getting back to me.

You were right about the indexes being different…

Your snippet has helped - but still nowhere near as fast as the Java
version:

doc.add(new Field("path", f.getPath(), Field.Store.YES,
    Field.Index.UN_TOKENIZED));
doc.add(new Field("modified",
    DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
    Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("contents", new FileReader(f)));

Could it be that ruby’s file.readlines is slower than Java’s FileReader?

Another possible snafu is that the directory contains loads of PDFs and
other binary files which neither lucene nor ferret can index - could it
be that ferret is slower at dealing with things like that? (Just a
thought)
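
One way to take binary files out of the comparison is to skip them before they reach either indexer. Below is a minimal sketch, assuming a file counts as binary if its first kilobyte contains a NUL byte. The helper name probably_binary? and the 1024-byte sample size are made up for illustration and are not part of Ferret or Lucene; the heuristic will misclassify some files (UTF-16 text, for example).

```ruby
# Heuristic only: treat a file as binary if its first chunk contains a NUL byte.
# probably_binary? and SAMPLE_BYTES are illustrative names, not part of Ferret.
SAMPLE_BYTES = 1024

def probably_binary?(path)
  File.open(path, 'rb') do |file|
    chunk = file.read(SAMPLE_BYTES)
    # An empty file yields nil here; treat it as text.
    !chunk.nil? && chunk.include?("\0")
  end
end
```

Inside the Find.find loop, a line such as "next if probably_binary?(path)" before opening the file would keep such files out of the index, so both libraries are timed on the same text-only input.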

Would love to hear any thoughts.

Many Thanks,
Steven.

casper_the_ghost · May 5, 2006, 4:42pm

Hi Steven,

Once you made those changes were the indexes approximately the same
size? You’ll get the most accurate results if the indexes are
identical. Also, which version of Ferret are you using? I just tried
200Mb here (~600 files). In my case all of it is text and everything
gets indexed. Lucene took ~120 seconds and Ferret took ~55 seconds.
Both indexes are identical. I’m using the Sun JVM.
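
For anyone wanting to reproduce this kind of comparison, here is a rough sketch of how the two numbers discussed here (elapsed time and index size on disk) could be collected from Ruby using only the standard library. The helper names time_indexing and directory_size are hypothetical, and the usage at the bottom assumes a createIndex method and index path like the ones in the earlier posts.

```ruby
require 'benchmark'
require 'find'

# Runs the given block once and returns the wall-clock seconds it took.
def time_indexing
  Benchmark.realtime { yield }
end

# Sums the sizes of all regular files under dir, in bytes.
def directory_size(dir)
  total = 0
  Find.find(dir) do |path|
    total += File.size(path) if File.file?(path)
  end
  total
end

# Hypothetical usage, assuming createIndex and the paths from the posts above:
#   seconds = time_indexing { createIndex(inRepositoryPath) }
#   puts "indexed in #{seconds}s, index is #{directory_size(inIndexPath)} bytes"
```

Comparing the byte counts of the Ferret and Lucene index directories is only a coarse check that both runs indexed the same material; it does not prove the indexes are identical.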

I look forward to your reply.

Cheers,
Dave

casper_the_ghost · May 11, 2006, 5:18pm

Just for completeness’ sake…

After conversations offline with David, it turns out I have been working
with the pure ruby version of ferret, without the C extensions, which
explains the slower performance.
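
In case it helps anyone else hitting the same slowdown, a quick way to check whether a compiled extension can be loaded at all is to try requiring it and rescue the failure. This is only a sketch: loadable? is a made-up helper, and 'ferret_ext' is an assumed name for the C extension library, so check what your installed version actually ships.

```ruby
# Returns true if the named library can be loaded, false otherwise.
# loadable? is an illustrative helper, not part of Ferret.
def loadable?(library)
  require library
  true
rescue LoadError
  false
end

# 'ferret_ext' is an assumed name for the compiled extension; verify it
# against your own install before relying on this check.
puts "C extension available: #{loadable?('ferret_ext')}"
```

If this prints false, the library is probably falling back to the pure ruby code path, and timings against Lucene will not be representative.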