Hi all,

I have been looking at Lucene and Ferret. I have noticed that Ferret takes ~463 seconds to index 200 MB of docs, whereas Lucene takes ~60 seconds. I'm using the standard "get you started" sort of code provided by both libraries.

My Ruby code is (abridged):

    @index = Index::Index.new(:path => inIndexPath)

    def createIndex(inRepositoryPath)
      Find.find(inRepositoryPath) do |path|
        if FileTest.file?(path)
          File.open(path) do |file|
            @index.add_document(:file => path, :content => file.readlines)
          end
        end
      end
    end

My Java code is basically a direct port.

Has anyone else noticed this difference in speed? Am I doing something wrong? Is this speed normal? Any advice gratefully received.

Thanks,
Steven
on 2006-05-02 17:16
on 2006-05-03 19:06
Hi Steven,

Are the indexes you get the same size? My guess is that the code isn't really equivalent. Ferret should be faster than Lucene. Try this:

    include Ferret::Document

    @index = Index::Index.new(:path => inIndexPath)

    def createIndex(inRepositoryPath)
      Find.find(inRepositoryPath) do |path|
        if FileTest.file?(path)
          File.open(path) do |file|
            doc = Document.new()
            doc << Field.new(:file, path, Field::Store::YES, Field::Index::UNTOKENIZED)
            doc << Field.new(:content, file.readlines, Field::Store::NO, Field::Index::TOKENIZED)
            @index << doc
          end
        end
      end
    end

Let me know if this helps.

Cheers,
Dave
on 2006-05-05 15:15
Hi Dave,

Thanks very much for getting back to me. You were right about the indexes being different... Your snippet has helped - but it is still nowhere near as fast as the Java version:

    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("modified",
        DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("contents", new FileReader(f)));

Could it be that Ruby's file.readlines is slower than Java's FileReader?

Another possible snafu is that the directory contains loads of PDFs and other binary files which neither Lucene nor Ferret can index - could it be that Ferret is slower at dealing with things like that? (Just a thought)

Would love to hear any thoughts.

Many thanks,
Steven
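One way to test the file.readlines question in isolation is to time the file reads with no indexing at all. The following is only a minimal sketch using the Ruby standard library; `time_reads` is a hypothetical helper name, not part of Ferret, and it assumes the whole repository fits comfortably in memory one file at a time.

```ruby
require 'benchmark'
require 'find'

# Hypothetical helper: time how long it takes just to read every regular
# file under +root+, without handing anything to an indexer.
# Returns a hash of elapsed seconds per reading strategy.
def time_reads(root)
  paths = []
  Find.find(root) { |path| paths << path if File.file?(path) }

  {
    # Same call as in the indexing code: an array with one string per line.
    readlines: Benchmark.realtime do
      paths.each { |path| File.open(path, 'rb') { |file| file.readlines } }
    end,
    # Alternative: slurp each file into a single string.
    read: Benchmark.realtime do
      paths.each { |path| File.open(path, 'rb') { |file| file.read } }
    end
  }
end
```

If both timings come out small compared with the total indexing time, the reads are unlikely to be the bottleneck. Note that the second pass may benefit from the operating system's file cache, so the two numbers are only roughly comparable.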
on 2006-05-05 16:42
Hi Steven,

Once you made those changes, were the indexes approximately the same size? You'll get the most accurate results if the indexes are identical. Also, which version of Ferret are you using?

I just tried 200 MB here (~600 files). In my case all of it is text and everything gets indexed. Lucene took ~120 seconds and Ferret took ~55 seconds. Both indexes are identical. I'm using the Sun JVM.

I look forward to your reply.

Cheers,
Dave
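A quick way to check whether two indexes are approximately the same size is to total the bytes on disk in each index directory. This is a rough sketch with the standard library only; `index_size_bytes` is a hypothetical helper, and it assumes each index lives in its own directory on the local filesystem.

```ruby
require 'find'

# Hypothetical helper: sum the sizes (in bytes) of all regular files
# under an index directory, including any subdirectories.
def index_size_bytes(index_path)
  total = 0
  Find.find(index_path) do |path|
    total += File.size(path) if File.file?(path)
  end
  total
end
```

Calling it once on the Ferret index directory and once on the Lucene one gives two numbers to compare. Equal sizes do not prove the indexes are identical, but a large gap suggests the two programs are not indexing the same fields or content.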
on 2006-05-11 17:18
Just for completeness' sake... After conversations offline with David, it turns out I have been working with the pure Ruby version of Ferret, without the C extensions, which explains the slower performance.
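For anyone wanting to check the same thing, one generic approach is to look through Ruby's list of loaded features for a compiled shared library. This is a sketch under assumptions, not a documented Ferret check: `loaded_native_extensions` is a hypothetical helper, and it assumes the C extension is loaded as a shared library whose file name contains the given name.

```ruby
require 'rbconfig'

# Hypothetical helper: return the loaded feature paths that look like a
# compiled extension for +name+. A compiled extension is recognised by
# the platform's shared-library suffix (for example "so" or "bundle").
def loaded_native_extensions(name, features = $LOADED_FEATURES)
  suffix = ".#{RbConfig::CONFIG['DLEXT']}"
  features.select do |feature|
    feature.end_with?(suffix) && File.basename(feature).include?(name)
  end
end
```

After requiring the library, an empty result from `loaded_native_extensions('ferret')` would suggest, under the naming assumption above, that only the pure Ruby code has been loaded.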