Forum: Ferret Indexing Speed?

steven (Guest)
on 2006-05-02 17:16
Hi all,

I have been looking at Lucene and Ferret.

I have noticed that Ferret takes ~463 seconds to index 200 MB of docs,
whereas Lucene takes ~60 seconds.

I'm using the standard "get you started" sort of code provided by both
libraries.

My Ruby code is (abridged):

require 'find'
require 'ferret'
include Ferret

@index = Index::Index.new(:path => inIndexPath)

# Walk the repository and add every regular file to the index.
def createIndex(inRepositoryPath)
    Find.find(inRepositoryPath) do |path|
        if FileTest.file?(path)
            File.open(path) do |file|
                @index.add_document(:file => path, :content => file.readlines)
            end
        end
    end
end

My Java code is basically a direct port.

Has anyone else noticed this difference in speed? Am I doing something
wrong? Is this speed normal?

Any advice gratefully received.
Thanks,
Steven
David Balmain (Guest)
on 2006-05-03 19:06
(Received via mailing list)
Hi Steven,

Are the indexes you get the same size? My guess is that the code isn't
really equivalent. Ferret should be faster than Lucene. Try this:

include Ferret::Document

@index = Index::Index.new(:path => inIndexPath)

def createIndex(inRepositoryPath)
    Find.find(inRepositoryPath) do |path|
        if FileTest.file?(path)
            File.open(path) do |file|
                doc = Document.new()
                doc << Field.new(:file, path,
                              Field::Store::YES, Field::Index::UNTOKENIZED)
                doc << Field.new(:content, file.readlines,
                              Field::Store::NO, Field::Index::TOKENIZED)
                @index << doc
            end
        end
    end
end

Let me know if this helps.

Cheers,
Dave
steven (Guest)
on 2006-05-05 15:15
Hi Dave,

Thanks very much for getting back to me.

You were right about the indexes being different...

Your snippet has helped - but still nowhere near as fast as the Java
version:

doc.add(new Field("path", f.getPath(), Field.Store.YES,
                  Field.Index.UN_TOKENIZED));
doc.add(new Field("modified",
                  DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
                  Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("contents", new FileReader(f)));

Could it be that ruby's file.readlines is slower than Java's FileReader?
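
One way to check the file.readlines question in isolation is to time the read step on its own, away from either indexer. The following is a minimal sketch using only Ruby's standard library; the helper names make_sample_file and compare_reads, and the generated sample file, are made up for illustration and are not part of anyone's code in this thread.

```ruby
require 'benchmark'
require 'tempfile'

# Build a throwaway text file so the comparison does not depend on a real corpus.
# (Hypothetical helper, for illustration only.)
def make_sample_file(line_count)
  file = Tempfile.new('readlines_sample')
  line_count.times { |i| file.puts("line #{i} of some plain text content") }
  file.flush
  file
end

# Time two ways of loading the same file and return the seconds taken by each.
# File.readlines builds an array of lines; File.read returns one string.
def compare_reads(path)
  lines_time = Benchmark.realtime { File.readlines(path) }
  whole_time = Benchmark.realtime { File.read(path) }
  { :readlines => lines_time, :read => whole_time }
end

sample = make_sample_file(50_000)
timings = compare_reads(sample.path)
puts "readlines: #{timings[:readlines]}s, read: #{timings[:read]}s"
sample.close!  # close and delete the temporary file
```

If both timings are small compared with the total indexing time, the read call is unlikely to account for the gap.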

Another possible snafu is that the directory contains loads of PDFs and
other binary files which neither Lucene nor Ferret can index - could it
be that Ferret is slower at dealing with things like that? (Just a
thought)
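
To rule binary files out as a factor, they could be filtered before they reach either indexer, so that both runs see the same plain-text input. This is a rough sketch, not anything Ferret or Lucene provides: probably_binary?, indexable_files and SAMPLE_BYTES are hypothetical names, and the check (a "%PDF-" header or a NUL byte in the first few KB) is only a heuristic that will misjudge some files.

```ruby
require 'find'

# How much of each file to inspect (an arbitrary choice for this sketch).
SAMPLE_BYTES = 4096

# Heuristic: call a file binary if it starts with the PDF magic header
# or has a NUL byte in its first SAMPLE_BYTES bytes.
def probably_binary?(path)
  chunk = File.open(path, 'rb') { |f| f.read(SAMPLE_BYTES) }
  return false if chunk.nil?  # empty file: read(length) returns nil at EOF
  chunk.start_with?('%PDF-') || chunk.include?("\0")
end

# Collect the regular files under root that pass the heuristic.
def indexable_files(root)
  Find.find(root).select { |path| File.file?(path) && !probably_binary?(path) }
end
```

Feeding the result of indexable_files to both the Ruby and the Java indexer would make the two timings easier to compare.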

Would love to hear any thoughts.

Many Thanks,
Steven.
David Balmain (Guest)
on 2006-05-05 16:42
(Received via mailing list)
Hi Steven,

Once you made those changes were the indexes approximately the same
size? You'll get the most accurate results if the indexes are
identical. Also, which version of Ferret are you using? I just tried
200 MB here (~600 files). In my case all of it is text and everything
gets indexed. Lucene took ~120 seconds and Ferret took ~55 seconds.
Both indexes are identical. I'm using the Sun JVM.

I look forward to your reply.

Cheers,
Dave
Steven Shingler (sshingler)
on 2006-05-11 17:18
Just for completeness' sake...

After conversations offline with David, it turns out I have been working
with the pure Ruby version of Ferret, without the C extensions,
which explains the slower performance.
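
A generic way to see whether a compiled extension was actually loaded is to look through Ruby's $LOADED_FEATURES list after the require. The sketch below assumes nothing about Ferret's internals: feature_loaded? is a hypothetical helper, and the name 'ferret_ext' is only a guess at what the compiled extension might be called, not something confirmed in this thread.

```ruby
# Return true if a feature with the given base name (no directory, no
# file extension) has been loaded by require. Hypothetical helper.
def feature_loaded?(name)
  $LOADED_FEATURES.any? { |feature| File.basename(feature, '.*') == name }
end

require 'find'
puts feature_loaded?('find')        # a standard library feature loaded just above

# 'ferret_ext' is an ASSUMED extension name, used here only as a placeholder.
puts feature_loaded?('ferret_ext')
```

Substituting the real name of the extension's shared library, run after the library has been required, would show which variant is in use.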