Ferret slow after a while

I’m building a new index from scratch from a number of documents stored
in a database, loaded through my Rails environment (using Ruby Ferret
0.9x, installed today with RubyGems, on Windows). At first everything
goes nicely, but after a number of documents it starts to get slower and
slower until it grinds to a halt (or at least it feels like it).

Am I doing something wrong? Is there some way to work around this?

/Marcus

Code in question:

ENV['RAILS_ENV'] ||= 'development'
puts "Environment : #{ENV['RAILS_ENV']}"

require 'config/environment.rb'

require 'ferret'

index = Ferret::Index::Index.new(:path => Node.class_index_dir,
                                 :create => true)
Node.find_all_by_type("PageNode").each { |content|
  puts "ID: #{content.id} => name: #{content.title}"
  index << content.to_doc if content.respond_to?("to_doc")
}
index.flush
index.optimize
index.close

Hi, Marcus,

by using Ferret 0.9.3 on Windows you are using the pure Ruby version.
Some time ago someone (I think it was Jens Kraemer) suggested that on
Windows, downgrading to 0.3.2 might be a good idea, because that version
comes with a native extension even on Windows (not as feature-rich as
cFerret, of course, but a predecessor). Pure Ruby, as clean and
wonderful as the language is, is slow compared to Java or C, and
therefore pure Ruby Ferret isn’t really the first choice for building up
an index of a large document set.

Another possibility you might want to think about while waiting for
cFerret on Windows could be to do the initial huge indexing batch on a
Linux or OSX/FreeBSD machine, transfer the index, and perform only the
ongoing updates on Windows.

Regardless of what I’ve said before: what performance are you
experiencing with your pure Ruby installation? How many datasets do you
need to index initially? When (after how many datasets) do you
experience the bottleneck?

Regards
Jan

Jan P. wrote:

Regardless of what I’ve said before: what performance are you
experiencing with your pure Ruby installation? How many datasets do you
need to index initially? When (after how many datasets) do you
experience the bottleneck?

After doing quite a bit more testing, it seems that the speed is
content-dependent. The problematic content appears to be ugly test data
where someone has just made random keystrokes.

The content it chokes on is down at the end.

/Marcus

Each document is built this way (documents may contain UTF-8 characters,
but I’m ignoring that for now):

class Node < ActiveRecord::Base
  acts_as_ferret …
end

class PageNode < Node
  def to_doc
    doc = super
    page.content_items.each { |item|
      item.to_doc(doc) if item.searchable?
    } if page
    doc
  end
end

class ContentItem
  def to_doc(doc)
    doc << Ferret::Document::Field.new(
      'content_item', self.content,
      Ferret::Document::Field::Store::NO,
      Ferret::Document::Field::Index::TOKENIZED)
  end
end

Content:

Huvudrubrik svart

ldfkgjdflkgjdflkgjdflgkdflgkdflgkjdflkgj

Huvudrubrik orange

sdlkfjsdfkljsdlfksjdflsjflskfjslkfjslkdfsd
fsd fsdfsd
fsdfsdfsddfdsdfsdf

Underrubrik svart

dfgfgdfgdfgdfgdfgdfgdfgdf
gdfgdgdfgkjhdfkjghdkjgh dkjghd kgjhd kgfjh d

Underrubrik orange

lkdfjgldfkgjdlfkgjdlfkgjdflkgdfg
dfgdfgdfgdfgdfgdfg

Styckerubrik svart
fghfhfghfhkfjglhkjfglhkfjhlkfjghlfkhjflkgh jflgkhjflgkhf
ghfghfgh
fgh
fghfghfghgfh

Styckerubrik orange
fghkfgjhlfgkjhflkghj flghkjfgl hkjfg lhkfgjhlfgkhfghfgh

Hi, Marcus,

I don’t know too much about the internals of Ferret, but I’m not too
surprised that Ferret is choking on this ‘content’. Like all full-text
search engines, Ferret presumes that what it indexes is human-readable
language. It would only be by coincidence that the stemming, analyzing
(and so on) algorithms don’t fail on input like this, which results in
lengthy parsing at the very least.

Is it only because of problems getting ‘real world’ test content? You’ll
find loads of content on http://www.gutenberg.org/ for example…

Regards
Jan

This is actually content from the customer’s database. Most of the
content in the database is real (it’s actually in live deployment). The
problem seems to be that they created a number of test pages in the
beginning that are still there.

How do I, as a developer, ensure that the content isn’t of a form that
Ferret chokes on? I mean, even if I take the test data out now, I cannot
guarantee that someone else won’t put similar data into the database
again. Then it’s me, the developer, who will take the blame when search
isn’t working.

It must be possible to either:

  • somehow test the data before indexing to ensure it’s not “deadly”, or
  • have the indexing algorithm skip ahead after a (configurable) time if
    it’s stuck on a small chunk of data.

(or something like it)
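As a rough sketch of the first option (a hypothetical helper, not part of
Ferret; the token-length threshold is a guess), one could flag fields whose
tokens look like keyboard mash before handing them to the indexer:

```ruby
# Hypothetical pre-index sanity check: flags text whose tokens look like
# random keystrokes rather than language. Real language rarely has very
# long unbroken "words", so a high average token length is suspicious.
def probably_gibberish?(text, max_avg_token_length = 15)
  tokens = text.scan(/[[:alpha:]]+/)
  return false if tokens.empty?
  avg = tokens.map(&:length).sum.to_f / tokens.size
  avg > max_avg_token_length
end

probably_gibberish?("ldfkgjdflkgjdflkgjdflgkdflgkdflgkjdflkgj")    # => true
probably_gibberish?("Most of the content in the database is real") # => false
```

A check like this would of course miss some cases and misfire on others
(e.g. languages with very long compound words), so the threshold would need
tuning against real data.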

Would it help in this case to replace &nbsp;-tags with spaces (as those
aren’t significant anyway)?
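For reference, such a cleanup step might look like this (the method name
and the exact substitutions are illustrative, assuming the stored content
still contains raw HTML tags and entities):

```ruby
# Illustrative pre-index cleanup: replace HTML tags and literal &nbsp;
# entities with plain spaces before the text reaches the analyzer.
def strip_markup(text)
  text.gsub(/<[^>]+>/, ' ')  # drop HTML tags such as <br>
      .gsub('&nbsp;', ' ')   # non-breaking-space entities
      .squeeze(' ')          # collapse runs of spaces
      .strip
end

strip_markup("new&nbsp;item<br>new item")  # => "new item new item"
```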

Regards
Marcus

ps. Thanks for the comments.

Marcus A. wrote:

Would it help in this case to replace &nbsp;-tags with spaces (as those
aren’t significant anyway)?

Answering myself here: no, it doesn’t (after testing…).

Marcus

More testing:

This document (with several fields in it) took 15 seconds to index:
Field: new item
Field: Presentationsmaterial
Field: Ppt-presentationer
Field:  
Field: new item
Field: new item
Field: new item

A bit long for that little content, if you ask me. I have several similar
documents that take a lot of time. (“new item” is an ugly default value
that all content items get in the beginning, don’t ask me why; does it
affect indexing speed when a lot of documents contain similar tokens?)

But I don’t know. I’m using the Ruby version, which is supposed to be
slow. Maybe the super-fast C implementation is supposed to take 150 ms to
handle a document of this size? What affects indexing speed?
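One way to narrow this down would be to time each document individually
while indexing (a sketch using the stdlib Benchmark module; the block
stands in for the actual `index << doc` call, and the helper name is
made up):

```ruby
require 'benchmark'

# Hypothetical helper: yields each document to the indexing block and
# collects the ones that take longer than `threshold` seconds.
def find_slow_documents(docs, threshold = 1.0)
  docs.each_with_object([]) do |doc, slow|
    elapsed = Benchmark.realtime { yield doc }
    slow << [doc, elapsed] if elapsed > threshold
  end
end

# Usage with a Ferret index (assumed):
#   slow = find_slow_documents(documents) { |doc| index << doc }
#   slow.each { |doc, secs| puts "slow document: #{secs} s" }
```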

Regards,
Marcus

Hi, Marcus,

as you may read in
http://ferret.davebalmain.com/trac/wiki/MyFirstBenchmark, indexing
408 MB of Project Gutenberg files took around 1 min. That should give you
an impression of the indexing speed.

I haven’t got the time right now to test the performance on a Windows box
or with cFerret. Maybe someone else can jump in, but 15 seconds for this
document is obviously strange.

cheers,
Jan

Jan P. wrote:

Hi, Marc,

if it would be of any help to you and you’ve got the time to make some
preparations, you might send me a test.sql (or a migration) with a little
test data and your essential AR models. Then I can test it on a Windows
box and we’ll be able to compare the results…

cheers,
Jan

Thanks for your time. I think I’ll wait for the Windows C version,
though. I’ve implemented an ugly straight DB search for the time being.

Regards,

Marcus

Hi, Marc,

if it would be of any help to you and you’ve got the time to make some
preparations, you might send me a test.sql (or a migration) with a little
test data and your essential AR models. Then I can test it on a Windows
box and we’ll be able to compare the results…

cheers,
Jan

On 5/26/06, Marcus A. [email protected] wrote:

A bit long for that little content, if you ask me. I have several similar
documents that take a lot of time. (“new item” is an ugly default value
that all content items get in the beginning, don’t ask me why; does it
affect indexing speed when a lot of documents contain similar tokens?)

But I don’t know. I’m using the Ruby version, which is supposed to be
slow. Maybe the super-fast C implementation is supposed to take 150 ms to
handle a document of this size? What affects indexing speed?

Hi Marcus,

I just tested this here;

require 'lib/rferret.rb'

include Ferret
include Ferret::Document
include Ferret::Index

doc = Document.new
doc << Field.new(:field, "new item")
doc << Field.new(:field, "Presentationsmaterial")
doc << Field.new(:field, "Ppt-presentationer")
doc << Field.new(:field, "&nbsp;")
doc << Field.new(:field, "new item")
doc << Field.new(:field, "new item")
doc << Field.new(:field, "new item")

i = Index.new(:path => "index_dir")
i << doc
i.close

dbalmain@ubuntu:~/workspace/ferret $ time ruby test.rb

real 0m0.147s
user 0m0.125s
sys 0m0.022s

This is with the pure Ruby version. If this document is taking 15
seconds then something is going wrong. Similarly, the bad data shouldn’t
hurt indexing speed considerably, although it will make your index
larger than usual and merging will take a little longer. Could you
post a simple test case that takes a long time for you?

Cheers,
Dave