I'm building a new index from scratch based on a number of documents stored in a database, loaded using my Rails env (using Ruby Ferret 0.9x (installed today with Gem) on Windows). At first everything goes nicely, but after a number of documents it starts to go slower and slower until it grinds to a halt (or at least it feels like it).
Am I doing something wrong? Is there some way to work around this?
By using Ferret 0.9.3 on Windows you are using the pure Ruby version.
As I read some time ago, someone - I think it was Jens Kraemer - suggested that on Windows downgrading to 0.3.2 might be a good idea, because that version comes with a native extension (not as feature-rich as cFerret of course, but a predecessor) even on Windows. Pure Ruby - as clean and wonderful as the language is - is slow compared to Java or C, and therefore pure Ruby Ferret isn't really the first choice for building up an index over a large document set.
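If you go the downgrade route, the gem can be pinned to the older release at install time (a minimal sketch; the version number is simply the one suggested above):

gem install ferret --version 0.3.2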
Another possibility you might want to think about while waiting for cFerret on Windows could be to do the initial huge indexing batch on a Linux or OS X/FreeBSD machine, transfer the index, and perform only the ongoing updates on Windows.
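A rough sketch of that second approach, assuming the index directory built on the other machine has simply been copied over and that the installed gem loads with require 'ferret' (the Index calls mirror the ones in the benchmark further down this thread):

require 'rubygems'
require 'ferret'
include Ferret::Document
include Ferret::Index

# open the index directory that was copied over from the Linux/OS X box
index = Index.new(:path => "index_dir")

# ongoing updates only: add the occasional new document on the Windows side
doc = Document.new
doc << Field.new(:content_item, "newly added content")
index << doc

index.close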
Regardless of what I've said before: what performance are you experiencing with your pure Ruby installation? How many records do you need to index initially? After how many records do you hit the bottleneck?
After doing quite a bit more testing, it seems that indexing speed is content-dependent. The slow content seems to be ugly test content where someone has just made random keystrokes. The content it chokes on is down at the end.
/Marcus
Each document is built this way (documents may contain UTF-8 chars but I
ignore that for now):
class Node < ActiveRecord::Base
  acts_as_ferret ...
end

class PageNode < Node
  # add the searchable content items of the page to the Ferret document
  def to_doc
    doc = super
    page.content_items.each { |item| item.to_doc(doc) if item.searchable? } if page
    doc
  end
end

class ContentItem
  # append this item's content as a tokenized, unstored field
  def to_doc(doc)
    doc << Ferret::Document::Field.new(
      'content_item', self.content,
      Ferret::Document::Field::Store::NO,
      Ferret::Document::Field::Index::TOKENIZED)
  end
end
I don't know too much about the internals of Ferret, but I'm not too surprised that it is choking on this 'content'. Like all fulltext search engines, Ferret presumes that the text being indexed is human-readable language. With random keystrokes it would only be by coincidence that the stemming and analyzing (and so on) algorithms cope well, so at the very least you end up with lengthy parsing.
Is the random data only there because it was hard to get 'real world' test content? You'll find loads of content on http://www.gutenberg.org/, for example…
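To make the point above a little more concrete, here is a small pure-Ruby sketch (no Ferret involved, and the sample strings are made up) showing how random-keystroke data differs from normal prose in the statistics an analyzer cares about: nearly every token is unique, and tokens can get very long, which bloats the term dictionary and gives the stemmer nothing sensible to work with.

# crude token statistics: every whitespace-separated chunk counts as a token
def token_stats(text)
  tokens = text.split(/\s+/).reject { |t| t.empty? }
  avg = tokens.empty? ? 0 : tokens.inject(0) { |sum, t| sum + t.length } / tokens.size.to_f
  { :tokens => tokens.size, :unique => tokens.uniq.size, :avg_length => avg }
end

prose  = "new item Presentationsmaterial Ppt-presentationer new item new item"
random = "asdkj qweriuzxnmc pqowieruty asdfghjklzxcvbnm qwertyuiopas dfgh"

p token_stats(prose)   # few unique terms, short tokens
p token_stats(random)  # almost every token unique, often much longer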
This is actually content from the customer's database. Most of the content in the database is real (the application is actually in live deployment). The problem seems to be that they created a number of test pages in the beginning that are still there.
How do I, as a developer, ensure that the content isn't of a form that Ferret chokes on? I mean, even if I take the test data out now, I cannot guarantee that someone else won't put similar data into the database again. Then it's me, the developer, who will take the blame when search isn't working.
It must be possible to either:
- somehow test the data before indexing to ensure it's not "deadly", or
- have the indexing algorithm skip a document after a (configurable) time if it's stuck on a small chunk of data
(or something like that).
Would it help in this case to replace -tags with spaces (as those aren't significant anyway)?
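A rough sketch of what both of those workarounds might look like, assuming documents are fed to a plain Ferret Index one at a time. The field name, the sample strings and the five-second limit are made up for illustration; Ruby's standard Timeout module provides the "give up after a while" part, and the gsub calls are a crude pre-clean of tags and absurdly long character runs:

require 'rubygems'
require 'ferret'
require 'timeout'
include Ferret::Document
include Ferret::Index

# crude pre-clean: turn markup-like tags into spaces and break up
# very long unbroken character runs before they reach the analyzer
def cleaned(text)
  text.gsub(/<[^>]*>/, ' ').gsub(/\S{80,}/, ' ')
end

items = ["new item", "Presentationsmaterial", "xcvkjhqwpeoriu zzxmncbvqqq"]  # stand-ins

index = Index.new(:path => "index_dir")
items.each do |content|
  doc = Document.new
  doc << Field.new(:content_item, cleaned(content))
  begin
    # give up on a single document if it takes unreasonably long to index
    Timeout.timeout(5) { index << doc }
  rescue Timeout::Error
    # log it and move on rather than letting one bad record stall the batch
  end
end
index.close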
This document (with several fields in it) took 15 seconds to index:
Field: new item
Field: Presentationsmaterial
Field: Ppt-presentationer
Field:
Field: new item
Field: new item
Field: new item
A bit long for that little content if you ask me. I have several similar documents that take a lot of time ("new item" is an ugly default value that all content items get from the beginning, don't ask me why; does it affect indexing speed when a lot of documents contain similar tokens?).
But I don't know; I'm using the Ruby version, which is supposed to be slow. Maybe even the super-fast C implementation would take 150 ms to handle a document of this size? What affects indexing speed?
I haven't got the time right now to test the performance on a Windows box and with cFerret. Maybe someone else can jump in, but 15 seconds for this document is obviously strange.
If it would be of any help to you and you've got the time to make some preparations, you could send me a test.sql (or migration) with a little test data and your essential AR models. Then I can test it on a Windows box and we can compare the results…
cheers,
Jan
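For reference, a minimal sketch of what such a test migration with a little seed data could look like, assuming ContentItem is an ActiveRecord model with a single content column (the table layout here is a guess based on the models shown earlier, not the actual schema):

class CreateTestContentItems < ActiveRecord::Migration
  def self.up
    create_table :content_items do |t|
      t.column :content, :text
    end
    # a couple of rows of the kind of data that triggers the slowdown
    ContentItem.create(:content => "new item")
    ContentItem.create(:content => "xcvkjhqwpeoriu zzxmncbv qqqwwweee")
  end

  def self.down
    drop_table :content_items
  end
end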
Thanks for your time. I think I'll wait for the Windows C version though. I've implemented an ugly straight DB search for the time being.
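For what it's worth, such a stopgap database search can be as small as a LIKE query through ActiveRecord. A sketch only, using the Rails 1.x style finders of the time and the model and column names guessed from the snippets above:

# crude stand-in for fulltext search until cFerret is available on Windows:
# a case-insensitive LIKE match over the content column
def simple_search(term)
  ContentItem.find(:all,
    :conditions => ["LOWER(content) LIKE ?", "%#{term.downcase}%"])
end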
Hi Marcus,
I just tested this here:
require 'lib/rferret.rb'
include Ferret
include Ferret::Document
include Ferret::Index
doc = Document.new
doc << Field.new(:field, "new item")
doc << Field.new(:field, "Presentationsmaterial")
doc << Field.new(:field, "Ppt-presentationer")
doc << Field.new(:field, " ")
doc << Field.new(:field, "new item")
doc << Field.new(:field, "new item")
doc << Field.new(:field, "new item")
i = Index.new(:path => "index_dir")
i << doc
i.close
dbalmain@ubuntu:~/workspace/ferret $ time ruby test.rb
real 0m0.147s
user 0m0.125s
sys 0m0.022s
This is with the pure Ruby version. If this document is taking 15 seconds then something is going wrong. Similarly, the bad data shouldn't hurt indexing speed considerably, although it will make your index larger than usual and merging will take a little longer. Could you post a simple test case that takes a long time for you?
Cheers,
Dave