Forum: Ferret - Topic: Ferret slow after a while

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
Marcus A. (Guest)
on 2006-05-24 19:37
I'm building a new index from scratch from a number of documents stored
in a database, loaded through my Rails environment, using Ruby Ferret
0.9.x (installed today via gem) on Windows. At first everything goes
fine, but after a number of documents it starts to get slower and slower
until it grinds to a halt (or at least it feels that way).

Am I doing something wrong? Is there some way to work around this?

/Marcus

Code in question:

ENV['RAILS_ENV'] ||= 'development'
puts "Environment : #{ENV['RAILS_ENV']}"

require 'config/environment.rb'

require 'ferret'

index = Ferret::Index::Index.new(:path => Node.class_index_dir, :create => true)
Node.find_all_by_type("PageNode").each { |content|
  puts "ID: #{content.id} => name: #{content.title}"
  index << content.to_doc if content.respond_to?("to_doc")
}
index.flush
index.optimize
index.close
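(One thing worth trying if the slowdown comes from memory growth: flush in batches instead of only once at the end. This is just a rough sketch reusing the `Index#flush` call from the script above; the batch size of 100 is a guess, and `docs` stands in for the `find_all_by_type` result set.)

```ruby
# Rough sketch: flush the index every N documents instead of once at the
# end, so the in-memory segment is written out periodically during a big
# batch. The batch size is arbitrary; tune it to your data.
def index_in_batches(index, docs, batch_size = 100)
  docs.each_slice(batch_size) do |slice|
    slice.each { |doc| index << doc }
    index.flush  # persist what we have before continuing
  end
end
```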
Jan P. (Guest)
on 2006-05-24 20:57
(Received via mailing list)
Hi, Marcus,

By using Ferret 0.9.3 on Windows you are running the pure Ruby version.
As I read some time ago, someone (I think it was Jens Kraemer) suggested
that downgrading to 0.3.2 might be a good idea on Windows, because that
version comes with a native extension even on Windows (not as
feature-rich as cFerret, of course, but a predecessor). Pure Ruby, as
clean and wonderful as the language is, is slow compared to Java or C,
so pure Ruby Ferret isn't really the first choice for building up an
index of a large document set.

Another possibility you might want to consider while waiting for cFerret
on Windows is to do the initial huge indexing batch on a Linux or
OSX/FreeBSD machine, transfer the index, and perform only the ongoing
updates on Windows.

Regardless of what I've said before: what performance are you seeing
with your pure Ruby installation? How many documents do you need to
index initially? After how many documents do you hit the bottleneck?

Regards
Jan
Marcus A. (Guest)
on 2006-05-25 17:22
Jan P. wrote:
>
> Regardless of what I've said before: what performance are you seeing
> with your pure Ruby installation? How many documents do you need to
> index initially? After how many documents do you hit the bottleneck?
>
After quite a bit more testing, the speed seems to be content-dependent.
The problematic content appears to be ugly test content where someone
has just hit random keys.

The content it chokes on is down at the end of this post.

/Marcus

Each document is built this way (documents may contain UTF-8 chars but I
ignore that for now):

class Node < ActiveRecord::Base
  acts_as_ferret ...
end

class PageNode < Node
  def to_doc
    doc = super
    page.content_items.each { |item| item.to_doc(doc) if item.searchable? } if page
    doc
  end
end

class ContentItem
  def to_doc(doc)
    doc <<  Ferret::Document::Field.new(
              'content_item', self.content,
              Ferret::Document::Field::Store::NO,
              Ferret::Document::Field::Index::TOKENIZED)
  end
end

Content:

<h1>Huvudrubrik
svart</h1>ldfkgjdflkgjdflkgjdflgkdflgkdflgkjdflkgj<br><br><h2>Huvudrubrik
orange</h2>sdlkfjsdfkljsdlfksjdflsjflskfjslkfjslkdfsd<br>fsd
fsdfsd<br>fsdfsdfsddfdsdfsdf<br><h3>Underrubrik
svart</h3><p>dfgfgdfgdfgdfgdfgdfgdfgdf<br>gdfgdgdfgkjhdfkjghdkjgh dkjghd
kgjhd kgfjh d<br></p><h4>Underrubrik
orange</h4>lkdfjgldfkgjdlfkgjdlfkgjdflkgdfg<br>dfgdfgdfgdfgdfgdfg<br><br><h5>Styckerubrik
svart</h5>fghfhfghfhkfjglhkjfglhkfjhlkfjghlfkhjflkgh
jflgkhjflgkhf<br>ghfghfgh<br>fgh<br>fghfghfghgfh<br><br><h6>Styckerubrik
orange</h6>fghkfgjhlfgkjhflkghj flghkjfgl hkjfg lhkfgjhlfgkhfghfgh<br>
Jan P. (Guest)
on 2006-05-25 17:51
(Received via mailing list)
Hi, Marcus,

I don't know too much about the internals of Ferret, but I'm not very
surprised that it is choking on this 'content'. Like all full-text
search engines, Ferret presumes that the input it indexes is
human-readable language. On random input like this it would only be by
coincidence that the stemming and analyzing (and so on) algorithms don't
misbehave, which results in lengthy parsing at the very least.

Is this only because it's hard to get 'real world' test content? You'll
find loads of text on http://www.gutenberg.org/, for example...

Regards
Jan
Marcus A. (Guest)
on 2006-05-25 19:15
This is actually content from the customer's database. Most of the
content in the database is real (the site is actually in live
deployment). The problem seems to be that they created a number of test
pages in the beginning that are still there.

How do I as a developer ensure that the content isn't of a form that
Ferret chokes on? I mean, even if I take the test data out now, I cannot
guarantee someone else will put similar data into the database again.
Then it's me, the developer, who will take the blame when search isn't
working.

It must be possible to either:

- somehow test the data before indexing to ensure it isn't "deadly", or
- have the indexing algorithm skip a document after a (configurable)
time if it's stuck on a small chunk of data.

(or something like it)
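The second idea could be sketched with Ruby's stdlib Timeout module; this is just a rough illustration, where `index` and `doc` stand in for the Ferret objects and the 5-second default is arbitrary:

```ruby
require 'timeout'

# Rough sketch of the "skip after a configurable time" idea: wrap each
# document add in a timeout and report whether it completed, so the
# caller can log the document id and move on instead of hanging.
def add_with_timeout(index, doc, seconds = 5)
  Timeout.timeout(seconds) { index << doc }
  true
rescue Timeout::Error
  false  # the add was interrupted; skip this document
end
```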

Would it help in this case to replace the <html> tags with spaces (since
they aren't significant anyway)?
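What I have in mind is something like this naive gsub; just a sketch, and the regex is not robust against anything but simple markup like the sample above:

```ruby
# Rough sketch of the tag-stripping idea: replace anything that looks
# like an HTML tag (and &nbsp;-style entities) with a space before
# handing the text to the indexer.
def strip_tags(html)
  html.gsub(/<[^>]*>/, ' ').gsub(/&\w+;/, ' ').squeeze(' ').strip
end
```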

Regards
Marcus

ps. Thanks for the comments.
Marcus A. (Guest)
on 2006-05-25 19:23
Marcus A. wrote:
>
> Would it help in this case to replace <html>-tags with spaces (as those
> aren't significant anyway)?

Answering myself here: no, it doesn't (after testing...).

Marcus
Marcus A. (Guest)
on 2006-05-25 20:21
More testing:

This document (with several fields in it) took 15 seconds to index:
Field: new item
Field: Presentationsmaterial
Field: Ppt-presentationer
Field: &nbsp;
Field: new item
Field: new item
Field: new item

A bit long for that little content, if you ask me. I have several
similar documents that take a lot of time. ("new item" is an ugly
default value that all content items get from the beginning, don't ask
me why. Does it affect indexing speed when a lot of documents contain
similar tokens?)

But I don't know. I'm using the Ruby version, which is supposed to be
slow. Maybe the super-fast C implementation would take 150 ms to handle
a document of this size? What affects indexing speed?

Regards,
Marcus
Jan P. (Guest)
on 2006-05-25 20:32
(Received via mailing list)
Hi, Marcus,

As you can read at
http://ferret.davebalmain.com/trac/wiki/MyFirstBenchmark, indexing
408 MB of Project Gutenberg files took around one minute. That should
give you an impression of the expected indexing speed.

I don't have the time right now to test the performance on a Windows box
or with cFerret; maybe someone else can jump in. But 15 seconds for this
document is obviously strange.

cheers,
Jan
Jan P. (Guest)
on 2006-05-25 20:54
(Received via mailing list)
Hi, Marc,

If it would be of any help to you and you have the time to make some
preparations, you could send me a test.sql (or a migration) with a
little test data and your essential AR models. Then I can test it on a
Windows box and we can compare the results...

cheers,
Jan
Marcus A. (Guest)
on 2006-05-26 21:50
Jan P. wrote:
> Hi, Marc,
>
> If it would be of any help to you and you have the time to make some
> preparations, you could send me a test.sql (or a migration) with a
> little test data and your essential AR models. Then I can test it on a
> Windows box and we can compare the results...
>
> cheers,
> Jan

Thanks for your time. I think I'll wait for the Windows C version,
though. I've implemented an ugly straight DB search for the time being.

Regards,

Marcus
David B. (Guest)
on 2006-05-27 03:08
(Received via mailing list)
On 5/26/06, Marcus A. <removed_email_address@domain.invalid> wrote:
>
> A bit long for that little content if you ask me. I have several similar
> documents that take a lot of time ("new item" is an ugly default value
> that all content items get from the beginning, don't ask me why, does it
> affect indexing speed when a lot of documents contain similar tokens?).
>
> But, I don't know. I'm using the Ruby version. That is supposed to be
> slow. Maybe the super fast C implementation should take 150ms to handle
> a document of this size? What affects indexing speed?

Hi Marcus,

I just tested this here;

    require 'lib/rferret.rb'

    include Ferret
    include Ferret::Document
    include Ferret::Index

    doc = Document.new
    doc << Field.new(:field, "new item")
    doc << Field.new(:field, "Presentationsmaterial")
    doc << Field.new(:field, "Ppt-presentationer")
    doc << Field.new(:field, "&nbsp;")
    doc << Field.new(:field, "new item")
    doc << Field.new(:field, "new item")
    doc << Field.new(:field, "new item")

    i = Index.new(:path => "index_dir")
    i << doc
    i.close

  dbalmain@ubuntu:~/workspace/ferret $ time ruby test.rb

  real    0m0.147s
  user    0m0.125s
  sys     0m0.022s

This is with the pure Ruby version. If this document is taking 15
seconds then something is going wrong. Similarly, the bad data shouldn't
hurt indexing speed considerably, although it will make your index
larger than usual and merging will take a little longer. Could you post
a simple test case that takes a long time for you?

Cheers,
Dave