Forum: Ferret Indexing fails -- _ntc6.tmp exceeds 2 gigabyte maximum

William Mitchell (trucker)
on 2006-06-02 21:16
Ferret 0.9.3
Ruby 1.8.2
NOT storing file contents in the index.
Only indexing first 25k of each file.
Very large data set (1 million files, 350 GB)
Code based on snippet from David Balmain's forum posts.

After 6 hours, Ferret bails out with Ruby "exceeds max file size".

Cache:

-rw-r--r--  1 bill bill 2147483647 2006-06-01 22:45 _ntc6.tmp
-rw-r--r--  1 bill bill 1690862924 2006-06-01 22:42 _ntc6.prx
-rw-r--r--  1 bill bill  646302802 2006-06-01 22:42 _ntc6.frq
-rw-r--r--  1 bill bill  165561698 2006-06-01 22:42 _ntc6.tis
-rw-r--r--  1 bill bill   50541430 2006-06-01 22:14 _ntc6.fdt
-rw-r--r--  1 bill bill    8000000 2006-06-01 22:14 _ntc6.fdx
-rw-r--r--  1 bill bill    2097842 2006-06-01 22:42 _ntc6.tii
-rw-r--r--  1 bill bill    1000000 2006-06-01 22:42 _ntc6.f0
-rw-r--r--  1 bill bill    1000000 2006-06-01 22:42 _ntc6.f1
-rw-r--r--  1 bill bill         30 2006-06-01 22:42 segments
-rw-r--r--  1 bill bill         16 2006-06-01 22:14 _ntc6.fnm

Code:

#------------

index = Index::Index.new(:path => "/var/cache/ferrets")

max_file_length = 25000

Dir.glob(allfiles).each do |file|
  doc = Document::Document.new()
  doc << Document::Field.new(:file, file,
                             Document::Field::Store::YES,
                             Document::Field::Index::UNTOKENIZED)
  doc << Document::Field.new(:content, IO.read(file, max_file_length),
                             Document::Field::Store::NO,
                             Document::Field::Index::TOKENIZED)
  index << doc
end

#------------

Is there a workaround, or is this exceeding Ferret's limits?

Thanks!  By the way, retrieval is usably fast for my purposes, even on a
big index like this.  Very impressive.
David Balmain (Guest)
on 2006-06-03 01:50
(Received via mailing list)
On 6/3/06, William Mitchell <wemitchell@gmail.com> wrote:
>
> -rw-r--r--  1 bill bill         16 2006-06-01 22:14 _ntc6.fnm
>   |file|
> #------------
>
> Is there a workaround, or is this exceeding Ferret's limits?

You need to set :max_merge_docs when you create the index. This stops
the index from merging segments once they reach a certain size. It also
means you will always have multiple segments in your index, which slows
searches down a little, but it shouldn't be a problem. Judging by the
filenames, you had almost merged 1,000,000 documents by the time it
failed ("ntc6".to_i(36) = 1,111,110 = 1,000,000 documents and 111,110
merges), so you are pretty close to finishing. If you create your index
like this, it should work:

    index = Index::Index.new(:path => "/var/cache/ferrets",
                             :max_merge_docs => 100_000)

This will leave you with at least 10 segments at the end. You could
also set max_merge_docs to 500_000 and run index.optimize at the end.
This should keep you under the max file size and with 2-3 segments,
searching should be easily fast enough.
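
The figures quoted in this thread can be checked in plain Ruby without Ferret. The sketch below is only a sanity check: it decodes the base-36 segment name, confirms that the stuck .tmp file sits exactly at the signed 32-bit limit, and computes a lower bound on the segment count. The helper min_segments is hypothetical (not part of Ferret) and assumes no segment ever holds more than max_merge_docs documents.

```ruby
# Plain-Ruby check of the numbers in this thread (no Ferret required).

# Segment file names carry a base-36 counter.
segment_counter = "ntc6".to_i(36)
puts segment_counter                    # 1111110

# Size of _ntc6.tmp from the directory listing: the largest signed 32-bit value.
tmp_size = 2_147_483_647
puts tmp_size == 2**31 - 1              # true

# Hypothetical helper, not part of Ferret: fewest segments possible if no
# segment may hold more than max_merge_docs documents (ceiling division).
def min_segments(total_docs, max_merge_docs)
  (total_docs + max_merge_docs - 1) / max_merge_docs
end

puts min_segments(1_000_000, 100_000)   # 10
puts min_segments(1_000_000, 500_000)   # 2
```

These are lower bounds only; the actual segment count depends on how Ferret schedules merges.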

As an aside, you can also set :max_field_length (default 10,000) to
limit the number of terms that get indexed from any one document
instead of truncating the file to 25,000 bytes. This will prevent you
from getting a half term at the end of the document, as 25,000 bytes
might break in the middle of a word. It shouldn't affect search results
too much, however, so you can keep doing it this way. In a future
version you'll be able to pass a File handle instead of a string, in
which case it will definitely be better to set :max_field_length.
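
If you stay with byte truncation, the half-term problem can be avoided by trimming back to the last whitespace before indexing. This is a minimal sketch, not Ferret functionality: truncate_at_word is a hypothetical helper, and it assumes a modern Ruby (String#b) and whitespace-separated text.

```ruby
# Hypothetical helper, not part of Ferret: keep at most `limit` bytes and
# drop a trailing partial word so the last token is a whole term.
def truncate_at_word(text, limit)
  bytes = text.b                          # raw bytes, like IO.read(file, n)
  return bytes if bytes.bytesize <= limit # nothing would be cut off
  head = bytes[0, limit]
  # If the byte just past the limit is whitespace, the cut fell on a boundary.
  return head if bytes[limit] =~ /\s/
  cut = head.rindex(/\s/)
  cut ? head[0, cut] : head               # no whitespace at all: keep the head
end

sample = "ferret indexes documents quickly"
puts sample.b[0, 10]                # ferret ind  (plain byte cut splits "indexes")
puts truncate_at_word(sample, 10)   # ferret
```

In the indexing loop, this would be applied to the string read from each file before it is passed to the :content field.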

> Thanks!  By the way, retrieval is usably fast for my purposes, even on a
> big index like this.  Very impressive.

Thanks. Please let me know how it goes. This is possibly the largest
document set to be indexed with Ferret so far.

Cheers,
Dave