Indexing fails -- _ntc6.tmp exceeds 2 gigabyte maximum

Ferret 0.9.3
Ruby 1.8.2
NOT storing file contents in the index.
Only indexing the first 25 KB of each file.
Very large data set (1 million files, 350 GB).
Code based on snippet from David B.'s forum posts.

After 6 hours, Ferret bails out with a Ruby "exceeds max file size" error.

Contents of /var/cache/ferrets (note the .tmp file is exactly 2^31 - 1 bytes):

-rw-r--r-- 1 bill bill 2147483647 2006-06-01 22:45 _ntc6.tmp
-rw-r--r-- 1 bill bill 1690862924 2006-06-01 22:42 _ntc6.prx
-rw-r--r-- 1 bill bill  646302802 2006-06-01 22:42 _ntc6.frq
-rw-r--r-- 1 bill bill  165561698 2006-06-01 22:42 _ntc6.tis
-rw-r--r-- 1 bill bill   50541430 2006-06-01 22:14 _ntc6.fdt
-rw-r--r-- 1 bill bill    8000000 2006-06-01 22:14 _ntc6.fdx
-rw-r--r-- 1 bill bill    2097842 2006-06-01 22:42 _ntc6.tii
-rw-r--r-- 1 bill bill    1000000 2006-06-01 22:42 _ntc6.f0
-rw-r--r-- 1 bill bill    1000000 2006-06-01 22:42 _ntc6.f1
-rw-r--r-- 1 bill bill         30 2006-06-01 22:42 segments
-rw-r--r-- 1 bill bill         16 2006-06-01 22:14 _ntc6.fnm

Code:

#------------

require 'rubygems'
require 'ferret'
include Ferret

index = Index::Index.new(:path => "/var/cache/ferrets")

# Index at most the first 25 KB of each file.
max_file_length = 25_000

# allfiles is the glob pattern covering the million-file data set.
Dir.glob(allfiles).each do |file|
  doc = Document::Document.new
  # Store the path (untokenized) so hits can be mapped back to files.
  doc << Document::Field.new(:file, file,
                             Document::Field::Store::YES,
                             Document::Field::Index::UNTOKENIZED)
  # Tokenize and index the content, but don't store it in the index.
  doc << Document::Field.new(:content, IO.read(file, max_file_length),
                             Document::Field::Store::NO,
                             Document::Field::Index::TOKENIZED)
  index << doc
end

#------------

Is there a workaround, or is this exceeding Ferret’s limits?

Thanks! By the way, retrieval is usably fast for my purposes, even on a
big index like this. Very impressive.

On 6/3/06, William M. [email protected] wrote:

> Is there a workaround, or is this exceeding Ferret's limits?

You need to set :max_merge_docs when you create the index. This stops
the index from merging segments once they reach a certain size. It
also means you will always have multiple segments in your index, which
will slow things down a little, but it shouldn't be a problem. Judging
by the file names, you've almost merged 1,000,000 documents by the
time it fails ("ntc6".to_i(36) == 1,111,110, i.e. 1,000,000 documents
plus 111,110 merges). Looks like you are pretty close to finishing. So
if you create your index like this it should work:

index = Index::Index.new(:path => "/var/cache/ferrets",
                         :max_merge_docs => 100_000)

This will leave you with at least 10 segments at the end. You could
also set :max_merge_docs to 500_000 and run index.optimize at the end
(see the sketch below). That should keep you under the maximum file
size, and with 2-3 segments searching should be easily fast enough.
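
Here's an untested sketch of that second approach, using the same
0.9.x API as your snippet (only :max_merge_docs and the final
optimize call change):

# Let segments grow to 500,000 docs before they are merged, so no
# single merge pass writes a file near the 2 GB limit.
index = Index::Index.new(:path => "/var/cache/ferrets",
                         :max_merge_docs => 500_000)

# ... add all 1,000,000 documents as before ...

# One final merge pass; with :max_merge_docs still in force this
# should leave 2-3 segments rather than one oversized segment.
index.optimize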

As an aside, you can also set :max_field_length (default 10,000) to
limit the number of terms that get indexed from any one document,
instead of truncating the file to 25,000 bytes. That will prevent you
from getting half a term at the end of a document, since a 25,000-byte
cut-off might land in the middle of a word. It shouldn't affect search
results too much either way, so you can keep doing it your way. In a
future version you'll be able to pass a File handle instead of a
string, in which case it will definitely be better to set
:max_field_length.
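
For example (again untested; :max_field_length counts terms, not
bytes, so 25_000 terms here is only an illustration). Only the lines
that change from your snippet are shown:

index = Index::Index.new(:path => "/var/cache/ferrets",
                         :max_merge_docs => 100_000,
                         :max_field_length => 25_000)

# Read the whole file and let Ferret stop indexing after 25,000
# terms. Until File handles are supported this still loads each
# file fully into memory.
doc << Document::Field.new(:content, IO.read(file),
                           Document::Field::Store::NO,
                           Document::Field::Index::TOKENIZED)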

> Thanks! By the way, retrieval is usably fast for my purposes, even
> on a big index like this. Very impressive.

Thanks. Please let me know how it goes. This is possibly the largest
document set to be indexed with Ferret so far.

Cheers,
Dave