Indexing problem 10.9/10.10

Sorry if this is a repost – I wasn’t sure if the www.ruby-forum.com
list works for postings. I’ve been having trouble indexing a large
number of documents (2.4M).

Essentially, I have one process following the tutorial, dumping
documents to an index stored on the file system. If I open the index
from another process and run the size() method, it is stuck at a
number of documents much smaller than the number I’ve added to the
index.

E.g. 290k, when the indexer process has already gone through 1M.
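
Roughly, the two processes look like this (a simplified sketch rather
than my actual code; each_document is just a stand-in for my real data
source):

require 'rubygems'
require 'ferret'

# Process 1: the indexer, adding documents in a loop.
index = Ferret::Index::Index.new(:path => '/data/ferret_index')
each_document do |doc|                  # stand-in for my data source
  index << {:id => doc[:id], :content => doc[:content]}
end

# Process 2: a separate script that just reports the document count.
monitor = Ferret::Index::Index.new(:path => '/data/ferret_index')
puts monitor.size                       # this is what stays stuck around 290k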

Additionally, if I search, I don’t get results past an even smaller
number of docs (22k). I’ve tried the two latest Ferret releases.

Does this listing of the index directory look right?

-rw------- 1 blee blee 3.8M Oct 10 17:06 _v.fdt
-rw------- 1 blee blee 51K Oct 10 17:06 _v.fdx
-rw------- 1 blee blee 12M Oct 10 16:49 _u.cfs
-rw------- 1 blee blee 97 Oct 10 16:49 fields

-rw------- 1 blee blee 78 Oct 10 16:49 segments
-rw------- 1 blee blee 11M Oct 10 16:23 _t.cfs
-rw------- 1 blee blee 11M Oct 10 15:56 _s.cfs
-rw------- 1 blee blee 15M Oct 10 15:11 _r.cfs
-rw------- 1 blee blee 13M Oct 10 14:48 _q.cfs

-rw------- 1 blee blee 14M Oct 10 14:37 _p.cfs
-rw------- 1 blee blee 13M Oct 10 14:28 _o.cfs
-rw------- 1 blee blee 12M Oct 10 14:19 _n.cfs
-rw------- 1 blee blee 12M Oct 10 14:16 _m.cfs
-rw------- 1 blee blee 118M Oct 10 14:10 _l.cfs

-rw------- 1 blee blee 129M Oct 10 13:24 _a.cfs
-rw------- 1 blee blee 0 Oct 10 13:00 ferret-write.lck

Thanks,
Ben

We’ve had a somewhat similar situation ourselves, where we are
indexing about a million records, and each record can be somewhat
large.

Now…what happened on our side was that the index files (very similar
in structure to what you have below) hit a 2 GB limit and stopped
there…and the indexer started crashing each time it hit this limit.

On your side, your index file sizes don’t look that large. I think
compiling with large-file support only really kicks in when you hit
this 2 GB size limit.

Couple of thoughts that might help:

  1. On our side, to keep the size down, I would optimize the index every
    100,000 documents. The optimize call also flushes the index (see the
    sketch after this list).

  2. Make sure you close the index once you have indexed your data. Small
    thing…but just making sure.

  3. With the index being this large, we actually have two copies: one
    for searching against an already optimized index, and the other copy
    doing the indexing. This way, no items are being searched on while
    the indexing is taking place.

  4. One neat thing that I learned with indexing large items was that I
    don’t have to actually store everything. I can have a field set to
    tokenize, but not store, so that it can be searched…but I don’t need
    it to be displayed in the search results per se. Since I don’t
    actually store it, I was able to keep my index size down.
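
Here is roughly the pattern we ended up with for points 1 and 2 (a
simplified sketch, not our production code; each_record and the field
names are placeholders):

require 'rubygems'
require 'ferret'

index = Ferret::Index::Index.new(:path => '/data/index')

count = 0
each_record do |record|                 # placeholder for the real data source
  index << {:id => record.id, :body => record.text}
  count += 1
  # Point 1: optimize every 100,000 documents; the optimize call also
  # flushes whatever is still buffered in memory.
  index.optimize if count % 100_000 == 0
end

index.close                             # Point 2: close once indexing is done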

On 10/11/06, Ben L. [email protected] wrote:

E.g. 290k, when the indexer process has already gone through 1M.
-rw------- 1 blee blee 97 Oct 10 16:49 fields
-rw------- 1 blee blee 12M Oct 10 14:16 _m.cfs
-rw------- 1 blee blee 118M Oct 10 14:10 _l.cfs

-rw------- 1 blee blee 129M Oct 10 13:24 _a.cfs
-rw------- 1 blee blee 0 Oct 10 13:00 ferret-write.lck

Thanks,
Ben

I thought this was possibly because you didn’t have Ferret compiled
with large-file support, but by the looks of it you aren’t getting
near that limit yet. From the directory listing you have here, there
is no way you could have added more than 290K documents unless you set
:max_buffered_docs to a different value (> 10,000). Perhaps the index
is getting overwritten at some stage. Could you show us the code you
are using for indexing?
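
For example, something along these lines would quietly throw away the
existing index on every run (a rough sketch, obviously not your actual
code, with made-up paths):

require 'rubygems'
require 'ferret'

# :create => true wipes any existing index at that path, so doing this
# on every run (or in more than one worker) would make documents
# "disappear":
index = Ferret::Index::Index.new(:path => '/data/index', :create => true)

# Safer: only create the index if it doesn't exist yet, and raise
# :max_buffered_docs (default 10,000) if you want more documents held
# in memory before a new segment is written:
index = Ferret::Index::Index.new(:path => '/data/index',
                                 :create_if_missing => true,
                                 :max_buffered_docs => 10_000)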

As for search results only showing up for the first 22k documents, I’m
not sure what the problem might be. You need to make sure you open the
index reader or searcher after committing the index writer, otherwise
the latest results won’t show up. I don’t think this is your problem,
though, as I’m sure you would have opened the index reader well after
the first 22k documents had been indexed.
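
The pattern to aim for is roughly this (a sketch; the paths, field and
query are made up):

require 'rubygems'
require 'ferret'

index = Ferret::Index::Index.new(:path => '/data/index')
# ... add documents here ...
index.flush                        # commit the buffered documents to disk

# A searcher only sees what was committed before it was opened, so open
# (or re-open) it after the flush:
searcher = Ferret::Search::Searcher.new('/data/index')
query    = Ferret::QueryParser.new(:fields => [:content]).parse('content:ruby')
top_docs = searcher.search(query, :limit => 10)
puts top_docs.total_hits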

Cheers,
Dave

Thanks for the tips, things seem happier now. Yeah, the size of each
document (number of tokens) is actually quite small in my case – I
think this is just a case of me messing up the flush/optimize/close
tactics.

On 10/11/06, Ben L. [email protected] wrote:

Thanks for the tips, things seem happier now. Yeah, the size of each
document (number of tokens) is actually quite small in my case – I
think this is just a case of me messing up the flush/optimize/close
tactics.

That’s great to hear, Ben.

On 10/11/06, peter [email protected] wrote:

We’ve had a somewhat similar situation ourselves, where we are indexing
about a million records, and each record can be somewhat large.

Now…what happened on our side was that the index files (very similar in
structure to what you have below) hit a 2 GB limit and stopped
there…and the indexer started crashing each time it hit this limit.

On your side, your index file sizes don’t look that large. I think
compiling with large-file support only really kicks in when you hit this
2 GB size limit.

Hi Peter,
Did you manage to compile Ferret successfully with large-file support
yourself?

Couple of thoughts that might help:

  1. On our side, to keep the size down, I would optimize the index every
    100,000 documents. The optimize call also flushes the index.

You can also just call Index#flush to flush the index without having
to optimize, or IndexWriter#commit. Actually, they should both be
commit, so I’m going to alias commit to flush in the Index class in
the next version.
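
So the loop from earlier could just as well be (same placeholders as
before, with each_record standing in for the data source):

require 'rubygems'
require 'ferret'

index = Ferret::Index::Index.new(:path => '/data/index')

count = 0
each_record do |record|                 # placeholder data source
  index << {:id => record.id, :body => record.text}
  count += 1
  # flush just commits the buffered documents; it doesn't merge
  # segments, so it is much cheaper than optimize.
  index.flush if count % 100_000 == 0
end

index.optimize                          # one final optimize, if you want it
index.close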

  2. Make sure you close the index once you have indexed your data. Small
    thing…but just making sure.

  3. With the index being this large, we actually have two copies: one for
    searching against an already optimized index, and the other copy doing
    the indexing. This way, no items are being searched on while the
    indexing is taking place.

This shouldn’t be necessary. Whatever version of the index you open
the IndexReader on will be the version of the index that you are
searching; even when its files are deleted, it will hold on to the
file handles, so the data will still be available. The operating
system won’t be able to reclaim that disk space until you close the
IndexReader (or Searcher).
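
In other words, something like this should be fine (a sketch with
made-up paths):

require 'rubygems'
require 'ferret'

# The reader pins the segment files that exist when it is opened.
reader   = Ferret::Index::IndexReader.new('/data/index')
searcher = Ferret::Search::Searcher.new(reader)

# Another process can keep indexing and optimizing the same index; even
# if the old segment files get deleted, the open file handles keep them
# readable for this searcher.

# Closing releases the handles (and the disk space); re-open to see the
# newer version of the index.
searcher.close
reader.close
reader   = Ferret::Index::IndexReader.new('/data/index')
searcher = Ferret::Search::Searcher.new(reader)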

  4. One neat thing that I learned with indexing large items was that I
    don’t have to actually store everything. I can have a field set to
    tokenize, but not store, so that it can be searched…but I don’t need
    it to be displayed in the search results per se. Since I don’t
    actually store it, I was able to keep my index size down.

Very good tip. You should also set :term_vector to :no unless you are
using term-vectors.
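
Declaring that up front looks roughly like this (a sketch; the field
names are made up):

require 'rubygems'
require 'ferret'

# Defaults for any field that isn't declared explicitly.
field_infos = Ferret::Index::FieldInfos.new(:store => :yes,
                                            :term_vector => :no)

# The big text field: indexed (tokenized) so it can be searched, but
# neither stored nor term-vectored, which keeps the index small.
field_infos.add_field(:body, :store => :no, :index => :yes,
                      :term_vector => :no)

index = Ferret::Index::Index.new(:path => '/data/index',
                                 :field_infos => field_infos)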

Cheers,
Dave

Hey Dave!

Yes…we actually compiled with large-file support, and things seem to
be working just fine. And in the end, once I figured out that I could
tokenize a large bit of text without actually storing it, the
optimized index ended up at only about 1 GB, so large-file support
never became an issue, even though we did compile it that way, just in
case.

With the two-copies thing, we actually have two boxes in our cluster,
each with a copy of the index used for searching, but only one copy
used for indexing. That way, each box we have in the cluster can
search locally, while the “indexing” box can index away and update the
copies when it’s done.

Oh…and I do turn off :term_vector for most of my fields…thanks for the tip.

By the way, thanks for all the hard work you put into making this
product the best it can be.