Re: Possible Bug? IndexWriter#doc_count counts deleted docs

Hi David,

Deleted documents don’t get deleted until commit is called

Ok, but FYI, my experiments show that #commit doesn’t affect #doc_count,
even across ruby sessions.

On a different note, I’d like to request a variation of #add_document
which returns the doc_id of the document added, as opposed to self.

I’m trying to track down an issue with a large test index [600MB, 500k
docs] in which I need to update a document. The old document is deleted
then added again, but doesn’t show up in my searches.

A #doc_count on the writer before and after #add_document shows that the
index is 1 document larger, but I still can’t #search for the updated
doc.

What do you think about having #add_document “yield” the doc_id if
block_given?

Neville

On 9/14/06, Neville B. wrote:

Hi David,

Deleted documents don’t get deleted until commit is called

Ok, but FYI, my experiments show that #commit doesn’t affect #doc_count,
even across ruby sessions.

Sorry, I guess I wasn’t very clear on that point. Deletes don’t get
committed until commit is called, which is why IndexWriter has no
num_docs method: there is no way to reliably tell how many live
documents there are until the deletes are committed.
IndexWriter#doc_count is like IndexReader#max_doc. It tells you how
many documents there are in the index, deleted or not.
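
To make the difference between the two counts concrete, here is a toy model in plain Ruby. It is only a sketch: ToyWriter and its live_count method are hypothetical names invented for illustration and are not part of Ferret. The model just mimics the counting behaviour described above, where a delete marks a slot but leaves it in place.

```ruby
# Toy model only: ToyWriter is hypothetical and is NOT Ferret's IndexWriter.
class ToyWriter
  def initialize
    @docs = []       # every slot ever written, deleted or not
    @deleted = {}    # doc_id => true for slots marked as deleted
  end

  # Appends a document and returns self, so calls can be chained.
  def add_document(doc)
    @docs << doc
    self
  end
  alias_method :<<, :add_document

  # Marks matching documents as deleted; their slots stay in place.
  def delete(field, value)
    @docs.each_with_index do |doc, id|
      @deleted[id] = true if doc[field] == value
    end
    self
  end

  # max_doc-style count: every slot, deleted or not.
  def doc_count
    @docs.size
  end

  # Live documents only (hypothetical helper, for comparison).
  def live_count
    @docs.size - @deleted.size
  end
end

writer = ToyWriter.new
writer << { id: "1", title: "old" } << { id: "2", title: "other" }
writer.delete(:id, "1")               # mark the old version as deleted
writer << { id: "1", title: "new" }   # add the updated version

puts writer.doc_count    # 3: the deleted slot is still counted
puts writer.live_count   # 2
```

In this model an update (delete, then add) always grows doc_count by one, which matches the "index is 1 document larger" observation earlier in the thread.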

What do you think about having #add_document “yield” the doc_id if
block_given?

Neville

How about just using the doc_count method? Call it after you add the
document and subtract one, and you’ll have the document ID of the last
document added. Don’t call it before you add the document, as a merge
might happen when you add the document, possibly changing all document
IDs when deletes are completely removed.

Cheers,
Dave

On 9/14/06, David B. wrote:

I should also mention the reason I wouldn’t want to return the
document ID from any IndexWriter method is that the document ID could
become invalid when the next document is added (if a segment merge is
triggered and deletes exist). At least when using an IndexReader, the
document ID is valid for the life of the reader.
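
A toy model in plain Ruby shows both points: the "doc_count minus one" trick right after an add, and why that ID cannot be trusted once a merge has purged deleted documents. This is a sketch under stated assumptions, not Ferret's implementation. ToyIndex, its explicit merge method, and the [] lookup are hypothetical; in the real library a merge may be triggered inside add_document rather than by a separate call.

```ruby
# Toy model only: ToyIndex is hypothetical and is NOT part of Ferret.
class ToyIndex
  def initialize
    @docs = []      # slot number == document ID
    @deleted = []   # IDs marked deleted but not yet purged
  end

  def add_document(doc)
    @docs << doc
    self
  end

  def delete(id)
    @deleted << id unless @deleted.include?(id)
    self
  end

  def doc_count
    @docs.size
  end

  # Stand-in for a segment merge: deleted slots are dropped, so every
  # document stored after a purged slot slides down to a smaller ID.
  def merge
    @docs = @docs.each_with_index
                 .reject { |_doc, id| @deleted.include?(id) }
                 .map(&:first)
    @deleted.clear
    self
  end

  # Looks up a document by its current ID (nil if out of range).
  def [](id)
    @docs[id]
  end
end

index = ToyIndex.new
index.add_document("a").add_document("b")
index.delete(0)                  # "a" is marked deleted; slot 0 still exists

index.add_document("c")
last_id = index.doc_count - 1    # ID of "c", taken right after the add
puts index[last_id]              # c

index.merge                      # the deleted slot is purged
puts index[last_id].inspect      # nil: the saved ID no longer points at "c"
puts index[index.doc_count - 1]  # c
```

The saved ID goes stale as soon as the merge runs, which is the reason given above for not returning document IDs from writer methods.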