Index browser inconsistent with IndexReader


#1

Follow-up to my recent post with the same subject:

It seems that within the API scripting world I can view the suspect
document by instantiating and then loading the LazyDoc returned by
Ferret::Search::Searcher.get_document(doc_id). It contains the :data
field data and is perhaps what is being used by the browser.

So my question is then this: what would cause a document in an index
to have a non-empty field when looked at through a LazyDoc, but for
which no non-empty term_vector is available for the same field on the
same document?


#2

On Tue, Jun 12, 2007 at 11:46:02AM -0400, Richard J. wrote:

> same document?
Having the field data stored in the index does not imply that the field
is searchable. It all depends on which options are set for the field (see
the FieldInfos API docs for the available options).

So it’s perfectly possible to create an index with fields f1 and f2,
where only f1 can be searched, but the contents of f2 can be shown for
search results:

fi = Ferret::Index::FieldInfos.new
fi.add_field :f1, :store => :yes, :index => :yes
fi.add_field :f2, :store => :yes, :index => :no, :term_vector => :no
i = Ferret::I.new :field_infos => fi
i << { :f1 => 'field one', :f2 => 'field two' }

i.search 'one' # finds the document
i.search 'two' # won't find anything

i[0][:f1] # outputs 'field one'
i[0][:f2] # outputs 'field two'

However, that does not explain why some documents seem to have different
indexing options than the rest. Maybe you changed them at some point
without doing a rebuild?

Jens


Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
removed_email_address@domain.invalid | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa


#3

According to my IndexReader’s field_infos, all the fields are stored
and indexed, with :with_positions_offsets for the term_vectors.

A look at a term vector for one of these :data fields gives:

#

Is this what they look like when you index with :index=>no?


#4

On Wed, Jun 13, 2007 at 08:58:36AM -0400, Richard J. wrote:

> According to my IndexReader’s field_infos, all the fields are stored
> and indexed, with :with_positions_offsets for the term_vectors.
>
> A look at a term vector for one of these :data fields gives:
>
> #
>
> Is this what they look like when you index with :index=>no?

No. With :index => :no, no term vectors can be stored, so term_vector
returns nil, not an empty term vector.

The scenario you have could happen if your analyzer choked at indexing
time and returned not a single term for your document (just like if you
had a doc full of stop words).
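
To illustrate how stored content and an empty term vector can coexist,
here is a toy sketch in plain Ruby. It is not Ferret's code: the stop-word
list and the helpers toy_analyze and toy_term_vector are made up for this
example, and the positions are simply indexes into the filtered token list.

```ruby
require 'set'

# Hypothetical stop-word list, standing in for whatever the analyzer filters.
STOP_WORDS = Set.new(%w[a an and the of to is it]).freeze

# Toy analyzer: lowercase, split on non-alphanumerics, drop stop words.
def toy_analyze(text)
  text.downcase.scan(/[a-z0-9]+/).reject { |t| STOP_WORDS.include?(t) }
end

# Toy term vector: maps each surviving term to the positions it occupies.
def toy_term_vector(text)
  tv = Hash.new { |h, k| h[k] = [] }
  toy_analyze(text).each_with_index { |term, pos| tv[term] << pos }
  tv
end

stored = "It is the and of to"     # the stored field content is non-empty
p stored.empty?                    # the text itself is still there
p toy_term_vector(stored).empty?   # but no term survived analysis
```

The stored text is kept verbatim, while the term vector only reflects what
came out of the analyzer, so the two can disagree in exactly this way.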

Since you have the stored contents, could you try to index that data
again and see if the problem can be reproduced?

Jens




#5

I ran one of the :data fields through the StandardAnalyzer - the only
one we have used - and it tokenized it with no complaints.
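
For reference, a check along these lines could look like the sketch below.
The helper tokens_for is hypothetical. It assumes the analyzer responds to
token_stream(field, text) and that the stream's next method returns a token
with a text method, or nil when exhausted, which is how I understand
Ferret's analyzers to behave. The Fake* classes are stand-ins so the sketch
runs without Ferret installed.

```ruby
# Collects the token texts an analyzer produces for one field value.
def tokens_for(analyzer, field, text)
  stream = analyzer.token_stream(field, text)
  tokens = []
  while (token = stream.next)
    tokens << token.text
  end
  tokens
end

# Hypothetical stand-ins that mimic the assumed analyzer interface.
FakeToken = Struct.new(:text)

class FakeStream
  def initialize(words)
    @words = words.dup
  end

  # Returns the next token, or nil once the words are used up.
  def next
    word = @words.shift
    word && FakeToken.new(word)
  end
end

class FakeAnalyzer
  def token_stream(_field, text)
    FakeStream.new(text.downcase.split)
  end
end

p tokens_for(FakeAnalyzer.new, :data, "Field One")
```

With Ferret loaded, passing Ferret::Analysis::StandardAnalyzer.new in place
of FakeAnalyzer.new should give the real token list. An empty result for a
non-empty field would point at the analyzer.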

Interestingly, the last batch of 1700 sites that we added
incrementally to our index does not seem to suffer from this problem.
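
To narrow down which documents are affected, a scan like the following
sketch might help. The helper suspect_doc_ids is hypothetical. It assumes
the reader offers max_doc, deleted?(id), [] for fetching a document, and
term_vector(id, field) returning nil or an object with a terms list, as I
understand Ferret's IndexReader to do. It also assumes that reading a field
from the returned document loads it. FakeReader and FakeTV are stand-ins so
the sketch runs on its own.

```ruby
# Returns the ids of documents whose stored field is non-empty but whose
# term vector for that field is missing or contains no terms.
def suspect_doc_ids(reader, field)
  (0...reader.max_doc).select do |id|
    next false if reader.deleted?(id)
    stored = reader[id][field]
    tv = reader.term_vector(id, field)
    !stored.to_s.empty? && (tv.nil? || tv.terms.empty?)
  end
end

# Hypothetical stand-ins that mimic the assumed reader interface.
FakeTV = Struct.new(:terms)

class FakeReader
  def initialize(docs, vectors)
    @docs = docs
    @vectors = vectors
  end

  def max_doc
    @docs.size
  end

  # A nil slot plays the role of a deleted document.
  def deleted?(id)
    @docs[id].nil?
  end

  def [](id)
    @docs[id]
  end

  def term_vector(id, _field)
    @vectors[id]
  end
end

demo_reader = FakeReader.new(
  [{ data: "field one" }, { data: "some text" }, nil, { data: "" }],
  [FakeTV.new(%w[field one]), FakeTV.new([]), nil, nil]
)
p suspect_doc_ids(demo_reader, :data)
```

Comparing the resulting ids against the order in which batches were added
would show whether the problem is confined to the older documents.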