Getting non-stemmed terms from IndexReader

ted · March 4, 2007, 2:02pm

I need to get the set of terms indexed by Ferret. I used
IndexReader.terms, which returns a TermEnum nicely. The only
problem is that my analyzer includes a stemming filter,
so the terms I’m getting back are all stemmed. Is there any way to
get the original unstemmed terms back from the index? Thanks.

ted · March 6, 2007, 3:14am

Hi Ted,

Unfortunately this isn’t really possible. What I’d recommend is
indexing the field twice: once with a stemming analyzer and once
without. See PerFieldAnalyzer:

http://ferret.davebalmain.com/api/classes/Ferret/Analysis/PerFieldAnalyzer.html
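
To make the "index it twice" idea concrete, here is a rough sketch in plain Ruby. It does not use Ferret at all: toy_stem is a crude stand-in for a real stemming filter, and the field names :content and :content_exact are made up for illustration. In Ferret itself you would attach the two analyzers to the two fields through PerFieldAnalyzer.

```ruby
require 'set'

# Crude suffix stripper standing in for a real stemming filter (illustration only).
def toy_stem(word)
  word.sub(/(ing|ed|s)\z/, '')
end

# One analyzer per field: :content stems, :content_exact only lowercases.
# Both field names are hypothetical.
def analyzers
  {
    content:       ->(text) { text.downcase.scan(/[a-z]+/).map { |w| toy_stem(w) } },
    content_exact: ->(text) { text.downcase.scan(/[a-z]+/) }
  }
end

# Run every text through both analyzers and collect each field's term set.
def index_terms(texts)
  terms = Hash.new { |hash, field| hash[field] = Set.new }
  texts.each do |text|
    analyzers.each { |field, analyzer| terms[field].merge(analyzer.call(text)) }
  end
  terms
end

terms = index_terms(["Searching indexed documents", "Search the index"])
puts terms[:content].to_a.sort.inspect        # stemmed terms, used for matching
puts terms[:content_exact].to_a.sort.inspect  # original word forms, used for listing
```

Searches would go against the stemmed field, and the term list would be read from the unstemmed one. The cost is a larger index, since every token is stored twice.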

Hope that helps.

Cheers,
Dave

ted · March 6, 2007, 3:46am

Thanks for the response. This is exactly what I did… I indexed the
field twice, with a different analyzer for each copy.

ted · March 6, 2007, 4:30am

On 3/6/07, Ted wrote:

I encountered another problem:

After I removed docs from the index, the doc_freq returned by
IndexReader.terms is not updated. It always shows the old number, or
a bigger number after more docs with that term are added.
So it looks like the doc_freq is not updated correctly on removal of a
doc.

This is impossible to fix without ruining performance. To fix this
problem I would basically need to optimize the index after every
deletion. In fact, you can do this yourself if you like. Just optimize
the index whenever you need to rely on the doc frequency being correct
and you have possible deletions in the index.
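
A toy model of why the count goes stale, assuming that a deletion only marks the document and that the per-term counts are rewritten only when the index is optimized. ToyIndex and its methods are invented for illustration; this is not Ferret's implementation or API.

```ruby
# Toy index: deletions are only marked, and per-term document counts
# are rebuilt only by optimize. Illustration only, not Ferret's code.
class ToyIndex
  def initialize
    @docs = {}               # doc id => unique terms in that doc
    @deleted = {}            # doc ids marked as deleted
    @doc_freq = Hash.new(0)  # term => number of docs containing it
    @next_id = 0
  end

  # Add a document and bump the count of each of its terms.
  def add(terms)
    id = @next_id
    @next_id += 1
    @docs[id] = terms.uniq
    @docs[id].each { |term| @doc_freq[term] += 1 }
    id
  end

  # Cheap delete: mark the document, leave the term statistics alone.
  def delete(id)
    @deleted[id] = true if @docs.key?(id)
  end

  def doc_freq(term)
    @doc_freq[term]
  end

  # Expensive: drop the marked documents and recount every term.
  def optimize
    @docs.reject! { |id, _terms| @deleted[id] }
    @deleted.clear
    @doc_freq = Hash.new(0)
    @docs.each_value { |terms| terms.each { |term| @doc_freq[term] += 1 } }
    self
  end
end

index = ToyIndex.new
first = index.add(%w[ferret index])
index.add(%w[ferret search])
index.delete(first)
puts index.doc_freq("ferret")  # still includes the deleted document
index.optimize
puts index.doc_freq("ferret")  # recounted without it
```

The delete stays cheap because it touches nothing but the marker. Keeping doc_freq exact would mean doing the work of optimize on every deletion.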

Cheers,
Dave

ted · March 6, 2007, 3:58am

I encountered another problem:

After I removed docs from the index, the doc_freq returned by
IndexReader.terms is not updated. It always shows the old number, or
a bigger number after more docs with that term are added.
So it looks like the doc_freq is not updated correctly on removal of a
doc.

ted · March 6, 2007, 9:24am

Got it. I had thought that ‘flush’ would do the trick, but I guess
not. I will have to call optimize, but only when necessary.
Thanks for your response.
