Determine how many documents a term occurs in

Is there a fast way to determine how many documents a term occurs in,
besides iterating through every document with TermDocEnum?

Hi,

When you do a search, the array you get back is actually a TopDocs
object, which has a total_hits value.

http://ferret.davebalmain.com/api/classes/Ferret/Search/TopDocs.html

I guess you’d just need to make sure all the fuzzy search stuff is off.

Or am I missing something here?

John.

On Sat, 2007-04-28 at 12:26 +0200, Stian Grytøyr wrote:

Is there a fast way to determine how many documents a term occurs in,
besides iterating through every document with TermDocEnum?


http://johnleach.co.uk

On 4/28/07, John L. wrote:

when you do a search, the array you get back is actually a TopDocs
object, which has a total_hits value.

Sorry, my question was imprecise. I meant how many documents in the
entire corpus (or index), not for a particular query.

Ah ok,

then use IndexWriter.doc_count:

http://ferret.davebalmain.com/api/classes/Ferret/Index/IndexWriter.html#M000089

so something like: myindex.writer.doc_count

John.

On Sat, 2007-04-28 at 19:45 +0200, Stian Grytøyr wrote:

On 4/28/07, John L. wrote:

when you do a search, the array you get back is actually a TopDocs
object, which has a total_hits value.

Sorry, my question was imprecise. I meant how many documents in the
entire corpus (or index), not for a particular query.



Hi Stian,

then I’m confused, because what you’re describing is the total hits of a
one-term search. You just need to watch out for fuzziness, like case
sensitivity.
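To illustrate the case-sensitivity point, here is a minimal sketch in plain Ruby. It does not use Ferret at all; `count_docs_with_term` and its `normalize` option are hypothetical names invented for this example. It only shows that the count for a single term depends on whether the query term is normalized the same way as the indexed text.

```ruby
# Hypothetical helper (not part of Ferret): count the documents whose
# token list contains the given term.
def count_docs_with_term(docs, term, normalize: false)
  docs.count do |text|
    tokens = text.scan(/\w+/)
    # With normalize: true, both the tokens and the term are lowercased.
    tokens = tokens.map(&:downcase) if normalize
    tokens.include?(normalize ? term.downcase : term)
  end
end

docs = ["Monkey business", "the monkey ate", "no primates here"]

puts count_docs_with_term(docs, "monkey")                   # exact match only
puts count_docs_with_term(docs, "monkey", normalize: true)  # case-folded match
```

The two calls give different counts for the same data, which is why a one-term search only equals the document count for that term when the query goes through the same normalization as the indexed text.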

An alternative is to use the TermEnum methods, though they work on
one field at a time:

http://ferret.davebalmain.com/api/classes/Ferret/Index/TermEnum.html

something like:

te = index_reader.terms(:content)
te.skip_to("monkey")
puts "The term 'monkey' occurs in #{te.doc_freq} documents in the index"
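As a rough picture of why this lookup is cheap, here is a toy term dictionary in plain Ruby. It is not Ferret's implementation; `ToyTermIndex` and its methods are made up for illustration. It also assumes that skip_to lands on the first term greater than or equal to the target, as term enumerators of this kind usually do, which would mean the returned term should be checked before trusting the count.

```ruby
# Toy model of a per-field term dictionary; not Ferret's actual implementation.
class ToyTermIndex
  def initialize
    # term => list of ids of the documents that contain it
    @postings = Hash.new { |hash, term| hash[term] = [] }
    @doc_count = 0
  end

  # Index one document's text; a document is recorded once per distinct term.
  def add(text)
    doc_id = @doc_count
    @doc_count += 1
    text.downcase.scan(/\w+/).uniq.each { |term| @postings[term] << doc_id }
    doc_id
  end

  # Number of documents containing the term: one hash lookup, no document scan.
  def doc_freq(term)
    @postings.key?(term) ? @postings[term].size : 0
  end

  # First term >= target in sorted order, or nil when there is none
  # (assumed skip_to behaviour, see the note above).
  def skip_to(target)
    @postings.keys.sort.find { |term| term >= target }
  end
end

index = ToyTermIndex.new
index.add("The monkey ate a banana")
index.add("A monkey met another monkey")
index.add("No primates here")

puts index.doc_freq("monkey")           # documents containing "monkey"
puts index.skip_to("monkeys").inspect   # lands on a different term
```

In this toy model, skipping to a term that is not in the dictionary lands on the next term in sorted order, so comparing the term you landed on with the one you asked for avoids reporting the wrong count.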

Am I warmer? ;)

John.

On Sun, 2007-04-29 at 22:59 +0200, Stian Grytøyr wrote:



On 4/29/07, John L. wrote:

then I’m confused, because what you’re describing is the total hits of a
one term search. You just need to watch out for fuzziness, like case
sensitivity.

Aha, I finally get it. I dismissed that option right away, thinking that
since I need to look up the total number of occurrences for quite a few
terms for each search, a full search for each term would become way too
slow as the index grew. But I see now that that’s not the case, so this
looks like a good solution.

Thanks, John!


Best regards,
Stian Grytøyr

On 4/29/07, John L. wrote:

then IndexWriter.doc_count

http://ferret.davebalmain.com/api/classes/Ferret/Index/IndexWriter.html#M000089

so something like: myindex.writer.doc_count

Thanks, but I still don’t think we’re quite there. I’m looking for the
number of documents (in the index) that, say, “foo” occurs in.