Determine how many documents a term occurs in

Stian_GrytSSSSyr · April 28, 2007, 12:27pm

Is there a fast way to determine how many documents a term occurs in,
besides iterating through every document with TermDocEnum?

Stian_GrytSSSSyr · April 28, 2007, 7:15pm

Hi,

when you do a search, the array you get back is actually a TopDocs
object, which has a total_hits value.

http://ferret.davebalmain.com/api/classes/Ferret/Search/TopDocs.html

I guess you’d just need to make sure all the fuzzy search stuff is off.

Or am I missing something here?

John.

On Sat, 2007-04-28 at 12:26 +0200, Stian Grytøyr wrote:

Is there a fast way to determine how many documents a term occurs in,
besides iterating through every document with TermDocEnum?

--
http://johnleach.co.uk

Stian_GrytSSSSyr · April 28, 2007, 7:46pm

On 4/28/07, John L. wrote:

when you do a search, the array you get back is actually a TopDocs
object, which has a total_hits value.

Sorry, my question was imprecise. I meant how many documents in the
entire corpus (or index), not for a particular query.

Stian_GrytSSSSyr · April 29, 2007, 10:39pm

Ah ok,

then IndexWriter.doc_count

http://ferret.davebalmain.com/api/classes/Ferret/Index/IndexWriter.html#M000089

so something like: myindex.writer.doc_count

John.

On Sat, 2007-04-28 at 19:45 +0200, Stian Grytøyr wrote:

On 4/28/07, John L. wrote:

when you do a search, the array you get back is actually a TopDocs
object, which has a total_hits value.

Sorry, my question was imprecise. I meant how many documents in the
entire corpus (or index), not for a particular query.

--
http://johnleach.co.uk

Stian_GrytSSSSyr · April 29, 2007, 11:51pm

Hi Stian,

then I’m confused, because what you’re describing is the total hits of a
one term search. You just need to watch out for fuzziness, like case
sensitivity.

An alternative is to use the TermEnum methods, though they work on
one field at a time:

http://ferret.davebalmain.com/api/classes/Ferret/Index/TermEnum.html

something like:

te = index_reader.terms(:content)
te.skip_to("monkey")
puts "The term 'monkey' occurs in #{te.doc_freq} documents in the index"

Am I warmer?

John.

On Sun, 2007-04-29 at 22:59 +0200, Stian Grytøyr wrote:

--
http://johnleach.co.uk
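To illustrate what a per-field doc_freq lookup like the one above is counting, here is a minimal pure-Ruby sketch of an inverted index. It is not Ferret's implementation: the ToyIndex class and its add_document and doc_freq methods are hypothetical names made up for this example, and the tokenizing is deliberately naive. The point is only that the count can be read from the term's entry rather than by visiting every document.

```ruby
require 'set'

# Hypothetical toy index, not Ferret's implementation: for each field it maps
# a term to the set of ids of the documents that contain it.
class ToyIndex
  attr_reader :doc_count

  def initialize
    @postings = Hash.new do |fields, field|
      fields[field] = Hash.new { |terms, term| terms[term] = Set.new }
    end
    @doc_count = 0
  end

  # Tokenize naively on non-word characters and lowercase every token,
  # standing in for what an analyzer would do at index time.
  def add_document(fields)
    doc_id = @doc_count
    fields.each do |field, text|
      text.downcase.scan(/\w+/) { |term| @postings[field][term] << doc_id }
    end
    @doc_count += 1
    doc_id
  end

  # Document frequency: how many documents contain the term in this field.
  # The term is lowercased so that it matches the indexed form.
  def doc_freq(field, term)
    by_term = @postings.fetch(field, nil)
    return 0 unless by_term && by_term.key?(term.downcase)

    by_term[term.downcase].size
  end
end

index = ToyIndex.new
index.add_document(content: "The monkey ate a banana")
index.add_document(content: "Monkey see, monkey do")
index.add_document(content: "No primates here")

puts index.doc_freq(:content, "monkey")  # prints 2
```

Because terms are lowercased both when indexing and when looking up, "MONKEY" gives the same count here; with a real analyzer the query term has to be normalized the same way the indexed text was.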

Stian_GrytSSSSyr · April 30, 2007, 1:24pm

On 4/29/07, John L. wrote:

then I’m confused, because what you’re describing is the total hits of a
one term search. You just need to watch out for fuzziness, like case
sensitivity.

Aha, I finally get it. I dismissed that option right away, thinking that
since I need to look up the total number of occurrences for quite a few
terms for each search, a full search for each term would become way too
slow as the index grew. But I see now that that’s not the case, so this
looks like a good solution.

Thanks, John!

--
Best regards,
Stian Grytøyr
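As a rough sketch of why looking up counts for many terms per search need not slow down as the index grows: if the index keeps a table from each term to the documents containing it, each count is a single table lookup. The POSTINGS table and the doc_freqs helper below are hypothetical and exist only for this example; a real index library maintains the equivalent structure itself.

```ruby
# Hypothetical postings table: term => ids of the documents containing it.
# In a real index this structure is maintained by the library.
POSTINGS = {
  "foo" => [0, 2, 5],
  "bar" => [1],
  "baz" => [0, 1, 2, 3]
}.freeze

# Return a hash of term => document frequency for each requested term.
# Each lookup is one hash access; no documents are scanned.
def doc_freqs(postings, terms)
  terms.each_with_object({}) do |term, counts|
    counts[term] = postings.fetch(term, []).size
  end
end

counts = doc_freqs(POSTINGS, %w[foo baz qux])
counts.each { |term, n| puts "#{term}: #{n}" }
```

Terms that are not in the table, such as "qux" here, simply come back with a count of 0.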

Stian_GrytSSSSyr · April 29, 2007, 10:59pm

On 4/29/07, John L. wrote:

then IndexWriter.doc_count

http://ferret.davebalmain.com/api/classes/Ferret/Index/IndexWriter.html#M000089

so something like: myindex.writer.doc_count

Thanks, but I still don’t think we’re quite there. I’m looking for the
number of documents (in the index) that, say, “foo” occurs in.