Most Popular Searches

Hi,

I have an index where each document contains an untokenized ‘url’
field. I would like to query the index for the most popular urls. In
SQL I would do this via a Group By clause. Is there anything in
Ferret that will do something similar?

I found this discussion that proposed a solution involving TermEnums:

http://www.gossamer-threads.com/lists/lucene/java-user/32272#32272

But I noticed the IndexReader.terms and IndexReader.term_docs are not
implemented. Is that solution the way to go? Would an index-only
solution perform a lot faster than a pure database solution using a
group by clause?

Any feedback is appreciated.

Tom

Wow, thanks for taking the time to put that together, Dave. That looks
very promising.

I appreciate it. If I have a chance to do performance tests, I will
report back to this list.

Tom

On 2/28/06, Tom D. wrote:

But I noticed the IndexReader.terms and IndexReader.term_docs are not
implemented. Is that solution the way to go? Would an index-only
solution perform a lot faster than a pure database solution using a
group by clause?

Hi Tom,

Those methods are implemented. Just not in IndexReader. They’re
implemented in SegmentReader and MultiReader. IndexReader is an
abstract class. Whenever you call IndexReader#open you’ll get either a
SegmentReader or a MultiReader.
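
To make the pattern concrete, here is a toy sketch in plain Ruby. It is
not Ferret's source: the Toy* class names and the array-of-segments
argument are made up for illustration, and the rule of one segment
versus many is an assumption about how the choice is made.

```ruby
# Toy stand-ins only; these are NOT Ferret's classes.
class ToyIndexReader
  # Factory: picks a concrete reader based on how many segments exist.
  def self.open(segments)
    if segments.size == 1
      ToySegmentReader.new(segments.first)
    else
      ToyMultiReader.new(segments.map { |s| ToySegmentReader.new(s) })
    end
  end

  # Abstract: the base class itself has no implementation.
  def terms
    raise NotImplementedError, "#{self.class} does not implement terms"
  end
end

class ToySegmentReader < ToyIndexReader
  def initialize(terms)
    @terms = terms
  end

  # Terms of a single segment, in sorted order.
  def terms
    @terms.sort
  end
end

class ToyMultiReader < ToyIndexReader
  def initialize(readers)
    @readers = readers
  end

  # Merges the term lists of all sub-readers.
  def terms
    @readers.flat_map(&:terms).uniq.sort
  end
end

single = ToyIndexReader.open([["b", "a"]])
multi  = ToyIndexReader.open([["b", "a"], ["c", "a"]])
puts single.class
puts multi.terms.inspect
```

Calling terms on whatever open returns works in both cases, because the
caller never holds a bare instance of the abstract class.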

Anyway, if you want to restrict searches to documents that have the url
field, you could use a filter like this:

module Ferret::Search
  # A Filter that restricts search results to only those documents with a
  # certain field called @group_name.
  class GroupFilter < Filter
    include Ferret::Index

    def initialize(group_name)
      @group_name = group_name
    end

    # Returns a BitVector with true for documents which should be permitted
    # in search results, and false for those that should not.
    def bits(reader)
      bits = Ferret::Utils::BitVector.new()
      # Start the enumeration at the first term of the field.
      term_enum = reader.terms_from(Term.new(@group_name, ""))

      begin
        if (term_enum.term() == nil)
          return bits
        end
        term_docs = reader.term_docs
        begin
          begin
            term = term_enum.term()
            # Stop once the enumeration runs past the field's terms.
            break if (term.nil? or term.field != @group_name)

            # Mark every document that contains the current term.
            term_docs.seek(term_enum)
            while term_docs.next?
              bits.set(term_docs.doc)
            end
          end while term_enum.next?
        ensure
          term_docs.close()
        end
      ensure
        term_enum.close()
      end

      return bits
    end
  end
end
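
To see what the nested loop computes without building an index, here is
a plain-Ruby sketch. The POSTINGS hash stands in for the term dictionary
and a Set stands in for the BitVector; both, along with the name
docs_with_field, are made up for illustration and are not part of
Ferret.

```ruby
require "set"

# Hypothetical postings: [field, term text] => ids of documents containing it.
POSTINGS = {
  ["title", "ferret"]          => [0, 2],
  ["url", "http://a.example"]  => [0, 3],
  ["url", "http://b.example"]  => [1],
}

# Collects the ids of every document that has at least one term in +field+.
# Sorting the keys mimics a term enumeration, which visits terms ordered by
# field and then by text.
def docs_with_field(postings, field)
  allowed = Set.new
  postings.keys.sort.each do |term_field, text|
    next if term_field < field    # not at the field yet
    break if term_field != field  # walked past the field's last term
    postings[[term_field, text]].each { |doc| allowed << doc }
  end
  allowed
end

p docs_with_field(POSTINGS, "url").sort
```

Document 2 only has a title term, so it is left out of the result.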

Or perhaps you only want the 10 most popular urls and you'd like to
create the filter like this:

filter = GroupFilter.new("url", ["url1", "url2", …, "url10"])

This filter might look something like this:

module Ferret::Search
  # A Filter that restricts search results to only those documents with a
  # certain field called @field_name with values in the @values array.
  class GroupFilter < Filter
    include Ferret::Index

    def initialize(field_name, values)
      @field_name = field_name
      @values = values
    end

    # Returns a BitVector with true for documents which should be permitted
    # in search results, and false for those that should not.
    def bits(reader)
      bits = Ferret::Utils::BitVector.new()
      # Start the enumeration at the first term of the field.
      term_enum = reader.terms_from(Term.new(@field_name, ""))

      begin
        if (term_enum.term() == nil)
          return bits
        end
        term_docs = reader.term_docs
        begin
          begin
            term = term_enum.term()
            # Stop once the enumeration runs past the field's terms.
            break if (term.nil? or term.field != @field_name)

            # Only mark documents whose term is one of the wanted values.
            if @values.include?(term.text)
              term_docs.seek(term_enum)
              while term_docs.next?
                bits.set(term_docs.doc)
              end
            end
          end while term_enum.next?
        ensure
          term_docs.close()
        end
      ensure
        term_enum.close()
      end

      return bits
    end
  end
end
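
Neither filter works out which urls are the most popular. Because the
url field is untokenized, each term is a whole url, so the number of
documents containing a term is the count a GROUP BY would give for that
url. Below is a sketch of the ranking step in plain Ruby. The top_terms
helper and the sample counts are made up for illustration; in a real
index the pairs would come from walking the term enumeration (Lucene's
TermEnum has docFreq for this; whether Ferret's port exposes an
equivalent is an assumption here).

```ruby
# Hypothetical (url, document count) pairs, as a term walk might yield them.
SAMPLE_COUNTS = [
  ["http://a.example", 12],
  ["http://b.example", 40],
  ["http://c.example", 7],
  ["http://d.example", 40],
]

# Returns the +limit+ pairs with the highest counts, ties broken by url.
def top_terms(pairs, limit)
  pairs.sort_by { |text, count| [-count, text] }.first(limit)
end

top_terms(SAMPLE_COUNTS, 2).each do |text, count|
  puts "#{count}\t#{text}"
end
```

The urls it returns could then be passed as the values array of the
filter above.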

WARNING: I haven't tested any of this code. Also, I don't know how it
would perform compared to using a GROUP BY on the database itself,
although I'd be happy to hear about any performance tests you might
do. I hope this helps.

Cheers,
Dave