Forum: Ferret Most Popular Searches

Tom D. (Guest)
on 2006-02-28 15:14
(Received via mailing list)
Hi,

I have an index where each document contains an untokenized 'url'
field.  I would like to query the index for the most popular urls.  In
SQL I would do this via a Group By clause.  Is there anything in
Ferret that will do something similar?

I found this discussion that proposed a solution involving TermEnums:

http://www.gossamer-threads.com/lists/lucene/java-...

But I noticed the IndexReader.terms and IndexReader.term_docs are not
implemented.  Is that solution the way to go?  Would an index-only
solution perform a lot faster than a pure database solution using a
group by clause?

Any feedback is appreciated.

Tom
David B. (Guest)
on 2006-03-01 01:59
(Received via mailing list)
On 2/28/06, Tom D. <removed_email_address@domain.invalid> wrote:
>
> But I noticed the IndexReader.terms and IndexReader.term_docs are not
> implemented.  Is that solution the way to go?  Would an index-only
> solution perform a lot faster than a pure database solution using a
> group by clause?

Hi Tom,

Those methods are implemented. Just not in IndexReader. They're
implemented in SegmentReader and MultiReader. IndexReader is an
abstract class. Whenever you call IndexReader#open you'll get either a
SegmentReader or a MultiReader.

Anyway, if you want to restrict searches to documents that have the url
field, you could use a filter like this:

module Ferret::Search
  # A Filter that restricts search results to only those documents with
  # a certain field called @group_name.
  class GroupFilter < Filter
    include Ferret::Index

    def initialize(group_name)
      @group_name = group_name
    end

    # Returns a BitVector with true for documents which should be
    # permitted in search results, and false for those that should not.
    def bits(reader)
      bits = Ferret::Utils::BitVector.new()
      term_enum = reader.terms_from(Term.new(@group_name, ""))

      begin
        if (term_enum.term() == nil)
          return bits
        end
        term_docs = reader.term_docs
        begin
          begin
            term = term_enum.term()
            break if (term.nil? or term.field != @group_name)

            term_docs.seek(term_enum)
            while term_docs.next?
              bits.set(term_docs.doc)
            end
          end while term_enum.next?
        ensure
          term_docs.close()
        end
      ensure
        term_enum.close()
      end

      return bits
    end
  end
end
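
For readers without Ferret at hand, the enumeration that bits performs can be sketched in plain Ruby. Everything here is a stand-in rather than Ferret's API: POSTINGS and docs_with_field are hypothetical names, the sorted array only mimics a term dictionary ordered by field and then text, and a Set plays the role of the BitVector.

```ruby
require 'set'

# Hypothetical stand-in for an index's term dictionary: entries are sorted by
# field and then by term text, each with the ids of the documents containing it.
POSTINGS = [
  ["title", "ferret",        [0, 2]],
  ["url",   "http://a.test", [0, 1, 3]],
  ["url",   "http://b.test", [2]],
  ["zip",   "12345",         [4]],
].freeze

# Returns the ids of all documents that have at least one term in +field+.
def docs_with_field(postings, field)
  matched = Set.new
  # Skip ahead to the first term at or after the field (the role that
  # terms_from plays in the filter above).
  start = postings.index { |f, _text, _docs| f >= field }
  return matched if start.nil?

  postings[start..-1].each do |f, _text, docs|
    break if f != field          # left the field, so stop enumerating
    docs.each { |doc| matched << doc }
  end
  matched
end

p docs_with_field(POSTINGS, "url").to_a.sort  # prints [0, 1, 2, 3]
```

The early break is what keeps the walk cheap: because terms are ordered by field, the loop touches only the terms of the requested field and stops at the first term that belongs to another one.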

Or perhaps you only want the 10 most popular urls and you'd like to
create the filter like this:

filter = GroupFilter.new("url", ["url1", "url2", ..., "url10"])

This filter might look something like this:

module Ferret::Search
  # A Filter that restricts search results to only those documents with
  # a certain field called @field_name with values in the @values array.
  class GroupFilter < Filter
    include Ferret::Index

    def initialize(field_name, values)
      @field_name = field_name
      @values = values
    end

    # Returns a BitVector with true for documents which should be
    # permitted in search results, and false for those that should not.
    def bits(reader)
      bits = Ferret::Utils::BitVector.new()
      term_enum = reader.terms_from(Term.new(@field_name, ""))

      begin
        if (term_enum.term() == nil)
          return bits
        end
        term_docs = reader.term_docs
        begin
          begin
            term = term_enum.term()
            break if (term.nil? or term.field != @field_name)

            if @values.include?(term.text)
              term_docs.seek(term_enum)
              while term_docs.next?
                bits.set(term_docs.doc)
              end
            end
          end while term_enum.next?
        ensure
          term_docs.close()
        end
      ensure
        term_enum.close()
      end

      return bits
    end
  end
end
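
The list of the most popular values still has to come from somewhere. A minimal sketch of that "group by" step, assuming you can obtain the documents (or just a document count) for each url term while enumerating: URL_POSTINGS and most_popular are hypothetical names, the data is made up, and nothing below calls Ferret itself.

```ruby
# Hypothetical stand-in data: each untokenized 'url' term mapped to the ids of
# the documents that contain it.
URL_POSTINGS = {
  "http://a.test" => [0, 1, 3],
  "http://b.test" => [2],
  "http://c.test" => [4, 5],
}.freeze

# Returns the +limit+ most frequent values as [value, document_count] pairs,
# most frequent first; ties are broken by value so the result is deterministic.
def most_popular(postings, limit)
  counts = postings.map { |value, docs| [value, docs.size] }
  counts.sort_by { |value, count| [-count, value] }.first(limit)
end

top = most_popular(URL_POSTINGS, 2)
p top  # prints [["http://a.test", 3], ["http://c.test", 2]]

# The bare values are what a values-based filter like the one above would take.
values = top.map { |value, _count| value }
```

Because the url field is untokenized, each distinct url is a single term, so counting the documents per term gives the same numbers a SQL GROUP BY on that column would.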

WARNING: I haven't tested any of this code. Also, I don't know how it
would perform compared to using a group_by on the database itself
although I'd be happy to hear about any performance tests you might
do. I hope this helps.

Cheers,
Dave
Tom D. (Guest)
on 2006-03-01 14:12
(Received via mailing list)
Wow, thanks for taking the time to put that together, Dave.  That looks
very promising.

I appreciate it.  If I have a chance to do performance tests, I will
report back to this list.

Tom