On 2/28/06, Tom D. wrote:
> But I noticed the IndexReader.terms and IndexReader.term_docs are not
> implemented. Is that solution the way to go? Would an index-only
> solution perform a lot faster than a pure database solution using a
> group by clause?
Hi Tom,
Those methods are implemented. Just not in IndexReader. They’re
implemented in SegmentReader and MultiReader. IndexReader is an
abstract class. Whenever you call IndexReader#open you’ll get either a
SegmentReader or a MultiReader.
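To illustrate why the methods look unimplemented when you read IndexReader on its own, here is a minimal sketch of the pattern in plain Ruby. The classes below (AbstractReader, SingleSegmentReader, MultiSegmentReader) are hypothetical stand-ins, not Ferret's actual source; they only show an abstract base class whose open method hands back a concrete subclass.

```ruby
# Hypothetical stand-ins for illustration only; not Ferret's real classes.
class AbstractReader
  # Factory: picks a concrete subclass based on how many segments exist.
  def self.open(segments)
    if segments.size == 1
      SingleSegmentReader.new(segments.first)
    else
      MultiSegmentReader.new(segments)
    end
  end

  # Abstract method: concrete readers must override this.
  def terms
    raise NotImplementedError, "#{self.class} must implement terms"
  end
end

class SingleSegmentReader < AbstractReader
  def initialize(segment)
    @segment = segment
  end

  # Returns this segment's terms in sorted order.
  def terms
    @segment.sort
  end
end

class MultiSegmentReader < AbstractReader
  def initialize(segments)
    @readers = segments.map { |s| SingleSegmentReader.new(s) }
  end

  # Merges the term lists of all sub-readers into one sorted list.
  def terms
    @readers.flat_map(&:terms).uniq.sort
  end
end

single = AbstractReader.open([["b", "a"]])
multi  = AbstractReader.open([["b", "a"], ["c", "a"]])
```

In this sketch the caller never names the concrete class; it asks the base class to open a reader and then calls terms on whatever comes back.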
Anyway, if you want to run searches on all documents with the url
field you could use a filter like this:
module Ferret::Search
  # A Filter that restricts search results to only those documents with a
  # certain field called @group_name.
  class GroupFilter < Filter
    include Ferret::Index

    def initialize(group_name)
      @group_name = group_name
    end

    # Returns a BitVector with true for documents which should be permitted in
    # search results, and false for those that should not.
    def bits(reader)
      bits = Ferret::Utils::BitVector.new()
      term_enum = reader.terms_from(Term.new(@group_name, ""))
      begin
        if (term_enum.term() == nil)
          return bits
        end
        term_docs = reader.term_docs
        begin
          begin
            term = term_enum.term()
            break if (term.nil? or term.field != @group_name)
            term_docs.seek(term_enum)
            while term_docs.next?
              bits.set(term_docs.doc)
            end
          end while term_enum.next?
        ensure
          term_docs.close()
        end
      ensure
        term_enum.close()
      end
      return bits
    end
  end
end
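The loop in bits seeks to the term (field, "") and stops as soon as the field name changes. That only works if terms are enumerated in sorted order by field and then by text, which I am assuming here because that is how Lucene (which Ferret is ported from) orders them. A minimal sketch of the same walk in plain Ruby, using a hypothetical in-memory array of [field, text, doc_ids] entries in place of a real index and a Set in place of the BitVector:

```ruby
require "set"

# Hypothetical in-memory postings, sorted by field and then by term text.
# Each entry is [field, text, ids of the documents containing that term].
postings = [
  ["title", "ferret", [0, 1]],
  ["url",   "a.com",  [0, 2]],
  ["url",   "b.com",  [3]],
  ["zip",   "12345",  [1]],
]

# Collects the ids of all documents that have any term in +field+.
def docs_with_field(postings, field)
  docs = Set.new
  # Seek to the first entry at or after [field, ""].
  start = postings.index { |f, text, _| ([f, text] <=> [field, ""]) >= 0 }
  return docs if start.nil?

  postings[start..-1].each do |f, _text, doc_ids|
    break if f != field # walked past the last term of this field
    doc_ids.each { |id| docs << id }
  end
  docs
end

url_docs = docs_with_field(postings, "url")
```

Document 1 in the made-up data has no url term, so it is the only one left out of url_docs.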
Or perhaps you only want the 10 most popular urls and you’d like to
create the filter like this:
filter = GroupFilter.new("url", ["url1", "url2", ..., "url10"])
This filter might look something like this:
module Ferret::Search
  # A Filter that restricts search results to only those documents with a
  # certain field called @field_name with values in the @values array.
  class GroupFilter < Filter
    include Ferret::Index

    def initialize(field_name, values)
      @field_name = field_name
      @values = values
    end

    # Returns a BitVector with true for documents which should be permitted in
    # search results, and false for those that should not.
    def bits(reader)
      bits = Ferret::Utils::BitVector.new()
      term_enum = reader.terms_from(Term.new(@field_name, ""))
      begin
        if (term_enum.term() == nil)
          return bits
        end
        term_docs = reader.term_docs
        begin
          begin
            term = term_enum.term()
            break if (term.nil? or term.field != @field_name)
            if @values.index(term.text)
              term_docs.seek(term_enum)
              while term_docs.next?
                bits.set(term_docs.doc)
              end
            end
          end while term_enum.next?
        ensure
          term_docs.close()
        end
      ensure
        term_enum.close()
      end
      return bits
    end
  end
end
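If the list of most popular urls is not known in advance, it could be derived from per-value document counts before constructing the filter. A rough sketch in plain Ruby, where the doc_freqs hash is made-up data standing in for whatever counts the term enumeration (or the database) would report, and top_values is a hypothetical helper:

```ruby
# Hypothetical document counts per url value; made-up numbers for illustration.
doc_freqs = {
  "url1" => 42,
  "url2" => 7,
  "url3" => 19,
  "url4" => 3,
}

# Returns the +n+ values with the highest document counts, most popular first.
# Ties are broken alphabetically so the result is deterministic.
def top_values(freqs, n)
  freqs.sort_by { |value, count| [-count, value] }
       .first(n)
       .map { |value, _count| value }
end

popular = top_values(doc_freqs, 2)
```

The resulting array is what would be passed as the values argument of the filter above.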
WARNING: I haven’t tested any of this code. Also, I don’t know how it
would perform compared to using a GROUP BY on the database itself,
although I’d be happy to hear about any performance tests you might
do. I hope this helps.
Cheers,
Dave