Hi,

I have an index where each document contains an untokenized 'url' field. I would like to query the index for the most popular urls. In SQL I would do this via a Group By clause. Is there anything in Ferret that will do something similar?

I found this discussion that proposed a solution involving TermEnums:

http://www.gossamer-threads.com/lists/lucene/java-user/32272#32272

But I noticed the IndexReader.terms and IndexReader.term_docs are not implemented. Is that solution the way to go? Would an index-only solution perform a lot faster than a pure database solution using a group by clause?

Any feedback is appreciated.

Tom
on 2006-02-28 14:14
on 2006-03-01 00:59
On 2/28/06, Tom Davies <atomgiant@gmail.com> wrote:
>
> But I noticed the IndexReader.terms and IndexReader.term_docs are not
> implemented. Is that solution the way to go? Would an index-only
> solution perform a lot faster than a pure database solution using a
> group by clause?

Hi Tom,

Those methods are implemented, just not in IndexReader. They're implemented in SegmentReader and MultiReader. IndexReader is an abstract class. Whenever you call IndexReader#open you'll get either a SegmentReader or a MultiReader.

Anyway, if you want to run searches on all documents with the url field you could use a filter like this:

    module Ferret::Search
      # A Filter that restricts search results to only those documents with a
      # certain field called @group_name.
      class GroupFilter < Filter
        include Ferret::Index

        def initialize(group_name)
          @group_name = group_name
        end

        # Returns a BitVector with true for documents which should be permitted in
        # search results, and false for those that should not.
        def bits(reader)
          bits = Ferret::Utils::BitVector.new()
          term_enum = reader.terms_from(Term.new(@group_name, ""))
          begin
            if (term_enum.term() == nil)
              return bits
            end
            term_docs = reader.term_docs
            begin
              begin
                term = term_enum.term()
                break if (term.nil? or term.field != @group_name)
                term_docs.seek(term_enum)
                while term_docs.next?
                  bits.set(term_docs.doc)
                end
              end while term_enum.next?
            ensure
              term_docs.close()
            end
          ensure
            term_enum.close()
          end
          return bits
        end
      end
    end

Or perhaps you only want the 10 most popular urls and you'd like to create the filter like this:

    filter = GroupFilter.new("url", ["url1", "url2", ..., "url10"])

This filter might look something like this:

    module Ferret::Search
      # A Filter that restricts search results to only those documents with a
      # certain field called @field_name with values in the @values array.
      class GroupFilter < Filter
        include Ferret::Index

        def initialize(field_name, values)
          @field_name = field_name
          @values = values
        end

        # Returns a BitVector with true for documents which should be permitted in
        # search results, and false for those that should not.
        def bits(reader)
          bits = Ferret::Utils::BitVector.new()
          term_enum = reader.terms_from(Term.new(@field_name, ""))
          begin
            if (term_enum.term() == nil)
              return bits
            end
            term_docs = reader.term_docs
            begin
              begin
                term = term_enum.term()
                break if (term.nil? or term.field != @field_name)
                if @values.index(term.text)
                  term_docs.seek(term_enum)
                  while term_docs.next?
                    bits.set(term_docs.doc)
                  end
                end
              end while term_enum.next?
            ensure
              term_docs.close()
            end
          ensure
            term_enum.close()
          end
          return bits
        end
      end
    end

WARNING: I haven't tested any of this code. Also, I don't know how it would perform compared to using a group_by on the database itself, although I'd be happy to hear about any performance tests you might do.

I hope this helps.

Cheers,
Dave
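
For the original question (finding the most popular urls), the same term walk can count documents per term instead of setting bits. What follows is only a minimal sketch under stated assumptions: it relies on nothing beyond the reader calls already used in the filters above (terms_from, term_docs, seek, next?, doc), and it is exercised against hypothetical in-memory stand-ins (Term, FakeTermEnum, FakeTermDocs, FakeReader) rather than a real Ferret index. The helper name top_terms is made up for illustration, and none of this has been run against Ferret itself.

```ruby
# Hypothetical in-memory stand-ins so the sketch runs without Ferret.
# They mimic only the calls used in the filters above; with a real index you
# would drop them and `include Ferret::Index` to get Term instead.
Term = Struct.new(:field, :text)

class FakeTermEnum
  def initialize(terms)
    @terms = terms
    @pos = 0
  end

  def term
    @terms[@pos] # nil once the enumeration is exhausted
  end

  def next?
    (@pos += 1) < @terms.size
  end

  def close; end
end

class FakeTermDocs
  def initialize(postings)
    @postings = postings
    @docs = []
    @pos = -1
  end

  # Position on the documents of the term the enumeration currently points at.
  def seek(term_enum)
    @docs = @postings.fetch(term_enum.term, [])
    @pos = -1
  end

  def next?
    (@pos += 1) < @docs.size
  end

  def doc
    @docs[@pos]
  end

  def close; end
end

class FakeReader
  # postings maps Term => array of document numbers.
  def initialize(postings)
    @postings = postings
    @sorted = postings.keys.sort_by(&:to_a)
  end

  # Enumerate terms in (field, text) order, starting at the given term.
  def terms_from(start)
    FakeTermEnum.new(@sorted.select { |t| (t.to_a <=> start.to_a) >= 0 })
  end

  def term_docs
    FakeTermDocs.new(@postings)
  end
end

# Count documents per term of `field` and return the `limit` most frequent
# [text, count] pairs, with ties broken alphabetically.
def top_terms(reader, field, limit = 10)
  counts = Hash.new(0)
  term_enum = reader.terms_from(Term.new(field, ""))
  begin
    return [] if term_enum.term.nil?
    term_docs = reader.term_docs
    begin
      loop do
        term = term_enum.term
        break if term.nil? || term.field != field
        term_docs.seek(term_enum)
        counts[term.text] += 1 while term_docs.next?
        break unless term_enum.next?
      end
    ensure
      term_docs.close
    end
  ensure
    term_enum.close
  end
  counts.sort_by { |text, n| [-n, text] }.first(limit)
end

reader = FakeReader.new(
  Term.new("title", "hello")            => [0],
  Term.new("url", "http://a.example/")  => [0, 2, 5],
  Term.new("url", "http://b.example/")  => [1],
  Term.new("url", "http://c.example/")  => [3, 4]
)
p top_terms(reader, "url", 2)
# prints [["http://a.example/", 3], ["http://c.example/", 2]]
```

The resulting url list could then be passed as the values array of the second filter. Counting by walking every posting touches each document once per term, so how it compares with a database group by on a large index remains an open question here too.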
on 2006-03-01 13:12
Wow, thanks for taking the time to put that together, Dave. That looks very promising. I appreciate it. If I have a chance to do performance tests, I will report back to this list.

Tom