How about something like this, where "field2" is the field you want to
collect:

values = []
index.search_each(query) do |doc, score|
  values.push index[doc]["field2"]
end
Hi Neville,
It would work for a small result set, but that is not an assumption I
want to make. I hope there is a way to get this information from Ferret
directly.
Sergei.
Neville B. wrote:
How about something like this, where "field2" is the field you want to
collect:

values = []
index.search_each(query) do |doc, score|
  values.push index[doc]["field2"]
end
Why would this only work for a small result set? Are you looking for a
list of the terms from the other field as tokenized by Ferret, or just
for the value you put in that field during indexing?
-Lee
While I don’t completely understand all the constraints, it seems as
though a generalized version of Neville’s solution that goes through all
fields in the document would work just fine.
i.e.
fields = []
index.search_each(query) do |doc, score|
  fields += doc.all_fields
end
values = fields.collect { |f| f.string_value }
I don’t really know what part of ‘Ferret doing this’ would be … the
information would have to be stored and retrieved from the index. Please
elaborate if we do not seem to completely understand the problem.
Let me illustrate my problem a bit more.
There is an index with 1.2M books in it. Every book has a category
field, and every book can currently be in stock, which is recorded in a
stock field. I generally expect 50-60% of the books to be stocked, which
leaves me with about 600,000 books I would need to iterate over to find
out which categories are currently stocked.
It sounds like a borderline case where one would think a database would
be more appropriate, but the ability to do advanced searches over this
collection of books is a top priority, and a database would not provide
that.
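As a rough illustration of what the naive approach costs, here is a minimal pure-Ruby sketch with made-up in-memory data (plain hashes standing in for indexed documents, not real Ferret calls): deduplicating with a Set keeps the result small, but the loop still has to touch every stocked hit.

```ruby
require "set"

# Hypothetical stand-in for search hits: each hit carries the stored
# category and stock fields. With ~600,000 stocked books, this loop
# visits every hit just to recover a handful of distinct categories.
hits = [
  { category: "fiction", stock: true },
  { category: "poetry",  stock: true },
  { category: "fiction", stock: true },
  { category: "history", stock: false }
]

stocked_categories = Set.new
hits.each do |doc|
  stocked_categories << doc[:category] if doc[:stock]
end

stocked_categories.to_a.sort  # => ["fiction", "poetry"]
```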
–
Sergei S.
Red Leaf Software LLC
web: http://redleafsoft.com
I would think that Ferret could provide the set of terms associated with
a set of documents without pulling those documents out one by one.
–
Sergei S.
Red Leaf Software LLC
web: http://redleafsoft.com
Jeremy Bensley wrote:
I don’t really know what part of ‘Ferret doing this’ would be … the
information would have to be stored and retrieved from the index. Please
elaborate if we do not seem to completely understand the problem.
I’m not familiar enough with Ferret, but I do this sort of filtering and
set intersection with Java Lucene, primarily using Solr, from a Ruby
on Rails front-end.
I build up bit sets (using Solr’s new OpenBitSet class) that
represent “all items collected” and apply that filter to searches, and
also intersect it (by ANDing bit sets) with other sets such as “all
objects from 1861”, “all poetry genre objects”, and so on. I’ve
also customized Solr to return facet counts, so given your
example it could show how many books were in stock in each category
and let you filter to see all those books easily too. Using
these kinds of set intersection operations can even bypass the
traditional Lucene search entirely, by simply dealing with efficiently
structured sets of document ids.
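In pure Ruby, the same idea can be sketched with Integers used as bit sets; this is a toy stand-in for Solr’s OpenBitSet, and the document ids and filters below are made up for illustration.

```ruby
# Each filter is an Integer treated as a bit set keyed by document id.
in_stock = (1 << 0) | (1 << 2) | (1 << 3)   # docs 0, 2, 3 are in stock
poetry   = (1 << 2) | (1 << 3) | (1 << 5)   # docs 2, 3, 5 are poetry

# Set intersection is bitwise AND; a facet count is the population count.
stocked_poetry = in_stock & poetry
facet_count    = stocked_poetry.to_s(2).count("1")  # => 2 (docs 2 and 3)
```

The appeal of this representation is that intersecting two filters over millions of documents is a single AND over packed machine words, rather than a per-document loop.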
Erik
Thank you Erik. It is not clear to me what it would look like in Ferret,
but it sounds like a good direction to dig in.
Erik H. wrote:
I’m not familiar enough with Ferret, but I do this sort filtering and
set intersections with Java Lucene, primarily using Solr, from a Ruby
on Rails front-end.
I build up bit sets (using Solr’s new OpenBitSet class) that
represent “all items collected” and apply that filter to searches and
also intersect (using bit set ANDing) with other sets such as “all
objects from 1861” and “all poetry genre objects”, and so on. I’ve
also customized Solr to return back facet counts, so given your
example it could show how many books were in stock in each category
and allow you to filter to see all those books easily too. Using
these types of set intersection operations even bypasses the
traditional Lucene search by simply dealing with efficiently
structured sets of document id’s.
Erik
On 6/17/06, Erik H. [email protected] wrote:
if (term == null || !term.field().equals(field)) break; }
Ferret has a comparable API underneath that should make this sort of
thing feasible in pure Ruby somehow.
It is similar in Ferret. Have a look here to see the solution to a
similar problem:
http://www.ruby-forum.com/topic/56232#40931
Hope that helps.
Cheers,
Dave
On Jun 16, 2006, at 3:08 PM, Sergei S. wrote:
Thank you Erik. It is not clear to me what it would look like in
Ferret, but it sounds like a good direction to dig in.
In Java, building up such filters is done with code like this:
TermEnum termEnum = reader.terms(new Term(field, ""));
TermDocs termDocs = reader.termDocs();
while (true) {
    Term term = termEnum.term();
    if (term == null || !term.field().equals(field)) break;
    termDocs.seek(term);
    OpenBitSet bitSet = new OpenBitSet(reader.numDocs());
    while (termDocs.next()) {
        bitSet.set(termDocs.doc());
    }
    // ... cache bitSet for future use ...
    if (!termEnum.next()) break;
}
Ferret has a comparable API underneath that should make this sort of
thing feasible in pure Ruby somehow.
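The same term walk can be sketched in pure Ruby using a hash as a toy inverted index (term => posting list of doc ids); the data and names below are illustrative, not Ferret’s actual TermEnum/TermDocs API.

```ruby
# Toy inverted index: each term maps to the ids of the documents
# containing it (made-up data for illustration).
postings = {
  "fiction" => [0, 2, 7],
  "history" => [1, 2],
  "poetry"  => [3, 7]
}

# Walk every term once and cache a bit set of its matching documents,
# mirroring the TermEnum/TermDocs loop in the Java snippet above.
bit_sets = {}
postings.each do |term, doc_ids|
  bits = 0
  doc_ids.each { |id| bits |= 1 << id }
  bit_sets[term] = bits
end

bit_sets["fiction"] & bit_sets["history"]  # docs in both: just doc 2
```

Once the per-term bit sets are cached, answering "which stocked books are in category X" never revisits the documents themselves.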
Erik