Category Number Results returned

I am looking to have a number of categories populated from my results of
a search. For example, searching on “sport” would display all results
for sport. I want to also have a number of categories to refine the
documents down. So by clicking on the “Fishing” category or the
“Shooting” category, I would only see the results on sport around that
category.

Now for the fun. I want to determine the total number of results in each
category for a given search. So in the above, for a search on sport I
want to display the results but in the Fishing item I want to say how
many results there are in total before the user clicks on the item. For
example in the pull down I want to display “Fishing (10001), Shooting
(2003)”.

I was going to do this in Ruby by doing a simple count for each category
item on the returned result set, but I believe that this would mean
returning all the results of a given query to Ruby in order to do this
count and I am concerned that this would cause performance issues for
large result sets.

If I put pagination into the mix and only display the first 50 results
on the screen, would this add an additional complexity or would this
just be called through Ruby?

Thanks for your assistance with this…

On 7/10/06, BlueJay [email protected] wrote:

many results there are in total before the user clicks on the item. For
example in the pull down I want to display “Fishing (10001), Shooting
(2003)”.

Hi Clare,
The fastest way to do this would be to run the query multiple times.
So for your “sport” example you’d do something like this;

fishing_count = index.search_each("sport AND fishing", :num_docs => 1) {}
shooting_count = index.search_each("sport AND shooting", :num_docs => 1) {}

etc.

Then go ahead and paginate your query as you usually would.

I was going to do this in Ruby by doing a simple count for each category
item on the returned result set, but I believe that this would mean
returning all the results of a given query to Ruby in order to do this
count and I am concerned that this would cause performance issues for
large result sets.

Quite possibly. But running the query multiple times should be fine in
terms of performance. You could use filters instead of the code I
demonstrated above to further improve performance.

If I put pagination into the mix and only display the first 50 results
on the screen, would this add an additional complexity or would this
just be called through Ruby?

Thanks for your assistance with this…

I’m not exactly sure what you mean here when you say “would this be
called through ruby”. I hope I’ve already answered your question. Let
me know if I didn’t.

Cheers,
Dave

Thank you very much for your quick response.

I have several subcategories (a taxonomy, really) and what I was thinking
of doing was to handle this in 2 queries. Index the data as per normal so that you
can do the full text search but also index the structure of the taxonomy
and have each branch contain the records that contain it.
Run one big search over the fulltext to get the list of hits and then
use this list as a query against the second index to get all the
category bits.

This would be a big query, although it should be quick, but I
would need to re-index the category bits every time a document was added.

Does this make sense, and would it make sense in Ferret? I have done
this before in another search engine that required special category
manipulation, but never with Ferret, and I am not sure how to go about
doing this in Ferret.

I am not sure about your idea around filtering the results.

On 7/10/06, BlueJay [email protected] wrote:

I have several sub categories (taxonomy really) and what I was thinking
of doing was this in 2 queries. Index the data as per normal so that you
can do the full text search but also index the structure of the taxonomy
and have each branch contain the records that contain it.
Run one big search over the fulltext to get the list of hits and then
use this list as a query against the second index to get all the
category bits.

I’m not sure what you mean by “category bits”. Can you possibly
implement the categories like this;

sport/
sport/shooting/
sport/fishing/
sport/fishing/fly
sport/fishing/deep_sea
etc.

Then, let’s say you have a query in query_str. You can get all results
in the sport category like this;

index.search_each(query_str + " AND category:sport/*") {
    # ...
}

You can get all results in the fishing category like this;

index.search_each(query_str + " AND category:sport/fishing/*") {
    # ...
}
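
For readers without a Ferret index to hand, the effect of the trailing wildcard can be sketched in plain Ruby. This only illustrates prefix matching on path-style category strings; it is not Ferret's query API, and the DOCS array and the in_category helper are hypothetical names made up for the example.

```ruby
# Hypothetical documents, each tagged with one path-style category.
DOCS = [
  { id: 1, category: "sport/shooting/" },
  { id: 2, category: "sport/fishing/fly" },
  { id: 3, category: "sport/fishing/deep_sea" },
  { id: 4, category: "news/local/" }
].freeze

# Returns the documents whose category starts with the given prefix,
# which is what a query such as category:sport/fishing/* is meant to select.
def in_category(docs, prefix)
  docs.select { |doc| doc[:category].start_with?(prefix) }
end

puts in_category(DOCS, "sport/").map { |d| d[:id] }.inspect
puts in_category(DOCS, "sport/fishing/").map { |d| d[:id] }.inspect
```

Because the prefix for a parent is also a prefix of every child path, a document filed under sport/fishing/fly is picked up by both selections without being indexed twice.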

Am I making sense?

This would be a big query though - although it should be quick but I
would need to re-index the category bits everytime a document was added.

You’ve lost me. Could you give some example code?

Does this make sense and/or would it make sense in Ferret. I have done
this before in another search engine that required special category
manipulation but never with Ferret and not sure how to go about doing
this in Ferret.

I am not sure about your idea around filtering the results

I’ll explain filtering once I understand better what it is you are
trying to do.

Cheers,
Dave

David B. wrote:

David

Thanks for your continued help and assistance.

I don’t have code at this stage because I started writing it one way and
realised that the way I was writing it through counts in Ruby would not
work because of pagination.

A little more background is in order. The user will be presented with a
pull down menu with 5 selections in a main category. Doing 6 queries
(one main query and 5 count queries) in this instance is not a problem.
The problem arises when they select one of these categories.

They will then be presented with up to 5 other category structures. One
would be new or old, another would be type (up to 5 nodes), another
would be, for example, book type (such as fiction, non-fiction,
autobiography) etc. (up to 20 categories), another could have up to 40
categories. The user is free to select any of these category nodes
because they may be interested in old books and fiction. I will
therefore have to populate all of the nodes with the number of documents
in each node. This could leave me spawning 60-odd queries to count
the number of documents in each node. Subsequent selections of nodes
would refine the result set down further.

What I really would like to do is 2 or 3 queries. One which does the
normal search over the document set (collection) and the second to
populate each node in the classification structure with the number of
documents that match each node.

It is pretty easy in 2 queries to tell if there are any documents in
each node but doing a count over all the nodes is more tricky. I was
originally going to have another table which had a row for each node
with the name of the node (and structure) in one field and the
document_id’s in another field. For example, [Fishing, “doc1 doc2 doc3
doc4”], [Fishing/Fiction, “doc2 doc3”], [Fishing/Non Fiction, “doc1”]
etc. I would then get a result set that provided all the categories that
had hits against a given query. However, it does not provide the number
of documents against each node. So I could not populate the pull down
categories with Fishing (2), Fiction (1), Non Fiction (1) etc.

Therefore, what I really need is a function that will return the number
of documents in each node of a given classification structure. An
addition to the Num_Docs capability already available perhaps.

I could easily produce a results set that would be like this…

Fishing doc1
Fishing doc2
Fishing/Fiction doc3
Fishing/Fiction doc1
Fishing/Non Fiction doc4
etc…

Num_Docs would provide 5 in this instance but what I really want is:
Fishing 2
Fishing/Fiction 2
Fishing/Non Fiction 1
etc…
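
In plain Ruby, turning rows like these into per-node counts is a single pass with a hash. The sketch below hard-codes the rows from the example as an assumption; ROWS and node_counts are hypothetical names, and in practice the expensive part is fetching the rows, not counting them.

```ruby
# Rows as in the example above: [category node, document id].
ROWS = [
  ["Fishing", "doc1"],
  ["Fishing", "doc2"],
  ["Fishing/Fiction", "doc3"],
  ["Fishing/Fiction", "doc1"],
  ["Fishing/Non Fiction", "doc4"]
].freeze

# Counts how many rows fall under each category node.
def node_counts(rows)
  counts = Hash.new(0)
  rows.each { |node, _doc_id| counts[node] += 1 }
  counts
end

node_counts(ROWS).each { |node, count| puts "#{node} #{count}" }
```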

All that, and done in 1 or 2 queries over and above the original
search… Simple eh!

I hope that I have not confused you too much, but this is something that
I desperately need or my project is kaput!

I found this:
http://www.mail-archive.com/[email protected]/msg00343.html and

http://www.ruby-forum.com/topic/56232#40931

Do you think that this is the way to go?

Thanks very much.

David B. wrote:

I think I finally understand what you want now and I do think this is
the way to go. What you will need to do is build BitVectors for each
of your categories and sub-categories using the examples in those
threads. Or you could just use a QueryFilter.

filter = QueryFilter.new(PrefixQuery.new(:category, "fishing"))
fishing_bits = filter.bits(index_reader)

filter = QueryFilter.new(PrefixQuery.new(:category, "fishing/fiction"))
fishing_fiction_bits = filter.bits(index_reader)

filter = QueryFilter.new(PrefixQuery.new(:category, "fishing/nonfiction"))
fishing_nonfiction_bits = filter.bits(index_reader)

This assumes that everything in fishing/fiction is also in fishing/.
In your example, it doesn’t seem to be the case, so you should use a
TermQuery instead of a PrefixQuery.

Now you just need to run your search the same way. Something like this;

query = query_parser.parse(query_str)
query_bits = QueryFilter.new(query).bits(index_reader)

And now you can get your counts like this;

fishing_count = (fishing_bits & query_bits).count
fishing_fiction_count = (fishing_fiction_bits & query_bits).count
fishing_nonfiction_count = (fishing_nonfiction_bits & query_bits).count

Sadly, this code only works in theory since I haven’t released the code
that &s bit vectors yet and I used the new style PrefixQuery
declarations so they won’t work either. But if this solution seems
like it will work for you and you can wait a week, you’ll be set.
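
Until those pieces are available, the counting idea itself can be tried in plain Ruby by using an Integer as a bitset, one bit per document number. This is only a sketch of the technique, not Ferret's BitVector; bits_for, bit_count and the document numbers are assumptions made up for the example.

```ruby
# Builds an Integer bitset with one bit set per document number.
def bits_for(doc_ids)
  doc_ids.reduce(0) { |bits, id| bits | (1 << id) }
end

# Counts the set bits, i.e. the number of documents in the bitset.
def bit_count(bits)
  bits.to_s(2).count("1")
end

# Hypothetical document numbers; a real index would supply these.
FISHING_BITS         = bits_for([0, 1, 2, 3, 4, 5])
FISHING_FICTION_BITS = bits_for([0, 1, 2])
QUERY_BITS           = bits_for([1, 2, 4, 9]) # documents matching the search

puts bit_count(FISHING_BITS & QUERY_BITS)         # hits that are in fishing
puts bit_count(FISHING_FICTION_BITS & QUERY_BITS) # hits that are in fishing/fiction
```

The category bitsets depend only on the index, so they can be built once and reused; each search then costs one bitset for the query plus one AND and one count per category.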

Cheers,
Dave

Dave

Thanks very much for this and I can wait a week for this to be released.
I am sorry if I was not clear about this but everything in the sub
categories will have to be in the category above as this is the way that
the system is designed.

Fishing contains documents A B C D E F G H I J
Fishing_Fiction contains A B C
Fishing_Non_Fiction contains D and E
Fishing_Fiction_New contains A B
Fishing_Fiction_Old contains C
etc.
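
The containment described here (every sub-category's documents also sit in its parent) is what a prefix-style category filter relies on, and it is easy to check in plain Ruby. The sketch copies the sets from the example; the PARENT map and the hierarchy_consistent? helper are hypothetical.

```ruby
require "set"

# Document sets copied from the example above.
CATEGORY_DOCS = {
  "Fishing"             => Set.new(%w[A B C D E F G H I J]),
  "Fishing_Fiction"     => Set.new(%w[A B C]),
  "Fishing_Non_Fiction" => Set.new(%w[D E]),
  "Fishing_Fiction_New" => Set.new(%w[A B]),
  "Fishing_Fiction_Old" => Set.new(%w[C])
}.freeze

# Hypothetical child => parent map for the taxonomy.
PARENT = {
  "Fishing_Fiction"     => "Fishing",
  "Fishing_Non_Fiction" => "Fishing",
  "Fishing_Fiction_New" => "Fishing_Fiction",
  "Fishing_Fiction_Old" => "Fishing_Fiction"
}.freeze

# True when every child category's documents are also in its parent.
def hierarchy_consistent?(category_docs, parent)
  parent.all? { |child, par| category_docs[child].subset?(category_docs[par]) }
end

puts hierarchy_consistent?(CATEGORY_DOCS, PARENT)
```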

I am assuming that I need to still wait in this case? I will try and
understand this in more detail in the meantime.

Thanks once again for all your assistance.

On 7/11/06, BlueJay [email protected] wrote:

Thanks very much for this and I can wait a week for this to be released.

Great. A word of warning though, it’s all new code and you’ll be
riding on the bleeding edge. But hopefully it will stabilize quickly.
I’m working on this full time at the moment (when I’m not answering
emails ;-)).

Cheers,
Dave

Dave

One last thought on this… because this will be new code… originally
I was going to write the count as a client side piece of code to count
the documents in each category. I realised that I would have to return
the full result set in order to do this which would cause problems with
performance.

If I were to write this as a server side script, outside of ferret, I
believe that I could achieve the same result as in your example. Can
you think of any gotchas that would make this a stupid idea?

Thanks

(Sorry in advance for taking this outside Ferret!)

David B. wrote:

On 7/12/06, BlueJay [email protected] wrote:

If I were to write this as a server side script, outside of ferret, I
believe that I could achieve the same result as in your example. Can
you think of any gotchas that would make this a stupid idea?

If you mean grab the whole result set and loop through every result
taking a running count then yes, this should work fine. I’d say my
example would be a lot faster but you never know without trying it.
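
The running-count approach, combined with the 50-per-page display mentioned earlier in the thread, might look like the sketch below. It assumes the whole result set is already in memory as an array of hashes; RESULTS, category_counts and page_of are hypothetical names, and nothing here is Ferret API.

```ruby
# Hypothetical result set: each hit carries its category.
RESULTS = Array.new(120) do |i|
  { id: i, category: i.even? ? "fishing" : "shooting" }
end.freeze

PER_PAGE = 50

# One pass over every hit, keeping a running count per category.
def category_counts(results)
  results.each_with_object(Hash.new(0)) { |hit, counts| counts[hit[:category]] += 1 }
end

# Only the requested page is kept for display; page numbers start at 1.
def page_of(results, page, per_page = PER_PAGE)
  results.slice((page - 1) * per_page, per_page) || []
end

counts = category_counts(RESULTS)
puts "Fishing (#{counts["fishing"]}), Shooting (#{counts["shooting"]})"
puts page_of(RESULTS, 1).size
```

The counts have to come from the full result set, while the page only needs its own slice, which is why the cost of this approach grows with the number of hits and not with the page size.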

Dave

Again, many thanks for replying to my queries. I may go ahead and
implement it this way just to see it working and then when your code is
available implement it that way. It will give us the opportunity to
compare but my suspicion is that the larger the dataset the faster your
approach will be…

Would it be possible to ping me when your code is available?

Thanks

On 7/12/06, Guest [email protected] wrote:

Would it be possible to ping me when your code is available?

Sure. There will be an announcement on this mailing list as well
as the ruby and rails lists.

Dave