Counting occurrences of words in the result set

Hello, I need to be able to count the occurrences of certain terms in the
results.
Currently my setup is Ferret 0.10.1 with bleeding-edge acts_as_ferret (aaf).

results = VoObject.find_by_contents(query, :offset => page, :limit => 20,
                                    :sort => sort_fields)

I use results.total_hits for pagination. This all works really nicely.
However I need to know how many occurrences of certain predefined terms
occur in each result set. For example, in the animals field there can be
“mouse”, “cat”, or “fish”.
A perfect solution would be for the result set to have some extra
attributes like results.cat_hits (that would be amazing). In reality
there need to be counts for 5 different fields.

So is this something that Ferret can do easily?
How do I get Ferret and aaf to produce this data for each search result?
What should I go and investigate?

Best regards
caspar

On Thu, Sep 07, 2006 at 04:58:08PM +0200, Caspar wrote:

in the animals field there can be “mouse”, “cat”, or “fish”.
A perfect solution would be for the result set to have some extra
attributes like results.cat_hits (that would be amazing). In reality
there need to be counts for 5 different fields.

So is this something that Ferret can do easily?
How do I get Ferret and aaf to produce this data for each search result?
What should I go and investigate?

I’d first try to just issue a separate query for each of your special
terms (ANDed with the original query), and take its result count.

Ideally you wouldn’t use find_by_contents for this (because it fetches
results from the db, which you don’t want here), but use something like

VoObject.ferret_index.search(query + " AND cat",…).total_hits
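To sketch that out (untested, and the animals field name plus the term
list are just assumptions on my part):

terms = %w{mouse cat fish}
counts = {}
terms.each do |term|
  # AND each predefined term onto the user's query; we only need the count
  counts[term] = VoObject.ferret_index.search("(#{query}) AND animals:#{term}",
                                              :limit => 1).total_hits
end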

Jens


webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

Hi, okay, we have spent the last few hours trawling through the Ferret API
and have come across lots of promising leads, and many questions.

index_reader.doc_freq(field, term) → integer
Returns the number of documents in which the term term appears in the
field field.

would seem to partly fit the requirements. However, when I have tried to
instantiate a new index_reader like this
reader =
Ferret::Index::IndexReader.new("/home/c/V_O_2/index/development/vo_object/")

and then try to access some of the documents returned by search_each, I
am only able to access the :id field.

Q1: How do you create an index_reader that is able to access your aaf
index?
Q2: How do you actually return the contents of a field?
Q3: How can I combine doc_freq (which seems perfect) with a search to
count the frequency of terms?

Any answers would be brilliant.
Best regards
caspar

Hi Jens,
Thank you for getting back so quickly. I should have given more
information about the problem, though. One of the fields contains about 35
predefined values. I hope there is a more efficient way of producing
these counts or I may well have to drop this functionality from the app.
Any other ideas?
I really appreciate the speed with which people reply on this forum.
Regards
c


Caspar,

I have been trying to get the same thing working for a while but have
never found a solution. It would help greatly if someone has the answer to
this, because I want to add this capability to my search to provide
additional information to the user on the results page.

But I only got the :id from the index also… :(

Any help would be appreciated on this one.

Thanks in advance as always!

Clare

On 9/8/06, Clare [email protected] wrote:

Thanks in advance as always!

By default acts_as_ferret only stores the :id. You need to set the
:store parameter of any other fields that you want stored. Something
like this;

acts_as_ferret :fields => {
  :title   => { :store => :yes },
  :content => { :store => :yes }
}
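Once a field is stored you can read it back straight from the index,
without going to the database. A rough sketch (the :title field is just
an assumption):

VoObject.ferret_index.search_each("cat") do |doc_id, score|
  # stored fields are available directly from the index
  puts VoObject.ferret_index[doc_id][:title]
end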

As for counting the frequency of terms in a resultset,
IndexReader#doc_freq probably won’t work. It counts the frequency
of terms in the whole index, not in the resultset.
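To illustrate the difference (the index path and field name here are
made up):

reader = Ferret::Index::IndexReader.new("/path/to/index")
# counts every document in the whole index that has "cat" in :animals,
# regardless of any query
puts reader.doc_freq(:animals, "cat")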

So back to the problem. Jens gave the solution I would probably use.
Ferret’s searches are fast enough that this solution is quite
feasible for most indexes. Try it. You might be surprised.

The alternative is counting through the resultset. To do this you
will need to set :limit => :all in the search_each method so you get
all results back, then iterate through each result counting the
occurrences (a sketch of this approach follows the example below). For a
huge index / slow query / small resultset this might be faster. Also,
with the new filter_proc method there is another way you can do this
without having to retrieve all results;

require 'rubygems'
require 'ferret'

include Ferret

index = I.new # I is shorthand for Ferret::Index::Index; in-memory by default

words = %w{one two three four five}

# index 100,000 documents, each with a zero-padded id and one random word
100000.times do |i|
  index << {:id => "%05d" % i, :word => words[rand(words.size)]}
end

# tally the stored :word field of every hit as the search runs
counter = Hash.new(0)
filter_proc = lambda do |doc, score, searcher|
  counter[searcher[doc][:word]] += 1
end

resultset = index.search("id:[10000 20000}", :limit => 1,
                         :filter_proc => filter_proc)
puts resultset.total_hits
puts counter.inspect
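For comparison, here is a sketch of the :limit => :all approach mentioned
above, run against the same toy index:

# fetch every hit and tally the stored :word field by hand
counter = Hash.new(0)
index.search_each("id:[10000 20000}", :limit => :all) do |doc, score|
  counter[index[doc][:word]] += 1
end
puts counter.inspect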

Hope that helps,

Dave

Thanks David

I will try both options. I am in fact doing some performance testing now.
I have created a 100,000-record result set and it takes around 5 seconds
(end to end) on my internal server to be returned (with 1 user). I am
only doing 6 significant searches on this set: one for the main results
and one for each of the top-level categories. This is only on my test
server and not the larger production server, and I am happy with this
performance. If however I were to do my second-level category search,
which has around 40 nodes in it, that would be 40 searches. I am not sure
how this would perform.

What I am seeing is CPU-hungry search but not memory-hungry search. This
makes sense to me.

Q - I have test data set up in my tests that has some random junk and
then a word such as “fish” at the end of it. I am starting to think that
I may have set up the test data wrong and should use a lot of different
words in the result set, because I am sure that Ferret will cache the
search. This would give me a false impression of the speed of search.

I will, however, create more test data at the weekend, but my instinct is
that your method outlined above may be faster.

I have 5 top-level categories and this will not change much. Depending
on the search there will be a lot more results in one category than the
rest after the initial search.

Drilling into the second-level categories, the most nodes I have in a
single second-level category is around 40 at the moment, although this is
likely to grow over time. The results again will not be evenly
distributed over the result set, but assuming for now that they were and
I had 500,000 records, and drilled into the second-tier category
structure, I would have 100,000 records in this category. I would be
doing 40 searches over 100,000 records.

Q - What do you think will perform faster in this instance?

I would love to have the time to build an x-dimensional memory-resident
result (bucket set) that kept all the results parameterised for all the
categories, built at the initial time of the search. It would be memory
hungry but would make searching through categories, nodes and parameters
in subsequent searches lightning fast.

Would be a great addition or am I missing something?

I am really interested in the performance testing scenarios. As stated
above, I only have one word, “FISH”, in my test data, with random made-up
text before it, e.g. “sadssderssdaatg FISH” etc.

Q - Would I be better using more words in my test data?

Also - I am interested in the round-trip performance of search: the
length of time it takes from when the user clicks search to when they get
the results back. I will measure this on the production server in the
production environment. My rule of thumb is that it should not take
longer than 8 seconds to return the results or the user will refresh
(even worse for performance). With one user on my test system with 6
searches over 100,000 records, it takes 5 seconds at the moment.

I am expecting a large number of concurrent searches happening. I am
defining concurrency as someone searching at the same time as another
user is either searching or waiting for the results to be returned.

Most testing tools that I can see only show you what is happening on the
server. I am interested in it from the user’s perspective.

I had a thought of setting up a script that would open a number of
browser sessions and do random searches concurrently, hammering the
server to see when it 1) breaks search, 2) breaks something else, or 3)
search goes over the 8-second limit.

Q - Does anyone have any experience in this area? Even better, does
anyone have a script to do this? If not, and I do write a script to do
this, would it be of value to the greater community?

Sorry for the long-winded post. My search page and category search are
the most critical parts of my site, and I am anal about their performance
because if they do not work then my site will not work.

Thanks once again for all your assistance. Sorry for any stupid or
ignorant thoughts/remarks.

Ferret rocks!

Clare

On 9/8/06, Clare [email protected] wrote:

how this would perform.

What I am seeing is CPU-hungry search but not memory-hungry search. This
makes sense to me.

Q - I have test data set up in my tests that has some random junk and
then a word such as “fish” at the end of it. I am starting to think that
I may have set up the test data wrong and should use a lot of different
words in the result set, because I am sure that Ferret will cache the
search. This would give me a false impression of the speed of search.

Firstly, searches don’t get cached; only filters are. If you want to
cache the results from a query (which you would in this instance) then
you should use a QueryFilter (see the sketch a little further down).
Secondly, I’m not sure exactly what you are saying when you say your
tests have some random junk and then the word “fish”. If you are putting
data like this into every document;

index << "asdlgkjhasd askdj asdg asdg asdg asdg lkjh asd fish"

Then you probably should work on your test data. As far as search
performance goes, this will be no different to doing this;

index << "fish"
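As for the QueryFilter mentioned above, a rough sketch (the field and
term names are assumptions):

include Ferret

# a QueryFilter caches the set of matching documents between searches
cat_filter = Search::QueryFilter.new(Search::TermQuery.new(:animals, "cat"))
puts index.search("fish", :filter => cat_filter).total_hits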

What is important to remember is that TermQueries (fish) perform a lot
better than BooleanQueries (fish AND rod) and PhraseQueries (“fishing
rod”), which in turn perform better than WildCardQueries (fi*), so you
should try these query types too.
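A quick way to compare them yourself, timing each query type against the
toy index from the earlier examples (just a sketch):

require 'benchmark'

["fish", "fish AND rod", '"fishing rod"', "fi*"].each do |q|
  secs = Benchmark.realtime { index.search(q, :limit => 1) }
  puts "%-15s %f seconds" % [q, secs]
end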

Here is a much better way to create random strings;

WORDS = %w{one two three}

def random_sentence(min_size, max_size)
  len = min_size + rand(max_size - min_size)
  sentence = []
  len.times { sentence << WORDS[Math.sqrt(rand(WORDS.size * WORDS.size)).to_i] }
  sentence.join(" ")
end

10.times { puts random_sentence(10, 100) }

The Math.sqrt stuff makes sure that words aren’t evenly distributed to
be more realistic. Words appearing later in the WORDS array will be
much more common. Even better than this would be to use a copy of the
real data that you will be using though.

evenly distributed over the result set, but assuming for now that they
were and I had 500,000 records, and drilled into the second-tier category
structure, I would have 100,000 records in this category. I would be
doing 40 searches over 100,000 records.

Q - What do you think will perform faster in this instance?

Impossible to say without testing. Both methods are pretty simple
though so I’d try both with a variety of search strings.

I would love to have the time to build an x-dimensional memory-resident
result (bucket set) that kept all the results parameterised for all the
categories, built at the initial time of the search. It would be memory
hungry but would make searching through categories, nodes and parameters
in subsequent searches lightning fast.

Would be a great addition or am I missing something?

As far as I’m concerned this functionality is already there with the
filter_proc parameter. Make it any less general than this and it isn’t
much use anymore. For example;

require 'rubygems'
require 'ferret'

include Ferret
index = I.new # in-memory Ferret::Index::Index

words = %w{one two three four five}
# index 100,000 documents, each with a zero-padded id and one random word
100000.times do |i|
  index << {:id => "%05d" % i, :word => words[rand(words.size)]}
end

groups = {}

# group matching document ids by their stored :word field as the search runs
filter_proc = lambda do |doc, score, searcher|
  word = searcher[doc][:word]
  (groups[word] ||= []) << doc
end

resultset = index.search("id:[09900 10000}", :limit => 1,
                         :filter_proc => filter_proc)
puts resultset.total_hits
puts groups.inspect
puts groups["two"].size

I really can’t see how you could make it any easier than that.

I am really interested in the performance testing scenarios. As stated
above, I only have one word, “FISH”, in my test data, with random made-up
text before it, e.g. “sadssderssdaatg FISH” etc.

Q - Would I be better using more words in my test data?

See above.

Also - I am interested in the round-trip performance of search: the
length of time it takes from when the user clicks search to when they get
the results back. I will measure this on the production server in the
production environment. My rule of thumb is that it should not take
longer than 8 seconds to return the results or the user will refresh
(even worse for performance). With one user on my test system with 6
searches over 100,000 records, it takes 5 seconds at the moment.

5 seconds seems like a long time. Try optimizing your index and see
how you go then. The example above took 0.028109 seconds. Personally,
I would be worried about anything taking over 1 second, which was the
whole reason I wrote Ferret in C.
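(With acts_as_ferret, optimizing should be something along the lines of

VoObject.ferret_index.optimize

though note it can take a while on a large index.)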

search goes over the 8-second limit.

Q - Does anyone have any experience in this area? Even better, does
anyone have a script to do this? If not, and I do write a script to do
this, would it be of value to the greater community?

If I were you, I’d test plain old search performance before I tested
performance through a browser. And, again, it is pretty hard to
generalize a script like this since so many people have different
search needs. In my opinion, Ruby makes it easy enough to write this
from scratch each time.

Sorry for the long-winded post. My search page and category search are
the most critical parts of my site, and I am anal about their performance
because if they do not work then my site will not work.

Thanks once again for all your assistance. Sorry for any stupid or
ignorant thoughts/remarks.

Ferret rocks!

You’re welcome,

Dave