On 9/8/06, Clare [email protected] wrote:
> how this would perform.
> What I am seeing is a CPU-hungry search but not a memory-hungry one.
> This makes sense to me.
> Q - I have test data set up in my tests that has some random junk in
> it and then a word such as “fish” at the end. I am starting to think
> that I may have set up the test data wrong and should use a lot of
> different words in the result set, because I am sure that Ferret will
> cache the search. This would give me a false impression of the speed
> of search.
Firstly, searches don’t get cached; only filters do. If you want to
cache the results of a query (which you would in this instance) then
you should use a QueryFilter (there is a sketch of this below).
Secondly, I’m not sure exactly what you mean when you say your tests
have some random junk and then the word “fish”. If you are putting
data like this into every document:
index << "asdlgkjhasd askdj asdg asdg asdg asdg lkjh asd fish"
Then you probably should work on your test data. As far as search
performance goes, this will be no different from doing this:
index << "fish"
What is important to remember is that TermQueries (fish) perform a lot
better than BooleanQueries (fish AND rod) and PhraseQueries (“fishing
rod”), which in turn perform better than WildCardQueries (fi*), so you
should try these other query types too.
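To make the QueryFilter suggestion above concrete, here is an untested
sketch; the :category and :title fields are just made-up examples:

require 'rubygems'
require 'ferret'
include Ferret

index = I.new
index << {:category => 'fish', :title => 'rainbow trout'}
index << {:category => 'boats', :title => 'trout trawler'}

# A QueryFilter caches the set of matching documents the first time it
# is used, so reusing the same filter object across searches avoids
# recomputing that set.
fish_filter = Search::QueryFilter.new(index.process_query('category:fish'))
puts index.search('title:trout', :filter => fish_filter).total_hits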
Here is a much better way to create random strings:

WORDS = %w{one two three}

def random_sentence(min_size, max_size)
  len = min_size + rand(max_size - min_size)
  sentence = []
  # Math.sqrt skews the random index towards the end of the WORDS array
  len.times { sentence << WORDS[Math.sqrt(rand(WORDS.size * WORDS.size)).to_i] }
  sentence.join(" ")
end

10.times { puts random_sentence(10, 100) }
The Math.sqrt stuff makes sure that words aren’t evenly distributed,
which is more realistic; words appearing later in the WORDS array will
be much more common. Even better than this would be to use a copy of
the real data that you will actually be searching.
> distributed over the result set, but assuming for now that they were
> and I had 500,000 records, and drilled into the second-tier category
> structure, I would have 100,000 records in this category. I would be
> doing 40 searches over 100,000 records.
> Q - What do you think will perform faster in this instance?
Impossible to say without testing. Both methods are pretty simple,
though, so I’d try both with a variety of search strings.
> I would love to have the time to build an x-dimensional memory-resident
> result (bucket set) that kept all the results parameterised for all the
> categories, built at the initial time of the search. It would be memory
> hungry but it would make searching through categories, nodes and
> parameters in subsequent searches lightning fast.
> Q - Would this be a great addition, or am I missing something?
As far as I’m concerned this functionality is already there with the
filter_proc parameter. Make it any less general than this and it isn’t
much use anymore. For example:
require 'rubygems'
require 'ferret'
include Ferret

index = I.new
words = %w{one two three four five}
100000.times do |i|
  index << {:id => "%05d" % i, :word => words[rand(words.size)]}
end

# Group every matching document by its :word field as the search runs
groups = {}
filter_proc = lambda do |doc, score, searcher|
  word = searcher[doc][:word]
  (groups[word] ||= []) << doc
end

resultset = index.search("id:[09900 10000}", :limit => 1,
                         :filter_proc => filter_proc)
puts resultset.total_hits
puts groups.inspect
puts groups["two"].size
I really can’t see how you could make it any easier than that.
> I am really interested in the performance testing scenarios. As stated
> above, I only have one word “FISH” in my test data, with random made-up
> text before it, e.g. “sadssderssdaatg FISH” etc.
> Q - Would I be better off using more words in my test data?
See above.
> Also - I am interested in the round-trip performance of search: the
> length of time it takes from when the user clicks on search to when
> they get the results back. I will do this on the production server in
> the production environment. My rule of thumb is that it should not
> take longer than 8 seconds to return the results or the user will
> refresh (which is even worse for performance). With one user on my
> test system, 6 searches over 100,000 records take 5 seconds at the
> moment.
5 seconds seems like a long time. Try optimizing your index and see
how you go then. The example above took 0.028109 seconds. Personally,
I would be worried about anything taking over 1 second, which is the
whole reason I wrote Ferret in C.
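Optimizing is a one-liner; here is an untested sketch, with a made-up
index path:

require 'rubygems'
require 'ferret'
include Ferret

# Open the existing index (the path is just an example) and merge all
# of its segments into one, which generally speeds up later searches.
index = I.new(:path => '/path/to/your/index')
index.optimize
index.close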
> search goes over the 8 second limit.
> Q - Does anyone have any experience in this area? Even better, does
> anyone have a script to do this? If not, and I do write a script to
> do this, would it be of value to the greater community?
If I were you, I’d test plain old search performance before I tested
performance through a browser. And, again, it is pretty hard to
generalize a script like this since so many people have different
search needs. In my opinion, Ruby makes it easy enough to write this
from scratch each time.
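A throwaway benchmark need not be more than a few lines anyway; here is
an untested sketch, with a made-up index path and example queries:

require 'rubygems'
require 'ferret'
require 'benchmark'
include Ferret

index = I.new(:path => '/path/to/your/index')

# Time the query types mentioned above: term, boolean, phrase, wildcard
queries = ['fish', 'fish AND rod', '"fishing rod"', 'fi*']

Benchmark.bm(15) do |bm|
  queries.each do |q|
    bm.report(q) { 100.times { index.search(q) } }
  end
end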
> Sorry for the long-winded post. My search page and category search are
> the most critical parts of my site and I am anal about their
> performance, because if they do not work then my site will not work.
> Thanks once again for all your assistance. Sorry for any stupid or
> ignorant thoughts/remarks.
> Ferret rocks!
You’re welcome,
Dave