AAF Sorting by date - what am I doing wrong?

ianzabel · August 30, 2006, 7:51pm

I’m trying to sort my search results by Date, in descending order. I’ve
done quite a bit of reading through the forums here, and I’ve tried two
different suggestions.

This just returns results in the same order as a search without a sort:
sort_fields = []
sort_fields << Ferret::Search::SortField.new("ferret_created_at", :reverse => true)
Comment.find_by_contents("test", :sort => sort_fields, :num_docs => 5)

This also doesn’t affect the order:
Comment.find_by_contents("test", :sort => ["ferret_created_at"], :num_docs => 5)

The following, however, DOES affect the order, but it’s SUPER slow:
Ferret::Search::SortField.new("id", :reverse => true)

Sorting by id desc is really all I need, so if it’s easier to somehow
quickly sort by that, all the better.

Here’s my model:
class Comment < ActiveRecord::Base
  acts_as_paranoid
  acts_as_ferret :fields => [ 'comment', :forum_id, 'mod_type',
                              'user_id', 'ferret_created_at' ]
  […]
  def ferret_created_at
    created_at.strftime("%Y%m%d%H%M")
  end
  […]
end

Any ideas as to what I’m doing wrong, or how to get this to work?
Thanks!
Ian.

ianzabel · August 30, 2006, 11:58pm

Hi Ian,

Which versions of aaf and Ferret are you using?

I’d suggest you try out aaf trunk and Ferret 0.10.1; sorting seems to
work there (I can’t say anything about speed, other than 0.10 being
faster in general). I have never used sorting with aaf myself, but as far
as I recall some people on this list have, so we should get this working.

I added a test case to aaf that sorts by :id
(http://projects.jkraemer.net/acts_as_ferret/browser/trunk/demo/test/unit/content_test.rb
line 94).

If sorting by :id works and another field doesn’t, maybe it’s because of
different field storage options (the docs suggest untokenized indexing
for fields you want to sort by, which is true for :id by default, but
false for other fields)

Jens

–
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

ianzabel · August 31, 2006, 10:23pm

To sort on a field, it must be stored as untokenized.

When I’m sorting on dates, I actually convert them to epoch seconds,
then sort on that integer. I’m not sure whether this is really any faster
than sorting on strings, but I suspect it may be.
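
A minimal sketch of the two kinds of sort key discussed in this thread, in plain Ruby with no Ferret involved. The helper names string_sort_key and epoch_sort_key are hypothetical, and converting to UTC first is an assumption made here so that keys stay comparable across time zones:

```ruby
# Hypothetical helpers for illustration; not part of Ferret or acts_as_ferret.

# Fixed-width timestamp string, in the style of the model earlier in the thread.
# getutc returns a new Time and leaves the receiver untouched.
def string_sort_key(time)
  time.getutc.strftime("%Y%m%d%H%M")
end

# Whole seconds since the Unix epoch, as suggested above.
def epoch_sort_key(time)
  time.to_i
end

older = Time.utc(2006, 8, 30, 19, 51)
newer = Time.utc(2006, 9, 4, 14, 44)

# Both keys order the two timestamps the same way.
puts string_sort_key(older) < string_sort_key(newer)
puts epoch_sort_key(older) < epoch_sort_key(newer)

# Newest first, which is what a :reverse sort is meant to produce.
newest_first = [older, newer].sort_by { |t| epoch_sort_key(t) }.reverse
puts newest_first.first == newer
```

Note that the string key only has minute resolution, so two records created within the same minute get the same key, whereas the epoch key distinguishes them down to the second.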

-ryan

ianzabel · September 1, 2006, 12:02am

Thanks for the responses, guys.

Jens, I’m using aaf trunk. I also have ferret 0.10.1 installed, so I’m
assuming that aaf will use that instead of the 0.9 version that is also
installed.

I’m not sure why sorting by :id is so slow. It takes like 60 seconds or
more to return a query sorted by id, and only like 0.5 seconds when not
sorted. Weird.

And, ryan, it looks like I’m not storing the ferret_created_at field as
untokenized, so that must be my problem. I’ll have to make that
untokenized and reindex (booooo).

Thanks again
Ian

ianzabel · September 2, 2006, 5:19am

On 9/1/06, Ian Z. wrote:

I’m not sure why sorting by :id is so slow. It takes like 60 seconds or
more to return a query sorted by id, and only like 0.5 seconds when not
sorted. Weird.

Hi Ian,

Try optimizing the index.

Sorting results by a field will naturally take a little longer than
sorting the results by relevance, because a sort index needs to be built
for that field. Once the sort index is built, it is cached for the
IndexReader, so future sorts should be almost as fast as getting unsorted
results.
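
The build-once, reuse-afterwards behaviour described above can be pictured with a toy model. This is only an illustration in plain Ruby and is not how Ferret is implemented; the ToyReader class and its methods are invented for the sketch:

```ruby
# Toy model only: NOT Ferret's implementation. It mimics a reader that builds
# a per-field sort index on first use and reuses it for later sorts.
class ToyReader
  attr_reader :builds

  def initialize(docs)
    @docs = docs        # array of hashes, one per document
    @sort_cache = {}    # field name => array of values, indexed by doc id
    @builds = 0         # how many times a sort index had to be built
  end

  # Returns doc ids ordered by the given field.
  def sorted_doc_ids(field)
    values = (@sort_cache[field] ||= build_sort_index(field))
    (0...@docs.size).sort_by { |doc_id| values[doc_id] }
  end

  private

  # The expensive step: one pass over every document for this field.
  def build_sort_index(field)
    @builds += 1
    @docs.map { |doc| doc[field] }
  end
end

reader = ToyReader.new([{ :id => 30 }, { :id => 10 }, { :id => 20 }])
first  = reader.sorted_doc_ids(:id)
second = reader.sorted_doc_ids(:id)

puts first.inspect   # [1, 2, 0]
puts reader.builds   # 1
```

In this model a new reader starts with an empty cache, which is the analogue of the first sort after reopening the index being slow again.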

To build the sort index, Ferret needs to iterate through all the terms in
the index. This takes significantly longer for unoptimized indexes.
Here is a quick benchmark you can try running:

require 'ferret'

include Ferret

words = %w{one two three four five six seven eight nine ten}

i = I.new

# Build an unoptimized index of 100,000 small documents.
start_time = Time.now
100000.times { i << {:id => rand(1000000), :content => words[rand(10)]} }
puts "Building index took #{Time.now - start_time} seconds"

start_time = Time.now
i.search("one", :sort => :id)
puts "Sort by integer took #{Time.now - start_time} seconds the first time"

start_time = Time.now
i.search("one", :sort => :id)
puts "Sort by integer took #{Time.now - start_time} seconds the second time"

i.__send__(:ensure_writer_open) # get rid of sort cache

start_time = Time.now
i.search("one", :sort => [Ferret::Search::SortField.new(:id, :type => :byte)])
puts "Sort by bytes took #{Time.now - start_time} seconds the first time"

start_time = Time.now
i.search("one", :sort => [Ferret::Search::SortField.new(:id, :type => :byte)])
puts "Sort by bytes took #{Time.now - start_time} seconds the second time"

# Optimize, then repeat the same four searches.
puts "\nOPTIMIZING THE INDEX\n"
start_time = Time.now
i.optimize
puts "Optimizing the index took #{Time.now - start_time} seconds"

start_time = Time.now
i.search("one", :sort => :id)
puts "Sort by integer took #{Time.now - start_time} seconds the first time"

start_time = Time.now
i.search("one", :sort => :id)
puts "Sort by integer took #{Time.now - start_time} seconds the second time"

i.__send__(:ensure_writer_open) # get rid of sort cache

start_time = Time.now
i.search("one", :sort => [Ferret::Search::SortField.new(:id, :type => :byte)])
puts "Sort by bytes took #{Time.now - start_time} seconds the first time"

start_time = Time.now
i.search("one", :sort => [Ferret::Search::SortField.new(:id, :type => :byte)])
puts "Sort by bytes took #{Time.now - start_time} seconds the second time"

And here are the results on my system:

Building index took 36.131648 seconds
Sort by integer took 15.39588 seconds the first time
Sort by integer took 0.002627 seconds the second time
Sort by bytes took 15.889957 seconds the first time
Sort by bytes took 0.001914 seconds the second time

OPTIMIZING THE INDEX
Optimizing the index took 0.639831 seconds

Sort by integer took 0.170887 seconds the first time
Sort by integer took 0.001423 seconds the second time
Sort by bytes took 0.029054 seconds the first time
Sort by bytes took 0.001424 seconds the second time

So optimizing the index before sorting should help a lot.

Cheers,
Dave

ianzabel · September 4, 2006, 3:40am

Thanks for all the help, everyone.

I am now using this statement in my model:

acts_as_ferret :fields => {
  'comment' => {},
  :forum_id => {:index => :untokenized},
  'mod_type' => {:index => :untokenized},
  'user_id' => {:index => :untokenized},
  'ferret_created_at' => {:index => :untokenized}
}

I rebuilt the index, and sorting now seems to work properly with both
“ferret_created_at” and “id”, like so

sort_fields = []
sort_fields << Ferret::Search::SortField.new("ferret_created_at", :reverse => true)
or
sort_fields << Ferret::Search::SortField.new("id", :reverse => true)
Comment.find_by_contents("test", :sort => sort_fields, :limit => 5)

Sorting by id is now MUCH faster, as well.

The only thing I notice now is that the index is MUCH larger. The index
is now about 91MB, whereas before I changed the aaf settings for the
model, it was about 20MB. I guess untokenized values take up a lot more
space?

Thanks again!
Ian.

ianzabel · September 4, 2006, 2:44pm

On Mon, Sep 04, 2006 at 01:25:52PM +0900, David B. wrote:

On 9/4/06, Ian Z. wrote:
[…]
:ferret_created_at => {:index => :untokenized, :store => :no,
:term_vectors => :no}

:store => :no is already the default used by acts_as_ferret, so there is
no need to specify it explicitly.

Term vectors are stored :with_positions_offsets by default, so turning
them off might help a bit.

Jens


ianzabel · September 4, 2006, 2:44pm

On 9/4/06, Ian Z. wrote:

sort_fields = []
sort_fields << Ferret::Search::SortField.new("ferret_created_at", :reverse => true)
or
sort_fields << Ferret::Search::SortField.new("id", :reverse => true)
Comment.find_by_contents("test", :sort => sort_fields, :limit => 5)

Sorting by id is now MUCH faster, as well.

Great to hear.

The only thing I notice now is that the index is MUCH larger. The index
is now about 91MB, whereas before I changed the aaf settings for the
model, it was about 20MB. I guess untokenized values take up a lot more
space?

That can be correct, but it is surprising for your schema. For example,
imagine the following six documents:

"one two three" (13-bytes)
"one three two"
"two three one"
"two one three"
"three one two"
"three two one"

If you tokenized the fields you’d have three terms, “one” (3 bytes),
“two” (3 bytes) and “three” (5 bytes), and each term would use six bytes
to store the doc_ids of the documents it occurs in. So you’d have 3 +
3 + 5 + 3*6 = 29 bytes. Storing the fields as untokenized would take
13 bytes per field plus 1 byte to signify the document each field
occurs in, which would be (13 + 1) * 6 = 84 bytes. Of course, this is a
simplification of what is really going on. There is a lot of
compression happening, and a lot of other data is stored as well, such as
term positions, term frequencies and term vectors, as well as the
data itself.
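
The arithmetic above can be checked with a few lines of plain Ruby. This only reproduces the simplified model from this post (term bytes plus one byte per document reference) and says nothing about Ferret's real on-disk format; the variable names are made up for the sketch:

```ruby
docs = [
  "one two three", "one three two", "two three one",
  "two one three", "three one two", "three two one"
]

# Tokenized: each distinct term is stored once, plus one byte per
# (term, document) pair for the doc id.
postings = Hash.new { |hash, term| hash[term] = [] }
docs.each_with_index do |doc, doc_id|
  doc.split.uniq.each { |term| postings[term] << doc_id }
end
tokenized_bytes = postings.sum { |term, doc_ids| term.bytesize + doc_ids.size }

# Untokenized: the whole field value is a single term, plus one byte
# for the document it occurs in.
untokenized_bytes = docs.sum { |doc| doc.bytesize + 1 }

puts tokenized_bytes    # 29
puts untokenized_bytes  # 84
```

The gap grows with the number of documents, because every untokenized value is a distinct term while the tokenized terms are shared.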

Now, if you want to save space, there are a few other parameters you
can set. You can start by discarding :term_vectors. These are used for
excerpts and match highlighting but are unnecessary in most cases.
Also, there is no need to store all your data. Often, the only fields
you’ll want to store are the model IDs. If you aren’t referencing the
field in the document from the Ferret index, don’t bother storing it.
So, for example, :ferret_created_at could be:

:ferret_created_at => {:index => :untokenized, :store => :no, :term_vectors => :no}

Note also I recommend always using Symbols for your field names rather
than Strings.
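
Putting the suggestions from this thread together, the field options for Ian's model might be built roughly as below. This is a sketch under assumptions: the sort_only_options helper and the COMMENT_FERRET_FIELDS constant are hypothetical names, and only the option keys and values quoted in this thread are used. The hash is plain Ruby, so it can be inspected without Rails:

```ruby
# Hypothetical helper: returns a fresh options hash per field, so no two
# fields share (and accidentally mutate) the same hash object.
def sort_only_options
  { :index => :untokenized, :store => :no, :term_vectors => :no }
end

# Field names are the ones from the model in this thread, written as Symbols.
COMMENT_FERRET_FIELDS = {
  :comment           => {},
  :forum_id          => sort_only_options,
  :mod_type          => sort_only_options,
  :user_id           => sort_only_options,
  :ferret_created_at => sort_only_options
}

# In the model this would presumably be passed along as:
#   acts_as_ferret :fields => COMMENT_FERRET_FIELDS
COMMENT_FERRET_FIELDS.each do |name, options|
  puts "#{name.inspect} => #{options.inspect}"
end
```

The :comment field keeps its defaults here, since it is the one field that is actually searched as tokenized text.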

Cheers,
Dave