Scalable alternative to #find_all

Pedro_CSSSSrte-Real · July 5, 2006, 12:36pm

Is there any scalable alternative to iterating all the records of a
certain model? #find_all seems to load everything into memory. With
500.000 records it will be a swap storm.

Pedro.

Pedro_CSSSSrte-Real · July 5, 2006, 1:13pm

Pedro Côrte-Real wrote:

Is there any scalable alternative to iterating all the records of a
certain model? #find_all seems to load everything into memory. With
500.000 records it will be a swap storm.

Pedro.

Rails mailing list
[email protected]
http://lists.rubyonrails.org/mailman/listinfo/rails
It is hard to answer this question without knowing what exactly you
want to do with the results.

Matthew M.
blog.mattmargolis.net

Pedro_CSSSSrte-Real · July 5, 2006, 1:16pm

On 7/5/06, Matthew M. [email protected] wrote:

It is hard to answer this question without knowing what exactly you
want to do with the results.

I’m trying to fix acts_as_ferret#rebuild_index to not consume a bunch
of memory when rebuilding a large index. The code is very simple, it
just iterates all the model objects adding them to the ferret index.
They’re only used once and can then be forgotten.

Before I used acts_as_ferret I had my own indexing solution. What I
did in my rebuild_index was to fetch 1000 records, index them, and
then fetch the next 1000. But this is probably wrong if we have
concurrent modifications.

Pedro.

Pedro_CSSSSrte-Real · July 5, 2006, 2:29pm

Pedro CÃ´rte-Real wrote:

The code is very simple, it
just iterates all the model objects adding them to the ferret index.
They’re only used once and can then be forgotten.

It sounds as though you should not be creating model objects in the
first place. This is an unnecessary overhead. I am not familiar with low
level SQL access from Ruby but I’d start with the ActiveRecord source
code the find_by_sql method. Look for the code that loads up record
objects and make a copy, adding whatever you need to load your index
instead.

Julian

Pedro_CSSSSrte-Real · July 5, 2006, 2:57pm

On 7/5/06, Julian G. [email protected] wrote:

It sounds as though you should not be creating model objects in the
first place. This is an unnecessary overhead. I am not familiar with low
level SQL access from Ruby but I’d start with the ActiveRecord source
code the find_by_sql method. Look for the code that loads up record
objects and make a copy, adding whatever you need to load your index
instead.

The index fields are taken from the model and aren’t necessarily
database fields, some might be model methods that calculate something.
When doing the initial data import the indexing works fine by indexing
each record as it is sent to the database. Reindexing is slow because
of all the wasted memory of #find_all. Creating model objects is a
small overhead and it’s required to respect DRY.

Pedro.

Pedro_CSSSSrte-Real · July 5, 2006, 3:32pm

On 7/5/06, Wilson B. [email protected] wrote:

You probably want select_all.
ActiveRecord::Base.connection.select_all “select a, b, c, d from blah
where x = 1”

I index based on the actual model objects and not the direct content
in the database, so this won’t help me directly. I guess I could use
it to fetch the list of id’s in a single query and then fetch the
model objects for it in blocks.

Pedro.

Pedro_CSSSSrte-Real · July 5, 2006, 3:01pm

On 7/5/06, Pedro Côrte-Real [email protected] wrote:

did in my rebuild_index was to fetch 1000 records, index them, and
then fetch the next 1000. But this is probably wrong if we have
concurrent modifications.

You probably want select_all.
ActiveRecord::Base.connection.select_all “select a, b, c, d from blah
where x = 1”

http://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/DatabaseStatements.html#M000622

Pedro_CSSSSrte-Real · July 5, 2006, 4:17pm

On 7/5/06, Tom W. [email protected] wrote:

I index based on the actual model objects and not the direct content
in the database, so this won’t help me directly. I guess I could use
it to fetch the list of id’s in a single query and then fetch the
model objects for it in blocks.

You could always use :limit and :offset in your find :all query, and
manually paginate through your active record objects, e.g:

.find :all, :limit => 500, :offset => 0 # first 500
.find :all, :limit => 500, :offset => 500 # second 500

Yes, and that’s what I did before but is this run inside a transaction
or will this break if concurrent modifications are hapenning?

Pedro.

Pedro_CSSSSrte-Real · July 5, 2006, 4:04pm

On 7/5/06, Pedro CÃ´rte-Real [email protected] wrote:

On 7/5/06, Wilson B. [email protected] wrote:

You probably want select_all.
ActiveRecord::Base.connection.select_all “select a, b, c, d from blah
where x = 1”

I index based on the actual model objects and not the direct content
in the database, so this won’t help me directly. I guess I could use
it to fetch the list of id’s in a single query and then fetch the
model objects for it in blocks.

You could always use :limit and :offset in your find :all query, and
manually paginate through your active record objects, e.g:

.find :all, :limit => 500, :offset => 0 # first 500
.find :all, :limit => 500, :offset => 500 # second 500

Tom

Pedro_CSSSSrte-Real · July 5, 2006, 5:46pm

On 7/5/06, Wilson B. [email protected] wrote:

records 500 at a time.
Actually, that should probably be SomeModel.count-1, but it shouldn’t matter.

Yep, this is basically what I had before but with the transaction
protecting it. Didn’t know about this method. Thanks.

Pedro.

Pedro_CSSSSrte-Real · July 5, 2006, 5:20pm

On 7/5/06, Pedro Côrte-Real [email protected] wrote:

.find :all, :limit => 500, :offset => 500 # second 500

Yes, and that’s what I did before but is this run inside a transaction
or will this break if concurrent modifications are hapenning?

You could do it this way:
SomeModel.transaction do
batch_size = 500
0.step(SomeModel.count, batch_size) do |i|
batch = self.find :all, :limit => batch_size, :offset => i
# do something with batch here.
end
end

This will wrap the whole thing in a transaction, and then fetch the
records 500 at a time.
Actually, that should probably be SomeModel.count-1, but it shouldn’t
matter.