True ActiveRecord result set iteration


#1

Hello,

For an internal project we were looking for a way to extend ActiveRecord
with a true iterator. Using find(:all) is no fun if you have to process
some 10,000 records.

Until now there have been two ways of dealing with that scenario:

  • write your logic a second time (e.g. in a stored procedure)
  • bet on AR plugins that work around the limitation by fetching IDs and
    processing them in groups, or by using :offset and :limit

Rewriting logic is something we wanted to avoid, and the plugins don’t
fully respect transactional contexts. So we started to implement our own
true iterator support for AR. Our project runs on Oracle, but in the
meantime we have added support for MySQL; other adapters can probably be
extended easily, too. We also tried JRuby 1.1.x, which is sometimes
faster than Ruby 1.8.6, but a patch is needed to bring the Java part of
the connection adapter into shape for result set iteration.

Okay, you’re about to ask: how does it work? Here we go:

MyTable.find_each_by_sql("some SQL here") { |my_table| ... }

MyTable.find_each(:all, :conditions => ...,
                  :include => ...) { |my_table| ... }
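To illustrate the idea (not the actual plugin code – the class and the row source here are hypothetical stand-ins), the pattern boils down to instantiating one model object per fetched row and yielding it, so nothing but the current record is retained:

```ruby
# Minimal sketch of the one-object-at-a-time pattern. `each_row_streamed`
# stands in for an adapter-level unbuffered fetch; here it just yields
# three hard-coded rows.
class MyTable
  attr_reader :attributes

  def initialize(attributes)
    @attributes = attributes
  end

  def self.each_row_streamed(sql)
    [{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }].each { |row| yield row }
  end

  # Instantiate exactly one object per row, yield it to the block, and let
  # it become garbage once the block returns -- memory use stays constant.
  def self.find_each_by_sql(sql)
    each_row_streamed(sql) { |row| yield new(row) }
  end
end

ids = []
MyTable.find_each_by_sql("SELECT * FROM my_tables") { |t| ids << t.attributes["id"] }
ids  # => [1, 2, 3]
```

The real plugin does the streaming at the adapter level (Oracle/MySQL), but the yield-and-discard shape is the same.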

Attached you find the magic code, which can be used as a plugin for
Rails. When testing, please keep in mind that only Oracle and MySQL are
fully supported. JDBC will take lots of RAM for large result sets until
you have patched the JdbcConnectionAdapter.

Some figures with JRuby:
I’ve tested the code on an export of ~80,000 customer data records.
Originally I couldn’t run the export with less than 2 GB of heap space
(JRuby 1.1.4, without extensive garbage collection). After patching the
connection adapter, it works with less than 128 MB of heap space
(JRuby 1.1).

I’d be happy if our idea were picked up and AR got these iterator
methods built in. I’ve seen lots of people asking for exactly this
behavior. It’s possible to implement, it’s easy to implement, and IMHO
it doesn’t break the AR metaphor.

If you like the idea and want to send feedback, please CC me. I’m not
subscribed to the list.

Regards,
Andreas

OTTO Software Partner GmbH, Freiberger Str. 35, 01067 Dresden, Germany
Andreas G. (e-mail: removed_email_address@domain.invalid)
Tel. +49-351-49723-140 - Fax: +49-351-49723-119

AG Dresden, HRB 2475
Geschäftsführer Burkhard Arrenberg, Jens Gruhl


#2

On Nov 28, 4:15 pm, Andreas G. removed_email_address@domain.invalid wrote:

I’d be happy if our idea would be picked up and AR would get these iterator
methods integrated. I’ve seen lots of people asking for exactly this
behavior. It’s possible to implement, it’s easy to implement, and IMHO it
doesn’t break the AR metaphor.

You might want to discuss this on the rubyonrails-core list. It’s
where people who want to discuss development of rails itself hang out.

Fred


#3

On Sat, Nov 29, 2008 at 8:14 AM, Andreas G. removed_email_address@domain.invalid wrote:

MyTable.find_each_by_sql("some SQL here") { |my_table| ... }

MyTable.find_each(:all, :conditions => ...,
                  :include => ...) { |my_table| ... }

Any chance of having these work by scope/proxy, instead? For instance:

MyTable.all.each do |record|

end

Or, something like this, with named scopes:

User.registered.customers.each do |user|

end

I think this already works, actually – it just implicitly converts the
AssociationProxy to an array. In other words, it slurps the entire
result set. It would be nice if this behavior could be made to
transparently (or optionally) do what you’ve implemented, rather than
inventing a whole new interface – for example, find_each doesn’t help
when you want to use other things from Enumerable, like select/reject,
inject, etc.
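A sketch of that direction (the ResultScope class is hypothetical, and rows are stubbed with a plain array): once a scope defines `each` and includes Enumerable, select, inject and friends come for free with no new interface, while records can still be produced one at a time:

```ruby
# Hypothetical scope object: `each` is the only iteration method we
# define; Enumerable supplies select, reject, inject, any?, all?, etc.
class ResultScope
  include Enumerable

  def initialize(rows)
    @rows = rows  # stands in for a server-side cursor
  end

  def each
    @rows.each { |row| yield row }  # hand over one record at a time
  end
end

scope = ResultScope.new([1, 2, 3, 4])
scope.select(&:even?)  # => [2, 4]
scope.inject(:+)       # => 10
```

Whether the retained memory stays small then depends on which Enumerable method is called, which is exactly the concern raised in the following replies.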


#4

David,

I don’t think it would be possible to support the enumerable interface
in this extension, because it would negate the memory management
benefits that are the reason for using such a technique. Looking at the
code, it looks like Andreas instantiates one ruby object at a time from
the result set, then yields that object to the block for work to be
done on it, and frees the object after the yield returns (it’s a little
fancier than this with the Oracle extension, but in the MySQL version
that’s about it). If you were to allow for things like select/reject,
inject, etc., then you’re back in a situation where some potentially
large subset (or all) of those ruby objects are going to be held in
memory.

As for named scopes, like you said, I think that just works. The named
scopes just insert their additional magic into the final query that
gets made (adding conditions, joins, etc.).

Cheers,
Andrew


#5

I understand – it would be a case where you do not want to deliberately
cache the result set, and so it would not be quite like an association
proxy.

But of course, each is all that’s needed for things like
select/reject/inject. And inject, in particular, should be safe –
select/reject, any?, all?, and similar things will still benefit from
not holding all items in memory, even if all items have to be iterated
over anyway.

In fact, it’s my understanding that this is the main reason for having
an enumerable interface – when using such an interface, you’re not
required to know or care whether it’s a native array, the result of
slurping the entire table, one select call per iteration, or anything
in between.

Of course, there are also things like sort, uniq, and map that would
ruin your whole day – though map, at least, could be implemented in a
lazier way for this purpose.
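One way such a lazier map could be sketched, using Ruby’s Enumerator so transformed items are produced on demand rather than collected up front (the helper name is made up for illustration):

```ruby
# Hypothetical lazy_map: returns an Enumerator that applies the block as
# items are requested, instead of building the whole mapped array first.
def lazy_map(enum, &blk)
  Enumerator.new do |yielder|
    enum.each { |item| yielder << blk.call(item) }
  end
end

# Works even over an unbounded source, because only what is asked for
# ever gets computed.
doubled = lazy_map(1..Float::INFINITY) { |x| x * 2 }
doubled.first(3)  # => [2, 4, 6]
```

Applied to a streaming result set, the same shape would let map-style pipelines run without holding every transformed record at once.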


#6

On 2 Dec 2008, at 15:55, acechase wrote:

If you were to allow for things like select/reject, inject, etc then
you’re back in a situation where some potentially large subset (or all)
of those ruby objects are going to be held in memory.

I don’t think that’s necessarily true.

Consider for example

(1...10000000).select {|x| x == 100000}

This range contains ten million elements, but executing that statement
doesn’t require having all of those in memory at any given time (keep
an eye on ruby’s memory consumption while this runs).

On the other hand, this would: (1...10000000).to_a.select {|x| x == 100000}

Of course, if the result set you are returning is large, e.g.
(1...10000000).to_a.select {|x| x % 2 == 0 }, then that will take a big
chunk of memory, but that seems fair enough.

Fred


#7

Hey Andreas,

I just quickly went through your changes and it looks like a good
start. A few things:

  • Patch is missing tests
  • Those JDBC patches don’t really belong to Rails :)

Also, it’ll really help if you could join the mailing list while we
talk about your patch. Most people are likely to forget to CC you. But
if you don’t want to join our lil group, that’s not a big deal :)

Thanks for working on this.

On Sat, Nov 29, 2008 at 3:14 PM, Andreas G. removed_email_address@domain.invalid wrote:

MyTable.find_each_by_sql("some SQL here") { |my_table| ... }


Cheers!


#8

The one problem I see with this approach is that it does not allow for
the eager-loading of polymorphic associations. In fact, as the code
stands now I couldn’t get vanilla :include options to work,
only :joins (but I think that’s just a little bug in the parsing of
the options hash).

The problem is, when Rails uses the eager include approach rather than
the join approach it looks through the entire result set to see which
associations need to be queried and then grabs those associations in
separate queries. This is relatively efficient because the object
graph is being built out for all objects at once; my guess is that
doing the same thing one object at a time would be very difficult to
do efficiently.

As an alternative approach, the project I work on uses a home-grown
version of what Andreas mentions as one of the two options – we fetch
IDs and process them in groups. The one thing I really like about that
approach is all the same finder options still work as expected with no
additional programming required. IMO losing eager includes of
polymorphic associations is a non-starter.
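The batched-IDs alternative described above can be sketched like this (hypothetical helper; the ID list and per-batch find are stubbed with a plain hash instead of real queries):

```ruby
# Hypothetical find_in_id_batches: fetch the full ID list once, then load
# and yield records in fixed-size groups. A real version would run a
# normal finder per batch, so :include/:joins etc. keep working unchanged.
RECORDS = { 10 => "a", 11 => "b", 12 => "c", 13 => "d", 14 => "e" }  # stub table

def find_in_id_batches(ids, batch_size)
  ids.each_slice(batch_size) do |batch|
    # stand-in for Model.find(:all, :conditions => { :id => batch })
    batch.map { |id| RECORDS[id] }.each { |record| yield record }
  end
end

seen = []
find_in_id_batches(RECORDS.keys.sort, 2) { |r| seen << r }
seen  # => ["a", "b", "c", "d", "e"]
```

The trade-off versus the cursor approach is that each batch is a separate query, so it does not run as one transactionally consistent read.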

just my 2 cents :)

-ac


#9

On Tuesday 02 December 2008, Pratik wrote:

  • Patch is missing tests
  • Those JDBC patches don’t really belong to Rails :)

Well, it’s just that people would wonder about the JDBC behavior. You
risk an OOM-killed process if you work without the patches on large
sets of data. If the adapter maintainers picked up the changes, the
adapter would work out of the box.

Reading the other comments about the incompleteness of the #find_each
part, I wonder if I’d better focus on the first part for now.
#find_each_by_sql is pretty straightforward. Probably the other part
can be improved; let me see.

In general I see the main benefit in the impressively decreased memory
needs when working with large result sets. It’s still a single
transaction, which is what I missed in the other plugins that work
around the memory limitation with multiple selects.

Also, it’ll really help if you could join the mailing list while we
talk about your patch. Most of the people are likely to forget CCing
you. But if you don’t want to join our lil group, that’s not a big
deal :)

You’re right. ;)

I hope to find some more time over the Xmas holidays to put some more
effort into the current implementation. I’ll certainly get back to you
then.
