Forum: Ruby on Rails True ActiveRecord result set iteration

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
Andreas G. (Guest)
on 2008-11-28 18:16
(Received via mailing list)
Attachment: active_record_iterator.tgz (0 Bytes)
Hello,

For an internal project we were looking for a solution which extends
ActiveRecord by an iterator function. Using find(:all) is no fun if you
have to process some 10.000 records.

Until now there have been two ways of dealing with that scenario:
- write your logic a second times (e.g. use stored procedure)
- bet on AR plugins which work around that limitation by fetching IDs
and
processing them in groups, or by using :offset and :limit

Rewriting logic is something we wanted to avoid, and the plugins don't
fully
respect transactional contexts. So we started to implement our own true
iterator support for AR. Our project is on Oracle, but in the meantime
we
have added support for MySQL. Probably other adapters can be extended
easily, too. We also tried JRuby 1.1.x, which is sometimes faster than
Ruby
1.8.6, but a patch is needed to bring the Java part of the connection
adapter into shape for a result set iteration.

Okay, you're about to ask: how does it work. Here we go:

MyTable.find_each_by_sql("some SQL here") { |my_table| ... }

MyTable.find_each(:all, :conditions => ...,
   :include => ...) { |my_table| ... }

Attached you find the magic code which can be used as a plugin for
Rails.
When testing, please keep in mind that only Oracle and MySQL is fully
supported. JDBC will take lots of RAM for large result sets until you
have
patched the JdbcConnectionAdapter.

Some figures with JRuby:
I've tested the code for an export of ~80.000 customer data records.
Originally I couldn't run the export with heap space less than 2 GB
(JRuby
1.1.4 without extensive garbage collection). After having patched the
connection adapter, it works with less than 128 MB heap space (JRuby
1.1).


I'd be happy if our idea would be picked up and AR would get these
iterator
methods integrated. I've seen lots of people asking for exactly this
behavior. It's possible to implement, it's easy to implement, and IMHO
it
doesn't break the AR metaphor.

If you like the idea and want to send feedback, please CC me. I'm not
subscribed to the list.

Regards,
Andreas
--
OTTO Software Partner GmbH, Freiberger Str. 35, 01067 Dresden, Germany
Andreas G. (e-mail: removed_email_address@domain.invalid)
Tel. +49-351-49723-140  -  Fax: +49-351-49723-119

AG Dresden, HRB 2475
Geschäftsführer Burkhard Arrenberg, Jens Gruhl
Frederick C. (Guest)
on 2008-11-28 18:23
(Received via mailing list)
On Nov 28, 4:15 pm, Andreas G. <removed_email_address@domain.invalid> wrote:
>
> I'd be happy if our idea would be picked up and AR would get these iterator
> methods integrated. I've seen lots of people asking for exactly this
> behavior. It's possible to implement, it's easy to implement, and IMHO it
> doesn't break the AR metaphor.

You might want to discuss this on the rubyonrails-core list. It's
where people who want to discuss development of rails itself hang out.

Fred
David M. (Guest)
on 2008-12-02 16:50
(Received via mailing list)
On Sat, Nov 29, 2008 at 8:14 AM, Andreas G. 
<removed_email_address@domain.invalid> wrote:

>
> MyTable.find_each_by_sql("some SQL here") { |my_table| ... }
>
> MyTable.find_each(:all, :conditions => ...,
>   :include => ...) { |my_table| ... }
>
>
Any chance of having these work by scope/proxy, instead? For instance:

MyTable.all.each do |record|
  ...
end

Or, something like this, with named scopes:

User.registered.customers.each do |user|
  ...
end

I think this already works, actually -- it just implicitly converts the
AssociationProxy to an array. In other words, it slurps the entire
result
set. It would be nice if this behavior could be made to transparently
(or
optionally) do what you've implemented, rather than inventing a whole
new
interface -- for example, find_each doesn't help when you want to use
other
things from Enumerable, like select/reject, inject, etc.
acechase (Guest)
on 2008-12-02 17:56
(Received via mailing list)
David,

I don't think it would be possible to support the enumerable interface
in this extension because it would negate the memory management
benefits that are the reason for using such a technique. Looking at
the code, it looks like Andreas instantiates one ruby object at a time
from the result set, then yields that object to the block for work to
be done on it and frees the object after the yield returns (it's a
little fancier than this with the Oracle extension, but in the MySQL
version that's about it). If you were to allow for things like select/
reject, inject, etc then you're back in a situation where some
potentially large subset (or all) of those ruby objects are going to
be held in memory.

As for named scopes, like you said, I think that just works. The named
scopes just insert there additional magic into the final query that
gets made (adds additional conditions, joins, etc).

Cheers,
Andrew
Frederick C. (Guest)
on 2008-12-02 18:12
(Received via mailing list)
On 2 Dec 2008, at 15:55, acechase wrote:

> version that's about it). If you were to allow for things like select/
> reject, inject, etc then you're back in a situation where some
> potentially large subset (or all) of those ruby objects are going to
> be held in memory.

I don't thing that's necessarily true.

Consider for example

(1..10000000).select {|x| x == 100000}

This range contains ten million elements, but executing that statement
doesn't require having all of those in memory at any given time (keep
an eye on ruby's memory consumption while this runs)

On the other hand this would (1..10000000).to_a.select {|x| x == 100000}

Of course if the result set you are returning is large eg
(1..10000000).to_a.select {|x| x % 2 == 0 } then that will take a big
chunk of memory but that seems fair enough.

Fred
David M. (Guest)
on 2008-12-02 23:16
(Received via mailing list)
I understand -- it would be a case where you do not want to deliberately
cache the resultset, and so, it would not be quite like an association
proxy.

But of course, each is all that's needed for things like
select/reject/inject. And inject, in particular, should be safe --
select/reject, any?, all?, and similar things will still benefit from
not
_holding_ all items in memory, even if all items will have to be
iterated
over anyway.

In fact, it's my understanding that this is the main reason for having
an
enumerable interface -- when using such an interface, you're not
required to
know or care whether it's a native array, the result of slurping the
entire
table, one select call per iteration, or anything in between.

Of course, there are also the things like sort, uniq, and map that would
ruin your whole day -- though map, at least, could be implemented in a
lazier way, for this purpose.
Pratik N. (Guest)
on 2008-12-03 06:51
(Received via mailing list)
Hey Andreas,

I just quickly went through your changes and it looks like a good
start. A few things :

- Could you please open a ticket at
http://rails.lighthouseapp.com/projects/8994-ruby-...
and upload a .diff file with the patch written for Rails ? Check
http://rails.lighthouseapp.com/projects/8994-ruby-...
for more help
- Patch is missing tests
- Those JDBC patches don't really belong to Rails :)

Also, it'll really help if you could join the mailing list while we
talk about your patch. Most of the people are likely to forget CCing
you. But if you don't want to join our lil group, that's not a big
deal :)

Thanks for working on this.

On Sat, Nov 29, 2008 at 3:14 PM, Andreas G. 
<removed_email_address@domain.invalid> wrote:
>
> MyTable.find_each_by_sql("some SQL here") { |my_table| ... }
> I've tested the code for an export of ~80.000 customer data records.
> If you like the idea and want to send feedback, please CC me. I'm not
> Geschäftsführer Burkhard Arrenberg, Jens Gruhl
>
> >
>



--
Cheers!
- Pratik
http://m.onkey.org
acechase (Guest)
on 2008-12-03 11:09
(Received via mailing list)
The one problem I see with this approach is that it does not allow for
the eager-loading of polymorphic associations. In fact, as the code
stands now I couldn't get vanilla :include options to work,
only :joins (but I think that's just a little bug in the parsing of
the options hash).

The problem is, when Rails uses the eager include approach rather than
the join approach it looks through the entire result set to see which
associations need to be queried and then grabs those associations in
separate queries. This is relatively efficient because the object
graph is being built out for all objects at once, my guess is that
doing the same thing one object at a time would be very difficult to
do efficiently.

As an alternative approach, the project I work on uses a home-grown
version of what Andreas mentions as one of the two options -- we fetch
IDs and process them in groups. The one thing I really like about that
approach is all the same finder options still work as expected with no
additional programming required. IMO losing eager includes of
polymorphic associations is a non-starter.

just my 2 cents :)

-ac
Andreas G. (Guest)
on 2008-12-06 21:52
(Received via mailing list)
Am Tuesday 02 December 2008 schrieb Pratik:
> - Patch is missing tests
> - Those JDBC patches don't really belong to Rails :)

Well, it's just that people would wonder about the JDBC behavior. You
risk
an OOM killed process if you work without the patches for large sets of
data.
If the adapter maintainers picked up the changes, the adapter would work
out
of the box.

Reading the other comments about the incompleteness of the #find_each
part,
I do wonder if I better focus on the first part for now.
#find_each_by_sql
is pretty straight forward. Probably the other part can be improved, let
me
see.

In general I see the main benefit in the impressively decreased memory
need
when working with large result sets. It's still a single transaction
which
is what I missed in the other plugins which work around the memory
limitation by multiple selects.

> Also, it'll really help if you could join the mailing list while we
> talk about your patch. Most of the people are likely to forget CCing
> you. But if you don't want to join our lil group, that's not a big
> deal :)

You're right. ;-)

I hope to find some more time over the Xmas holidays to put some more
effort
into the current implementation. I certainly get back then.

> > - write your logic a second times (e.g. use stored procedure)
> >
> > fully supported. JDBC will take lots of RAM for large result sets until
> > I'd be happy if our idea would be picked up and AR would get these
> > OTTO Software Partner GmbH, Freiberger Str. 35, 01067 Dresden, Germany
> > Andreas G. (e-mail: removed_email_address@domain.invalid)
> > Tel. +49-351-49723-140  -  Fax: +49-351-49723-119
> >
> > AG Dresden, HRB 2475
> > Geschäftsführer Burkhard Arrenberg, Jens Gruhl
> >
> >


--
OTTO Software Partner GmbH, Freiberger Str. 35, 01067 Dresden, Germany
Andreas G. (e-mail: removed_email_address@domain.invalid)
Tel. +49-351-49723-140  -  Fax: +49-351-49723-119

AG Dresden, HRB 2475
Geschäftsführer Burkhard Arrenberg, Jens Gruhl
This topic is locked and can not be replied to.