Usage patterns and trusted libraries

davidjrice · July 27, 2007, 12:28am

Hi y’all,

This may end up being regarded as an incendiary posting, but it’s not
meant to be. This is just an observation from a relative Ruby (in
general) and Rails (in particular) newb.

So I’m beavering away at my lovely little start-up desk and really
rather enjoying Ruby (in between the moments of utter frustration)
and I start coding up some ETL processes to load and merge masses of
data into my bouncing baby web-system. And all is relatively good
until I get to my first tricky merge process where I have to
disambiguate names and otherwise harmonize my various data sources.

The process takes over 12 hours to run using ActiveRecord to provide
my DB access. For 5500 records.

I tweak DB indices. I get out CachedModel. I read a lot of code and
eat heaps of metaprogrammed object spaghetti. I run benchmarks. And I
finally conclude that my access patterns are totally defeating the
metaprogramming and requiring excessive DB traffic, even though I
can’t really prove it.

So eventually I rewrote the program using a different language (which
I don’t mention to avoid starting a flame-war - I could have used Ruby
on top of the MySQL interfaces) with a cache strategy that is better
suited to the DB access pattern.

The new process takes 55 seconds.

The moral of the story: there isn’t one really. If I was 100% sure
that I wouldn’t need to re-run the data (either because of
undiscovered bugs or b0rk3n data from the vendor) the 12 hour run
would have been an efficient use of my time, probably. But performance
does matter when it has an impact on the amount of time I have to
spend waiting for critical-path processes. If there is a moral, it is
simply: know your tools. And that community excitement doesn’t
substitute for good documentation.

Anyway, I am still happy with both Ruby and Rails. But this was a
lovely opportunity to re-learn a lesson I’ve learned too many times
before.

david rush

davidjrice · July 27, 2007, 12:59am

On Jul 26, 2007, at 6:25 PM, David R. wrote:

until I get to my first tricky merge process where I have to
disambiguate names and otherwise harmonize my various data sources.

The process takes over 12 hours to run using ActiveRecord to provide
my DB access. For 5500 records.

david rush

David Rush on the Web – a very messy web^Wconstruction site

Although it may be too late, might I suggest that

ActiveWarehouse ETL

could be a good place to (re-)start?

The Rubyforge site for it is:
http://rubyforge.org/frs/?group_id=2435

-Rob

Rob B. http://agileconsultingllc.com

davidjrice · July 27, 2007, 1:02am

David R. wrote:

Hi y’all,

Hi!

So eventually I rewrote the program using a different language (which
I don’t mention to avoid starting a flame-war - I could have used Ruby
on top of the MySQL interfaces) with a cache strategy that is better
suited to the DB access pattern.

The new process takes 55 seconds.

So, the question is, why did you pick ActiveRecord in the first place,
and could there have been a better vetting process to select the most
appropriate DB tool?

The moral of the story: there isn’t one really. If I was 100% sure
that I wouldn’t need to re-run the data (either because of
undiscovered bugs or b0rk3n data from the vendor) the 12 hour run
would have been an efficient use of my time, probably. But performance
does matter when it has an impact on the amount of time I have to
spend waiting for critical-path processes. If there is a moral, it is
simply: know your tools. And that community excitement doesn’t
substitute for good documentation.

Second moral: There are many ways to build apps in Ruby; Rails is but
one. A few hours of research can save many more hours later on.

–
James B.

“A language that doesn’t affect the way you think about programming is
not worth knowing.”

A. Perlis

davidjrice · July 27, 2007, 6:39am

David R. wrote:

Hi y’all,

This may end up being regarded as an incendiary posting, but it’s not
meant to be. This is just an observation from a relative Ruby (in
general) and Rails (in particular) newb.

[snip]

Facts shouldn’t be considered incendiary. I thought your account
sounded fair.

I’m not surprised that AR is slower than direct db access. I am
surprised if it is THAT much slower. But of course, you can do
direct db access in any language, including Ruby.

Direct access to the database probably isn’t terribly railsy,
of course… but a factor of 700 slowdown is obviously
unacceptable.

And the fact that it is such a large factor makes me think that
something must be wrong. I am sure this is not the typical
person’s experience…

Hal

davidjrice · July 27, 2007, 8:31am

On Jul 26, 11:57 pm, Rob B. wrote:

On Jul 26, 2007, at 6:25 PM, David R. wrote:

The process takes over 12 hours to run using ActiveRecord to provide
my DB access. For 5500 records.

Although it may be too late, might I suggest that
ActiveWarehouse ETL

Thank you. I will definitely take a look at it. There’s a lot more ETL
to do.

david rush

davidjrice · July 27, 2007, 8:33am

Perhaps I should have moved
over to a ‘production’ environment?

Duh.

Regards,
Rimantas

davidjrice · July 27, 2007, 8:26am

On Jul 27, 5:38 am, Hal F. wrote:

person’s experience…

Yes. The more I think about it, the more astonished I am. Especially
when my benchmarks which did straight row reads from the DB ended up
running within 20% of each other. I think the main source is a lot of
the ‘convenience’ features of the rails environment. As an example, I
know that I put in a lot of effort to stop Rails from introspecting
the columns of my habtm’s - this is a huge time-waster, even if it
is handy when you’re rapidly prototyping. Perhaps I should have moved
over to a ‘production’ environment?

I’m sure that the single biggest thing was being able to directly
implement a cache policy that suited the application, rather than
manipulating it at the long end of a long pole. That fact is probably
also indicative of a number of other small multiplicative factor
errors (many of which are my fault I’m sure) which when compounded
cause the explosion in processing time. Nearly 3 decimal orders of
magnitude is just a huge margin - if I didn’t know that I used the
same algorithms, I’d be assuming that they must have been changed.

And that’s why I just threw it out as a data point. I’m a fairly
experienced professional. What I found frustrating was the difficulty
I had in discovering my performance issues - which I am sure I can’t
all lay at the feet of RoR. In my private musings after I posted last
night, I recalled the ‘agile development’ value expressed in the
introduction of Agile Web Development with Rails that prefers
working code to extensive documentation. Well I know of a few managers
in my day that could have learned to moderate their stance a bit based
on that advice, but the flip side is that documentation is crucial to
re-usability. But that’s a rabbit-hole I don’t particularly want to
explore today.

david rush

davidjrice · July 27, 2007, 8:55am

On 7/27/07, David R. wrote:

should be as alike as possible - this also extends to dev unless you
want to spend a lot of time chasing issues on massively unfamiliar
systems.

Would production have given me a speed-up of 3 decimal orders of
magnitude?

david rush

I don’t know that I could put a figure on it like that, but in
development mode, each request reloads all your models and
controller(s). I’m a bit fuzzy on the exact reloading details, but this
is what makes it so that in development mode you don’t need to restart
the server each time you make a change.

It could have a huge impact. I guess it depends on your app.

HTH
Daniel

davidjrice · July 27, 2007, 8:46am

On Jul 27, 7:32 am, “Rimantas L.” wrote:

Perhaps I should have moved
over to a ‘production’ environment?

Duh.

Perhaps I should have also included the implicit sarcasm implied by
the necessity of distinguishing a dedicated ‘production’ environment.
Good QA practice will nearly always tell you that test and production
should be as alike as possible - this also extends to dev unless you
want to spend a lot of time chasing issues on massively unfamiliar
systems.

Would production have given me a speed-up of 3 decimal orders of
magnitude?

david rush

davidjrice · July 27, 2007, 9:14am

2007/7/27, David R.:

I tweak DB indices. I get out CachedModel. I read a lot of code and
The new process takes 55 seconds.

Makes me wonder: why did you choose a different language? As you said
yourself, you could have implemented the same strategy in Ruby as
well. Also, it seems for 5500 records you do not need a caching
strategy - you could just slurp in all the stuff into mem, do your
transformations and write it back. It seems with this approach even
AR would have provided sufficient performance, wouldn’t it?
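
A minimal sketch of the slurp-it-all-into-memory idea, in plain Ruby with no ActiveRecord involved. The Person struct, the normalize rule and the sample names are all made up for illustration; the thread never shows the real schema or the real disambiguation rules.

```ruby
# Hypothetical record shape; the real schema is not shown in the thread.
Person = Struct.new(:id, :name, :source)

# Illustrative normalization only: lower-case, drop punctuation, and sort
# the name parts so that "Smith, Jane" and "Jane Smith" compare equal.
def normalize(name)
  name.downcase.gsub(/[^a-z ]/, "").split.sort.join(" ")
end

# Group every row by its normalized name in a single in-memory pass and
# keep only the groups that contain more than one row (merge candidates).
def merge_candidates(rows)
  rows.group_by { |row| normalize(row.name) }
      .select { |_key, group| group.size > 1 }
end

rows = [
  Person.new(1, "Smith, Jane", "vendor_a"),
  Person.new(2, "Jane Smith",  "vendor_b"),
  Person.new(3, "John Doe",    "vendor_a")
]

merge_candidates(rows).each do |key, group|
  puts "#{key}: ids #{group.map(&:id).join(', ')}"
end
# prints "jane smith: ids 1, 2"
```

The point is only that once the rows are in memory, the matching pass makes no further database round trips; the rows themselves would come from one bulk read instead of a literal array.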

Kind regards

robert

davidjrice · July 27, 2007, 10:31am

From: David R.

The process takes over 12 hours to run using ActiveRecord to provide

my DB access. For 5500 records.

at 7 sec per record, that is interestingly terrible
can you post sample code?
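
For reference, a quick Ruby check of the figures quoted in this thread. The inputs (12 hours, 55 seconds, 5500 records) come from the original post; since the run took "over 12 hours", the results are lower bounds. The method names are just for this sketch.

```ruby
# Figures from the original post.
RECORDS      = 5_500
SLOW_SECONDS = 12 * 60 * 60   # the ActiveRecord run, at least 12 hours
FAST_SECONDS = 55             # the rewritten process

def seconds_per_record(total_seconds, records)
  total_seconds.to_f / records
end

def slowdown_factor(slow_seconds, fast_seconds)
  slow_seconds.to_f / fast_seconds
end

per_record = seconds_per_record(SLOW_SECONDS, RECORDS)
factor     = slowdown_factor(SLOW_SECONDS, FAST_SECONDS)

puts format("%.1f s per record", per_record)                         # 7.9 s per record
puts format("%.0fx slowdown", factor)                                # 785x slowdown
puts format("%.1f decimal orders of magnitude", Math.log10(factor))  # 2.9 decimal orders of magnitude
```

So "7 sec per record", "a factor of 700" and "nearly 3 decimal orders of magnitude" are all roundings of the same ratio.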

kind regards -botp

davidjrice · July 28, 2007, 3:46pm

On Jul 27, 8:13 am, “Robert K.” wrote:

Makes me wonder: why did you choose a different language?

Because I found the Ruby AR code to be so heavily metaprogrammed, so
inbred, and so weakly documented (internals, not API) that after three
days of crawling around inside the code I figured that it would be
faster to hack together a weak system that was based on the queries,
rather than the relatively nice model I had to work with for the UI.
And because in other languages I know the assumptions made by the
interpreters, virtual machines and compilers in much greater detail
and can therefore accurately code for CPU-efficiency.

As you said
yourself, you could have implemented the same strategy in Ruby as
well. Also, it seems for 5500 records you do not need a caching
strategy - you could just slurp in all the stuff into mem, do your
transformations and write it back. It seems with this approach even
AR would have provided sufficient performance, wouldn’t it?

One would have thought. I did slurp all 5500 (BTW, this is an
intermediate load of a total data set of 25,000 - I have another
dataset of 20,000 waiting in the wings, and a deal I’m doing for
another million or so :). Those 5500 required another 26000+ records
of various other model types in order to get them properly embedded in
the model - which is still not a big deal. However, slurping the
entire DB of over 250,000 records seemed like a potentially bad
idea, given that I don’t have a good feel for the memory overheads in
Ruby yet.

I had a lot of difficulty figuring out where to hook the AR find
mechanisms to avoid multiple searches for the same records. Looking
back, I think I could do it now, but that is also because I have
dropped my expectation of what AR (or various net-available plugins)
will do and would just write my own access layer on top of AR.
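
A rough sketch of what such an access layer might look like, reduced to a read-through cache in plain Ruby. LookupCache and its loader block are hypothetical names for this example; this is not how ActiveRecord or any plugin does it, and a real loader would run the actual query.

```ruby
# A tiny read-through cache: each key hits the backing store at most once.
class LookupCache
  attr_reader :misses

  # The block stands in for the expensive lookup (for example a DB query).
  def initialize(&loader)
    @loader = loader
    @store  = {}
    @misses = 0
  end

  def fetch(key)
    # key? rather than a truthiness test, so nil results are cached too.
    return @store[key] if @store.key?(key)

    @misses += 1
    @store[key] = @loader.call(key)
  end
end

# Stand-in loader; it just builds a string instead of querying anything.
categories = LookupCache.new { |id| "category-#{id}" }

[7, 7, 3, 7, 3].each { |id| categories.fetch(id) }
puts categories.misses   # prints 2: one load each for ids 7 and 3
```

With a cache like this in front of the lookups, repeated searches for the same record cost one query plus hash reads, which is the access pattern described above.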

Don’t you just love the perpetual recurrence of the glueware-on-
glueware antipattern?

david rush

davidjrice · July 28, 2007, 3:57pm

On Jul 27, 9:30 am, Peña, Botp wrote:

From: David R.

The process takes over 12 hours to run using ActiveRecord to provide

my DB access. For 5500 records.

at 7 sec per record, that is interestingly terrible
can you post sample code?

Reducing the context to fit in a news article would be difficult. I
suspect that the problematic data structure is a denormalized category
tree - at 6 levels deep with an average breadth of 7, I suspect there
was a lot of reloading going on while AR loaded more data than was
needed in this particular application.

Mind you, all the loads performed by AR would have been totally useful
for the UI - so this is not so much a criticism as an engineering
post-mortem. There are very few tools that do many things well.

david rush