Best way to handle multiple tables to replace one big table?

Hello,

I’ve decided to use multiple tables for an entity (e.g. “todo_items1,”
“todo_items2,” “todo_items3,” etc.), instead of just one main table
which could end up having a lot of rows (e.g. just “todo_items”). I’m
doing this to try to avoid a potential future performance drop that
could come with having too many rows in one table.

With that, I’m looking for a good way to handle this in Rails, mainly by
trying to avoid loading a bunch of unused associations for each User
object. I’m guessing that others have done something similar, so there’s
probably a few good tips/recommendations out there.

(I know that I could use a partition for this, but, for now, I’ve
decided to go the ‘multiple tables’ route.)

Each user has their todo_items placed into a specific table. The user’s
“todo items” table is set when the user is created, and all of their
todo_items go into the same table. The data in their todo items
collection is private, so when it comes time to process a user’s
todo_items, I’ll only have to look at one table.

One thing I don’t particularly want to have is a bunch of unused
associations in the User class. Right now, it looks like I’d have to do
the following:

class User < ActiveRecord::Base
  has_many :todo_items1
  has_many :todo_items2
  has_many :todo_items3
  has_many :todo_items4
  has_many :todo_items5
end

class TodoItems1 < ActiveRecord::Base
  set_table_name "todo_items1"  # the inferred table name wouldn’t match
  belongs_to :user
end

class TodoItems2 < ActiveRecord::Base
  set_table_name "todo_items2"
  belongs_to :user
end

class TodoItems3 < ActiveRecord::Base
  set_table_name "todo_items3"
  belongs_to :user
end

The thing is, for each individual user, only one of the “todo items”
tables would be usable/applicable/accessible since all of a user’s
todo_items are stored in the same table. This means only one of the
associations would be in use at any time and all of the other “has_many
:todo_itemsX” associations that were loaded would be a waste.

For example, with a user.id of 2, I’d only need
“todo_items3.find_by_text(‘search_word’)”, but the way I’m thinking of
setting this up (see above), I’d still have access to todo_items1,
todo_items2, todo_items4 and todo_items5.

I’m thinking that these “extra associations” add extra overhead and
make each User object’s size in memory much bigger than it has to be.

Also, there’s a bunch of stuff that Ruby/Rails is doing in the
background which may make this implementation costly. For example, I’m guessing
that there could be some additional method call/lookup overhead for each
User object, since it has to load all of those associations, which in
turn creates all of those nice, dynamic model accessor methods like
“User.find_by_something.”

I don’t really know what Ruby/Rails does internally with all of those
has_many associations though, so maybe it’s not so bad. But right now
I’m thinking that it’s really wasteful, and that there may just be a
better, more efficient way of doing this.

So, a few questions:

  1. Is there some sort of special Ruby/Rails methodology that could be
    applied to this ‘multiple tables to represent one entity’ scheme? Are
    there any ‘best practices’ for this?

  2. Is it really bad to have so many unused has_many associations for
    each object? Is there a better way to do this?

  3. Does anyone have any advice on how to abstract the fact that there
    are multiple “todo items” tables behind a single todo_items
    model/class? For example, so I can call
    “todo_items.find_by_text(‘search_phrase’)” instead of
    “todo_items3.find_by_text(‘search_phrase’)”, or even “@user.todo_items”?
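To give an idea of the kind of abstraction I mean, here’s a rough sketch in plain Ruby (the classes here are empty stand-ins for the real ActiveRecord models, and the method names are made up):

```ruby
# Rough sketch only: pick the right per-user model class by name.
# TodoItems1..TodoItems3 are stand-ins for the ActiveRecord models above.
class TodoItems1; end
class TodoItems2; end
class TodoItems3; end

class User
  attr_reader :todo_table_number

  def initialize(todo_table_number)
    @todo_table_number = todo_table_number
  end

  # The model class holding this user's todo items,
  # e.g. TodoItems3 for a user whose table number is 3.
  def todo_items_class
    Object.const_get("TodoItems#{todo_table_number}")
  end
end

User.new(3).todo_items_class # => TodoItems3
```

Then a call like `user.todo_items_class.find_by_text(...)` would (in theory) always hit the right table without me naming it.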

Thank you!

Mike P. wrote:

Hello,

I’ve decided to use multiple tables for an entity (e.g. “todo_items1,”
“todo_items2,” “todo_items3,” etc.), instead of just one main table
which could end up having a lot of rows (e.g. just “todo_items”). I’m
doing this to try and to avoid a potential future performance drop that
could come with having too many rows in one table.

This sounds to me like the very definition of “Premature Optimization.”

What you’re describing is called “Database Sharding.”

However, I would very highly recommend the simpler design, unless you
are having scaling problems right now. It sounds like you’re at the
beginning stages of design so I doubt that is the case.

Putting optimization before design leads to complex, difficult to
maintain systems. At present you have no verifiable metrics to determine
whether the more complex design will ever be needed. Why pay the cost
for it now? Wouldn’t it be smarter to implement optimizations after you
have solid metrics to prove the need for said optimizations?

Robert W. wrote:

Mike P. wrote:

Hello,

I’ve decided to use multiple tables for an entity (e.g. “todo_items1,”
“todo_items2,” “todo_items3,” etc.), instead of just one main table
which could end up having a lot of rows (e.g. just “todo_items”). I’m
doing this to try and to avoid a potential future performance drop that
could come with having too many rows in one table.

This sounds to me like the very definition of “Premature Optimization.”

Program optimization - Wikipedia

What you’re describing is called “Database Sharding.”

Shard (database architecture) - Wikipedia

However, I would very highly recommend the simpler design, unless you
are having scaling problems right now. It sounds like you’re at the
beginning stages of design so I doubt that is the case.

Putting optimization before design leads to complex, difficult to
maintain systems. At present you have no verifiable metrics to determine
whether the more complex design will ever be needed. Why pay the cost
for it now? Wouldn’t it be smarter to implement optimizations after you
have solid metrics to prove the need for said optimizations?

Hi Robert,

Thank you for your response. I totally agree that not everything needs
to be optimized upfront, but there’s also a problem with waiting for the
metrics: When I have those metrics that prove the need for these
optimizations, the performance will already be getting affected. And
depending on how closely the performance is being watched, it could
already be something that’s annoying to the user, and that’s not good.

That, and I think it’s easier, in this case, to have this built-in from
the start, rather than having to add this in later on which would
(probably) cause me to have to shut down the database when it comes time
to do the actual upgrade, which would be another inconvenience for the
user. I’d also like to avoid having to move the pre-existing data around
for optimization, which I know I’d end up doing if I postpone this.

Creating the multiple tables now, when there’s no real data, takes care
of those issues and helps me get the infrastructure in place and
well-tested.

Also, splitting up the tables upfront will help to delay the need for
some future optimization, since index performance starts to drop once
the tables get too large (e.g. around one million records or so). If I
make five tables then, in theory, that index performance decrease
shouldn’t happen until those tables reach 1 million+ records, which will
take longer since the data is being spread across the tables.

As for sharding, thank you for the advice, but it seems like an even
more complicated solution than just using a few additional tables. When
looking up sharding, I found a number of people who recommend against
it unless it’s absolutely necessary. So, yes, sharding is something I’d
wait for metrics on before implementing.

The links I found for the cons of sharding are from Updates 2, 3 and 4
at the following site:

So, I’m still thinking that I want to split up the tables upfront. I do
agree that “premature optimization” isn’t always a good thing, but it’s
not always a bad thing either. :slight_smile:

(Also, on another note, I’m using PostgreSQL and I only have the one
database server, so moving a table/database off to another server isn’t
an option right now. [And I know that the shards can be on the same
server, but it doesn’t look like something I want to implement right
now.])

So, any advice on a good Ruby/Rails way to handle this?

Thanks,
Mike

Marnen Laibow-Koser wrote:

Mike P. wrote:

Hello,

I’ve decided to use multiple tables for an entity (e.g. “todo_items1,”
“todo_items2,” “todo_items3,” etc.), instead of just one main table
which could end up having a lot of rows (e.g. just “todo_items”). I’m
doing this to try and to avoid a potential future performance drop that
could come with having too many rows in one table.

Robert already said it, and I agree: DO NOT DO THIS. Any decent DB
system is designed to handle tables in the millions of rows without
breathing hard. Just index your tables properly and make sure your
queries are efficient.

If you have to use your DB’s sharding features at some point, go for it.
But don’t even think of doing this in the app layer.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Hi Marnen,

Thank you also for your post and advice. It actually made me think more
about this. So you’re saying that even if I had millions of records in a
table, and was noticing a decrease in performance, that I shouldn’t
split up the tables this way? Would a shard be the only option?

One of the things that I keep coming back to when I think about the
single-table option is that one of the fields will be rather large, like
a blog-post in size. And I plan on indexing that to make it searchable.
Also, there’s a lot of insertions and deletions. Does it still make
sense to include all of that in a single table? Or does that change any
of your advice?
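(For reference, the kind of index I have in mind for that large field is a PostgreSQL full-text index, not a plain btree. Here’s a hypothetical sketch that just builds the SQL I’d expect to run, say from a migration’s `execute`; the table and column names are made up:)

```ruby
# Hypothetical sketch: SQL to make a large text column full-text
# searchable in PostgreSQL via a GIN index over a tsvector, rather
# than splitting the table. Names are illustrative only.
table  = "todo_items"
column = "body"
create_index_sql = <<~SQL
  CREATE INDEX #{table}_#{column}_search
    ON #{table} USING gin (to_tsvector('english', #{column}));
SQL

puts create_index_sql
```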

I agree that this is an application layer split, sort of; but I don’t
think that this particular split is a bad thing.

If I did this in pure Ruby, for example, I’m guessing that the effect
would be minimal (just one class/method that looks up the correct table,
and returns the object for that table). What’s (potentially) making the
app layer bloated is all of the unused ‘has_many’ attributes since I’m
trying to stick with Rails. I really do want to be able to take
advantage of the nice Rails methods and functionality… I’m just trying
to avoid having a bunch of unused associations for each user object.

Is there maybe a way to unload a ‘has_many’ association for an object?
Or perhaps call ‘has_many’ inside of a user-specific method or
something? Like

if @current_user.todo_table_number == 1
  has_many :todo_items1
elsif @current_user.todo_table_number == 2
  has_many :todo_items2
elsif …
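In plain Ruby, I picture it something like the sketch below. (Since `has_many` is a class-level macro, it can’t really be called per-instance like that; this uses a singleton method as a stand-in, just to show the shape of the idea.)

```ruby
# Hypothetical stand-in: give each User object a single reader that
# knows which table its todo items live in, instead of declaring five
# class-level associations that mostly go unused.
class User
  def initialize(todo_table_number)
    @todo_table_number = todo_table_number
    # One per-object method rather than five class-level associations.
    define_singleton_method(:todo_items_table) do
      "todo_items#{@todo_table_number}"
    end
  end
end

User.new(2).todo_items_table # => "todo_items2"
```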

Anyway, the reason why I’m trying to stay away from the database-layer
stuff is because they seem to be a bit ‘much’ right now, and also can
apparently cause problems. For example, even adding foreign keys at the
database-layer can evidently cause issues with testing, and sometimes
they get lost in db:push/db:pull calls.

Okay, how about this. Let’s say that instead of my original post, I
posted saying that I have a large table with 7 million+ rows that keeps
growing. I had already split up the tables in the manner mentioned in
the original post. Also, let’s say that I really wanted to stick with
the-Rails-way and was opposed to sharding. What would your advice be
then?

Thanks again for your help,
Mike

On 25 April 2010 00:45, Mike P. [email protected] wrote:


I think if people could just get over the “don’t optimize too early”
mantra, and realize that this can’t possibly be the best move for
everyone, a lot of future stress could be avoided, for both the business
owner and the customer.

Optimising “too early” is a bad thing by definition. If it was a good
thing then it would not be “too early”.

Colin

Mike P. wrote:

Marnen Laibow-Koser wrote:

Mike P. wrote:

Hello,

I’ve decided to use multiple tables for an entity (e.g. “todo_items1,”
“todo_items2,” “todo_items3,” etc.), instead of just one main table
which could end up having a lot of rows (e.g. just “todo_items”). I’m
doing this to try and to avoid a potential future performance drop that
could come with having too many rows in one table.

Robert already said it, and I agree: DO NOT DO THIS. Any decent DB
system is designed to handle tables in the millions of rows without
breathing hard. Just index your tables properly and make sure your
queries are efficient.

If you have to use your DB’s sharding features at some point, go for it.
But don’t even think of doing this in the app layer.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Hi Marnen,

Thank you also for your post and advice. It actually made me think more
about this. So you’re saying that even if I had millions of records in a
table, and was noticing a decrease in performance, that I shouldn’t
split up the tables this way?

Correct, unless there is no better way to improve things (which will
almost certainly not be the case).

Would a shard be the only option?

Depends on the DB.

One of the things that I keep coming back to when I think about the
single-table option is that one of the fields will be rather large, like
a blog-post in size. And I plan on indexing that to make it searchable.
Also, there’s a lot of insertions and deletions. Does it still make
sense to include all of that in a single table? Or does that change any
of your advice?

That does not change any of my advice – right up until the point where
you have actual, measurable problems. And by then, your data model may
be different anyway. Don’t prematurely optimize, because you may be
optimizing the wrong things.

I agree that this is an application layer split, sort of; but I don’t
think that this particular split is a bad thing.

On what basis do you think that? (I think you’re wrong, but I want to
know your reasoning.)

If I did this in pure Ruby, for example, I’m guessing that the effect
would be minimal (just one class/method that looks up the correct table,
and returns the object for that table).

But it is still inappropriate, because this can be handled more
efficiently on the DB side than on the app side.

What’s (potentially) making the
app layer bloated is all of the unused ‘has_many’ attributes since I’m
trying to stick with Rails. I really do want to be able to take
advantage of the nice Rails methods and functionality… I’m just trying
to avoid having a bunch of unused associations for each user object.

Then don’t try to use a silly schema like the one you proposed. Perhaps
you will ultimately need something like MySQL’s MERGE table, but you
don’t need to worry about that yet.

Is there maybe a way to unload a ‘has_many’ association for an object?
Or perhaps call ‘has_many’ inside of a user-specific method or
something? Like

if @current_user.todo_table_number == 1
has_many :todo_items1
elsif @current_user.todo_table_number == 2
has_many :todo_items2
elsif …

Don’t bother.

Anyway, the reason why I’m trying to stay away from the database-layer
stuff is because they seem to be a bit ‘much’ right now,

Right! They are a bit much right now – because they are premature
optimization. When you need them, they will be the right tools.

and also can
apparently cause problems. For example, even adding foreign keys at the
database-layer can evidently cause issues with testing, and sometimes
they get lost in db:push/db:pull calls.

You apparently are passing on FUD you’ve heard somewhere. I assure you
– from experience – that this is completely untrue.

Okay, how about this. Let’s say that instead of my original post, I
posted saying that I have a large table with 7 million+ rows that keeps
growing. I had already split up the tables in the manner mentioned in
the original post. Also, let’s say that I really wanted to stick with
the-Rails-way and was opposed to sharding. What would your advice be
then?

I would advise you to get over your anti-sharding bias and do things as
I outlined above.

Thanks again for your help,
Mike

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Colin L. wrote:

On 25 April 2010 00:45, Mike P. [email protected] wrote:


I think if people could just get over the “don’t optimize too early”
mantra, and realize that this can’t possibly be the best move for
everyone, a lot of future stress could be avoided, for both the business
owner and the customer.

Optimising “too early” is a bad thing by definition. If it was a good
thing then it would not be “too early”.

Exactly. You’re trying to justify a bad idea, but it’s still a bad
idea.

No one is saying that you shouldn’t do research on possible future
optimizations. But don’t implement them until you know where your
performance problems are, not where you assume they’ll be.

“Premature optimization is the root of all evil.” --Donald Knuth

Colin

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Thank you Colin and Marnen for your responses.

Marnen Laibow-Koser wrote:

Colin L. wrote:

On 25 April 2010 00:45, Mike P. [email protected] wrote:


I think if people could just get over the “don’t optimize too early”
mantra, and realize that this can’t possibly be the best move for
everyone, a lot of future stress could be avoided, for both the business
owner and the customer.

Optimising “too early” is a bad thing by definition. If it was a good
thing then it would not be “too early”.

But by whose definition? Someone stuck that “too early” in there, and
it’s an extremely relative and subjective phrase.

If “too early” is anytime before one has noticed a decrease in
performance (which, by the way, probably has to be fairly significant to
be noticeable by trying to use a site), and has measured it, then how
much worse is the performance going to get between that initial
observation and actually implementing the upgrades?

Is it really worth not doing anything about it until we see it for
ourselves, even when we can use the experience of others to lessen the
effect?

Exactly. You’re trying to justify a bad idea, but it’s still a bad
idea.

No one is saying that you shouldn’t do research on possible future
optimizations. But don’t implement them until you know where your
performance problems are, not where you assume they’ll be.

“Premature optimization is the root of all evil.” --Donald Knuth

Okay, but why is it considered an assumption when there are
dozens/hundreds of posts out there about people dealing with performance
issues of growing tables with a large number of rows?

And what about the cases when your research shows that based on the
expectations for the table (lots of insertions and deletions, table gets
increasingly large as new users come in, etc.), that the evidence
suggests that this particular table will have the same issues as the
tables mentioned in the other posts?

It’s just kind of strange to me that there’s no room for upfront
optimizations, not even little ones meant to keep up good performance
for a longer period.

It’s almost like driving a car that you know is going to run out of gas
on a road trip, into a rural area that you’ve never been to before
(which is basically the road of “not knowing how fast something will
grow”). Before you left, you purposefully chose not to pack an extra gas
canister to “get you a little farther” or anything, because the car
would work fine with the amount of gas you had when you left.

But you know what you have to do when the car does start to run out of
gas: add some fuel.

So you drive and drive, and then you happen to glance down at the
dashboard and notice that the gas light is on. Then, and only then,
you start looking for a gas station. At this point, you get a bit
stressed because you don’t know where you are, and you don’t know how
long the gas will last. Your primary focus is to get more gas. You know
what you have to do, but it will take some time to do it.

So, you just try and keep the car running and hope you get to a gas
station before the car starts chugging or stalls.

The driver knew that the car would run out of gas. He knew of other
situations where people drove their car to that point and ended up
having their car stall (i.e. timeouts and major downtime). The
preventative measure would have been to get more gas before leaving; to
pack an “emergency” gas canister to help the car get to a gas station if
needed (i.e. if the database-layer optimizations weren’t in place in
time); or to plan out the route based on the known fuel efficiency of
the car, so the driver would know exactly when and where to get gas (a
bit complicated, and probably not completely accurate).

So, I’m trying to get the fuel before starting, not when I absolutely
need it.

I know this is a bit of a loose analogy, but hopefully you see where I’m
going. We can learn from the experience of others. You’ve seen or heard
companies mention “growing pains” on their websites, or in podcasts and
interviews. Why wouldn’t I try and lessen the effect of a known
performance issue?

Does what I’m saying make sense though? I’m just trying to take the
experience of others, learn something from it, and prolong the good
performance of the most heavily used table in the database. This will
help keep the customers happy, and make my experience less stressful in
the future.

If you guys tell me that what I’m trying to do will either

  1. Take a long time, or
  2. Won’t prolong the good performance of the database/index (or even
    have a negative effect on it)

then it would be a different story.

Again, I’m not trying to do something drastic here, I’m just asking for
help on how to handle the multiple-table model at the application layer
in a Rails app, to avoid too many ‘has_many’ associations being loaded
for each User object.

Any advice/tips on that?

Thanks again,
Mike

Colin

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Marnen Laibow-Koser wrote:

Mike P. wrote:

Marnen Laibow-Koser wrote:

Mike P. wrote:

Hello,

I’ve decided to use multiple tables for an entity (e.g. “todo_items1,”
“todo_items2,” “todo_items3,” etc.), instead of just one main table
which could end up having a lot of rows (e.g. just “todo_items”). I’m
doing this to try and to avoid a potential future performance drop that
could come with having too many rows in one table.

Robert already said it, and I agree: DO NOT DO THIS. Any decent DB
system is designed to handle tables in the millions of rows without
breathing hard. Just index your tables properly and make sure your
queries are efficient.

If you have to use your DB’s sharding features at some point, go for it.
But don’t even think of doing this in the app layer.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Hi Marnen,

Thanks again for your response. I was thinking about it last night, and
did some more research and a few calculations, and I’m inclined to agree
with you and stick with the single table model. It will make things
easier and the database should be able to handle it for a while.

However, I’m going to reply below as if I’m sticking with the
multi-table model, mostly to see how I really feel about it by typing
some of my thoughts (it sounds kind of weird, but hopefully it makes
sense).

Honestly, I still think that splitting the main table into smaller ones
will have future benefits. Even though there’s a bit of extra work up
front, I’d rather get this small thing over with rather than letting it
turn into a bigger thing, sooner, in the future.

I think if people could just get over the “don’t optimize too early”
mantra, and realize that this can’t possibly be the best move for
everyone, a lot of future stress could be avoided, for both the business
owner and the customer.

Personally, I’d probably change that saying to read “don’t optimize too
much too early.” As they say, an “ounce of prevention is worth a pound
of cure…” it makes sense in the database optimization world too.
:slight_smile:

This multi-table thing isn’t meant to completely avoid the need to do
further optimizations, by the way, and I’m aware that I shouldn’t go
overboard with too much upfront optimization. I know that future, most
likely database-layer, performance tweaks will probably be needed later.
But it’s because I know this that I was thinking about an easy-to-do
change that I could do now, that will keep the performance better, for
longer.

Anyway…

Hi Marnen,

Thank you also for your post and advice. It actually made me think more
about this. So you’re saying that even if I had millions of records in a
table, and was noticing a decrease in performance, that I shouldn’t
split up the tables this way?

Correct, unless there is no better way to improve things (which will
almost certainly not be the case).

Would a shard be the only option?

Depends on the DB.

It’s PostgreSQL running on a single database server, in case you were
wondering.

One of the things that I keep coming back to when I think about the
single-table option is that one of the fields will be rather large, like
a blog-post in size. And I plan on indexing that to make it searchable.
Also, there’s a lot of insertions and deletions. Does it still make
sense to include all of that in a single table? Or does that change any
of your advice?

That does not change any of my advice – right up until the point where
you have actual, measurable problems. And by then, your data model may
be different anyway. Don’t prematurely optimize, because you may be
optimizing the wrong things.

I see your point here. I don’t think the model for this table will
change much though (I know, I know), but I do see your point. However,
from the other side, even if the model does change, I could always just
update all of those tables with the migration(s), which I’d still have
to do with the one-big-table model.

I agree that this is an application layer split, sort of; but I don’t
think that this particular split is a bad thing.

On what basis do you think that? (I think you’re wrong, but I want to
know your reasoning.)

Well, pretty much because it’s not much different than accessing the
rest of the tables in the database.

When I want to look up a user, I go to the User table; I may have
another table for user preferences which I’d also have to access.
Looking at todo items for a user involves me looking at a single table
in both cases; the main difference between the multiple-table and
single-table setups is that there’s an extra bit of indirection there.

It could be as simple as:

  1. Looking at the “which_todo_table” column (which would be a number,
    I’m thinking)
  2. Appending that number to the end of a string (so, “todo_items” << “3”
    becomes “todo_items3”)

So, there’s an extra column in the User table that tells which table to
look at to get that user’s todo items, and the app layer chooses that
table when it needs to.
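As a sketch, step 2 above really is just string building (the column value is shown as a literal here, standing in for what would be read from the User record):

```ruby
# Sketch of the lookup described above: the User row carries a
# `which_todo_table` number, and the table name is built from it.
which_todo_table = 3  # value that would come from the User record
table_name = "todo_items" + which_todo_table.to_s
table_name # => "todo_items3"
```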

I agree that the database can handle this sort of split more efficiently
than the application layer, especially when it comes to things like
sharding or partitioning; but I’m looking at the multi-table model as a
sort of “soft-partitioning.” The data gets “split” up in the database
(via tables), and the indexes stay at about a fifth of the size of the
single-table methodology. Similar to partitioning, but without modifying
the database in a way that Rails doesn’t know about.

So, the difference is that I’d be using Ruby/Rails to “decide” which
table to look at, instead of having a pre-set constant that’s stored by
Rails and used whenever a model is accessed.

With this, as a benefit, everything’s at the application layer. I can
see exactly what’s going on, and if I move servers or something, I won’t
have to deal with the potentially complicated, database-specific
sharding stuff that can only be seen by investigating the database
layer.

It’s just another table. That’s how I’m looking at it.

If I did this in pure Ruby, for example, I’m guessing that the effect
would be minimal (just one class/method that looks up the correct table,
and returns the object for that table).

But it is still inappropriate, because this can be handled more
efficiently on the DB side than on the app side.

I agree, but it’s also more complicated to have to set up the database
to manage all of that stuff. Handling things on the DB side adds
another “moving part,” so to speak. It will have to be managed, tested,
and revisited when changes are being made.

Don’t get me wrong, I’m not opposed to DB-layer tweaks and
modifications, but if I can avoid/delay complicated DB-layer changes by
doing some work upfront – especially in this case where the modification
isn’t database-level – then I’ll at least look into doing so.

The other main benefit is that DB performance will be much more
resilient to surges of data. I’ll have more time to implement those
DB-level changes, if/when needed, and the decrease in performance will
be much more gradual.

Considering that making a change such as this has such a
(theoretically) small impact on the app-layer codebase, has no effect on
the database itself, and allows the database/app to perform better for
longer - thereby delaying the slowdowns and the need for potentially
complicated, not-easily-portable changes - isn’t it worth looking into?

You must have seen people who are noticing their database performance
dropping, and who are scrambling to fix their issues. They pop up in
forums quite often, no? And how do they fix it? With more servers and/or
complicated procedures; with data migrations, extra layers, and then
downtime. All the while, they’re feeling pressure because the
performance lag is noticeable, and getting more noticeable as time
passes.

It happens so often, everyone knows that it’s going to happen, and yet
trying to optimize a bit upfront is discouraged.

And, of course, in every DB performance upgrade push, there’s another,
even more important part of the story here: the users.

For however long it takes to get these changes implemented, and
sometimes from even before the changes start, the users are noticing
these slowdowns, and possibly even time-outs. They’re seeing notices on
the site about how the company is working to make the site faster, blah,
blah, blah. They don’t look at it as “oh, yeah, they followed the
hard-and-fast rule about not optimizing too early, there’s nothing they
could’ve done to prevent this, so it’s okay, this is just what happens.”
Instead, it looks like they didn’t plan this properly, or that they
weren’t prepared.

So then, my question is, why – if people can expect these performance
issues to happen as a site grows – why is it so frowned upon to try and
do a little bit of upfront work to delay that from happening… and also
when it starts to happen, to have better performance during that period?

What’s (potentially) making the
app layer bloated is all of the unused ‘has_many’ attributes since I’m
trying to stick with Rails. I really do want to be able to take
advantage of the nice Rails methods and functionality… I’m just trying
to avoid having a bunch of unused associations for each user object.

Then don’t try to use a silly schema like the one you proposed. Perhaps
you will ultimately need something like MySQL’s MERGE table, but you
don’t need to worry about that yet.

Not yet, true. But it’ll come. So why not do something about it? In that
case, do you have any other recommendations for something small that can
be done upfront to keep performance up in the future? I’ve already got
the indexes covered… anything else? Or should I seriously just do
nothing but the indexing?

Is there maybe a way to unload a ‘has_many’ association for an object?
Or perhaps call ‘has_many’ inside of a user-specific method or
something? Like

if @current_user.todo_table_number == 1
  has_many :todo_items1
elsif @current_user.todo_table_number == 2
  has_many :todo_items2
elsif …
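As an aside, the conditional above could be collapsed into a single lookup method instead of per-branch ‘has_many’ calls. This is only a hypothetical sketch following the naming in this thread (the `todo_table_number` attribute and the `TodoItemsN` class names come from the posts above); plain Ruby classes stand in for the ActiveRecord models, and in Rails you would likely use `constantize` rather than `Object.const_get`:

```ruby
# Hypothetical sketch: resolve the user's todo model from
# todo_table_number instead of declaring five has_many associations.
# Empty classes stand in for the ActiveRecord models.
class TodoItems1; end
class TodoItems2; end
class TodoItems3; end

class User
  attr_reader :id, :todo_table_number

  def initialize(id, todo_table_number)
    @id = id
    @todo_table_number = todo_table_number
  end

  # One method replaces the has_many conditional: return the class
  # that backs this particular user's todo table.
  def todo_items_class
    Object.const_get("TodoItems#{todo_table_number}")
  end
end

user = User.new(2, 3)
puts user.todo_items_class  # => TodoItems3
```

With this, a lookup like the one in the original post becomes `user.todo_items_class.find_by_text('search_word')`, without any unused associations being declared.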

Don’t bother.

Back to my question: why not? It’s just a small preventative measure.
Mind you, if I were attempting to do some multi-server sharding right
now, then you’d be totally right.

Anyway, the reason why I’m trying to stay away from the database-layer
stuff is because they seem to be a bit ‘much’ right now,

Right! They are a bit much right now – because they are premature
optimization. When you need them, they will be the right tools.

But not all premature optimizations are the same. Would you tell someone
not to worry about indexing until they get to a point where they have
the metrics (i.e. a few hundred or thousand users/rows)? The database
will work well without indexes until it has a few hundred or thousand
rows, yet people are told to take care of indexing upfront.

This isn’t a super-crazy thing I’m trying to do here, just some
optimization that would help at the million+ mark, instead of at the
thousand+ mark.

and also can
apparently cause problems. For example, even adding foreign keys at the
database layer can evidently cause issues with testing, and sometimes
they get lost in db:push/db:pull calls.

You apparently are passing on FUD you’ve heard somewhere. I assure you
– from experience – that this is completely untrue.

Well, it’s not hearsay, if that’s what you’re saying. I could have been
reading something that doesn’t apply to Rails 2.3.5 or Rails 3, but one
of them was from someone using a specific host and, after doing a
db:push or db:pull, the foreign keys that he set up were gone. The issue
was due to a plugin that didn’t support foreign keys.

As for the testing issue, it’s something to do with the fixtures not
being loaded in the proper order, with the result that the test database
wasn’t getting set up properly. That looked like it needed a patch or
something to fix it, so maybe it wasn’t a super big deal.

Besides that, as far as I know (and I could be wrong) Rails doesn’t
support foreign keys out of the box. That’s why plugins like Foreigner
exist. I don’t have the links for these offhand, but if you’d like them,
I can go back and find them. I read about both of these within the past
week.

But this is beside the point…

Okay, how about this. Let’s say that instead of my original post, I
posted saying that I have a large table with 7 million+ rows that keeps
growing. I had already split up the tables in the manner mentioned in
the original post. Also, let’s say that I really wanted to stick with
the-Rails-way and was opposed to sharding. What would your advice be
then?

I would advise you to get over your anti-sharding bias and do things as
I outlined above.

Well, it’s not so much an anti-sharding bias as an attempt to keep the
app/database cleaner for longer, to keep it performing better for
longer, and to avoid upsetting customers with decreased performance, by
giving myself some additional time to make any required drastic changes
like sharding.

Sharding and other complex/time-consuming optimizations should indeed be
left until the metrics show that it’s going to be needed soon. I agree
with that part. But sharding is also a few orders of magnitude more
complex than what I’m trying to do.

I certainly don’t agree that every optimization should be supported by
data and held off until then, especially when we all know what happens
when database performance goes down.

Waiting until a known potential problem (i.e. large-table performance
drops) actually becomes a problem has consequences. I’m just trying to
delay it. Splitting up the tables will work; it’s not that complicated,
and it keeps the supporting code at the application level.

So, any best practices or tips for how to support/implement this
multiple table model would be great, and super appreciated.

Thanks again for your help,
Mike

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]


On 26 April 2010 02:08, Mike P. [email protected] wrote:

Optimising “too early” is a bad thing by definition. If it was a good
thing then it would not be “too early”.

And what about the cases when your research into the expectations for
the table (lots of insertions and deletions, a table that gets
increasingly large as new users come in, etc.) suggests that this
particular table will have the same issues as the tables mentioned in
the other posts?

If you absolutely know that there is going to be an issue with the
performance then it is not premature optimisation, it is part of the
specification of the app. If, however, as you say, “the evidence
suggests that this particular table will have the same issues …”
then it is only a suggestion and not a certainty, so what is the point
of doing up-front work delaying initial deployment? Particularly when
the suggested optimisation may not be the best way of solving the
problem (which may never occur).

A long career in s/w development has shown me that the key performance
issues in an app are very rarely in the areas that one expected at the
start.

Colin

Michael P. wrote:

On 26 April 2010 02:08, Mike P. [email protected] wrote:

So, I’m trying to get the fuel before starting, not when I absolutely
need it.

I think it seems to some that what you might be doing is filling the
back seats of your car with jerry cans full of fuel when all you’re
doing is popping down the road to pick the kids up from school.

Good analogy. And the weight of those jerricans will probably reduce
your fuel efficiency.

But you have put your finger on the issue yourself - the “too soon” is
subjective; if it’s not too soon for you, then it’s perfectly
okay…

No. It’s not subjective. If the optimization is speculative, don’t do
it!

As to the issue of waiting till users are affected, you don’t have to.
It is possible to keep an eye on performance stats, see a problem coming
before your users notice, and fix the problem.
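That “see a problem coming” idea can be sketched as a simple timing check. This is a hypothetical illustration only (the threshold, label, and `timed` helper are all made up here; a real app would feed timings into a monitoring tool rather than `warn`):

```ruby
require "benchmark"

# Hypothetical sketch: time a block and warn when it crosses a
# threshold, so slowdowns show up in the logs before users complain.
SLOW_THRESHOLD = 0.5 # seconds; tune against your own baseline

def timed(label)
  elapsed = Benchmark.realtime { yield }
  warn "SLOW: #{label} took #{elapsed.round(3)}s" if elapsed > SLOW_THRESHOLD
  elapsed
end

# Stand-in for a database query:
elapsed = timed("todo_items lookup") { sleep 0.01 }
puts elapsed >= 0.01  # => true
```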

but there’s been a few posts here (and I agree with them)
that what you describe as the application’s purpose does not wave a
flag for something that is in desperate need of DB sharding in the
development phase…

Right. And also the fact that DB sharding belongs in the DB and should
not be reinvented in the app layer. (More on this in another post.)

but only you can know based on your
requirements and measurements, and hopefully you take the suggestions
as constructive input rather than as criticism.

And you can’t know. The app doesn’t exist yet, so you don’t know what
your usage patterns will actually be.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]


On 26 April 2010 13:16, Marnen Laibow-Koser [email protected]
wrote:

And you can’t know. The app doesn’t exist yet, so you don’t know what
your usage patterns will actually be.

I agree totally… but I was being nice to the OP; and he may prefer to
make his own mistakes to learn from (or not learn :-)

Thank you all, so very much, for your great, thoughtful posts and solid
advice. You all have excellent points. I’ve decided to take your advice
as wise counsel, and stick with the single-table model. :-)

Hi,

if you find you DO have to shard in your application, here’s the
approach I used:

todo_items_<user_id>

I created a base-level todo_items table (which held no rows) and then
created each todo_items_<user_id> table with the MySQL statement:

create table todo_items_<user_id> like todo_items

Migration can be a bit of a pain though!

Once you know the user_id in your controller, you can just do

TodoItem.set_table_name “todo_items_#{@user.id}”

Just my 2 pennies worth

Allan
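Putting Allan’s pieces together, here is a self-contained sketch of the idea. Only `set_table_name` is a real Rails 2.x API; the `create_table_sql` helper is hypothetical (in the post that SQL is run by hand), and a plain Ruby class mimics just the table-name switching of an ActiveRecord model:

```ruby
# Sketch of the per-user-table approach from Allan's post. TodoItem
# would normally inherit from ActiveRecord::Base; here a plain class
# mimics only the table-name switching.
class TodoItem
  @table_name = "todo_items" # the empty template table

  class << self
    attr_reader :table_name

    # Rails 2.x API: point the model at a specific user's table, as in
    # TodoItem.set_table_name "todo_items_#{@user.id}"
    def set_table_name(name)
      @table_name = name
    end

    # Hypothetical helper: the MySQL statement that clones the template
    # table's structure for a new user.
    def create_table_sql(user_id)
      "create table todo_items_#{user_id} like todo_items"
    end
  end
end

TodoItem.set_table_name("todo_items_42")
puts TodoItem.table_name           # => todo_items_42
puts TodoItem.create_table_sql(42) # => create table todo_items_42 like todo_items
```

One caveat with this design is that the table name is class-level state, so two requests for different users served by the same process must not interleave their `set_table_name` calls.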


Thanks again everybody!

Mike