Multiple customers - keeping the data separate - how?

Phlip wrote:

So you have wall-to-wall unit tests, right?

No, and neither do you. Even if you think you have :wink:

NeilW

Neil W. wrote:

Phlip wrote:

So you have wall-to-wall unit tests, right?

No, and neither do you. Even if you think you have :wink:

I can add an ‘assert(false)’ to any block in my program, and tests will
get to it.

(The remaining useless debate centers on a useful definition for “wall
to wall”.)

The more security you want, the more tests you need.


Phlip

“Neil W.” [email protected] wrote:

I’m convinced this can be done in a simple and effective manner.

I’m sure it can. Based on your comments you seem to be going for a
coarser-grained solution than the problem I’m trying to solve calls
for.

Doing this through DNS implies a number of things:

  1. You can wait for DNS propagation (assuming you’re talking about
    the Internet and not an intranet).
  2. You have the liberty to create a separate application cluster per
    customer (all using essentially the same code base, with a config per
    customer).

Assuming this is true for you, I would agree that you can do this
pretty simply.

My situation is different.

I need quick provisioning for new customers (on the order of seconds
to a couple of minutes), and I cannot consider automatic provisioning
of a new application cluster each time for a new customer.

In my situation I have to be able to share application clusters
(running on many machines) with a number of databases on the back end.

To CWK’s points:

I’m not saying that building a multi-tenant system by breaking up the
DB by tenant is trivial in my case, but it is also not as dire as you
portray it. But I generally agree with you: I would like to have
everything in one DB for maintenance, but reality is forcing my hand.

Firstly: the application I’m working on is a live Internet application
and is already database limited. Performance of the middleware is not
even remotely a factor. Our Web servers are basically asleep. So my
primary concern is scaling the DB layer.

We’ve already investigated a number of possible cluster/federation
schemes and they do not scale nearly as well as the vendors would like
you to believe. In our tests, data partitioning per customer has given,
by far, the best overall performance boost.

My comment about big numbers is this: No matter what your DB solution is
there is some number of aggregate rows in the DB where performance will
diminish “quickly”. Generally speaking, you go about and index your
data to get better (read) performance, but indexing provides the maximal
benefit when either the indexes of all your hot tables can fit in RAM
or result in very few hits to disk. However, as time goes on fewer and
fewer of your indexes will fit in RAM, and even your less hot tables
and indexes become significant. Your DB starts to become disk bound.
“Disk bound” is one foot in the grave. So what to do? I can’t ignore
it.

Right now we have logically partitioned customer data in the same
DB. All living cozy inside the same schema. This does make many
things easy, but it makes scaling VERY hard.

At large numbers things start to behave very differently as
off-the-shelf solutions kind of cease to work, period.

Well, I think that you would agree that this is hyperbole. Off-the-shelf
solutions, RDBMSes you mean, are clearly not the fastest things on
the planet, but they keep the data malleable. I’m not aware of
alternatives that have both the relative ease of data manipulation of
RDBMSes and the reasonably good performance they possess. (Ah, perhaps
Google has some goodies in-house; alas, I’m not Google.)

Keeping RDBMSes operating at a healthy level can be done for a long
time, but you will eventually need to give up some comfort. In this
case the “all-in-one-db” approach.

This is a !@#$-load of tricky plumbing to avoid setting and watching
client entitlements on database rows. Let alone the havoc this could
wreak with managing the database–depending on which one you use, this
could complicate how you deal with tablespaces and such.

No doubt that there is some plumbing that needs to be put into place.
But I’m doing this to gain performance primarily. I gain the
performance on two levels: the DB server in question is operating on
relatively “small” DBs meaning intrinsically improved performance,
then there are logical performance improvements I can make. Right now
I have to do checks on security for the “owners” as well as other users
(our application allows clients to publish their data to other users of
our system as well as the general public). Once an owner (or an
assistant, who has the same privileges as the owner) has logged in, I
need to perform no checks on access to their data.

  • Incremental application migration

This is beneficial if you want to maintain multiple software versions.
If that’s the case then you might as well just install complete app
instances per client and be done with it. Been there many times, will
never do it again unless building something like an ERP where it still
makes some sense.

Agreed, this can be a pain. I wasn’t referring to maintaining an
arbitrary number of versions of the app, just no more than two at once:
old and new. This situation would only be temporary, while a roll-out
was occurring.

  • Overall better performance

TANSTAAFL. If your database server is running twenty database
instances, there is going to be some kind of performance hit to that
versus one DB with tables 20 times larger. The overhead associated
with connection pools and query caches et al. could in many cases be
much larger than the hit of scanning tables 20 times longer. I just
don’t accept this as an open-shut benefit right off the bat.

The benefit comes from the fact that I can run one or more DB instances
on a given DB server. How I want to tune performance is entirely up to
me. In fact, whatever tuning trick you can do in a single DB instance
to gain performance I can do with a one-DB-per-customer setup, but the
converse is not true: there are things that can be done in a
one-DB-per-customer configuration that cannot be done in a single-DB
approach. I’m not saying that I can accomplish this easily, but it can
be done.

  • The ability to manage performance better (one big hot client can
    be moved to their own Db server)

There’s no reason you can’t do this with a multi-tenant system too.
For that matter you can run a special client on their own complete
system instance with no or very little fancy plumbing.

In my case the DB, not the web application is the problem. Otherwise,
yes, I agree with you.

Not to mention that you may find (as I did) that clients want/like
human-readable backups, not SQL dumps.

But with a per-client DB approach I get the ability to back up and
restore data on a per-client basis, a far more regular occurrence.
And I can do this using high-performance tools without writing
anything (except my app to work like this). As for the “human
readable” part, I could also write such a script. Also, in case you
haven’t tried it, serializing with Ruby is REAL SLOW, and loading the
data with Ruby is no race winner either. Clearly no one has to be
confined to using Ruby to do this. But yet again the per-client DB
approach wins for flexibility out of the box.

I still think the “not trivial” aspect understates it by two-thirds.

Fair enough.

operate of all the approaches I’ve been involved with.
Possibly, but I generally doubt it.

Well like I said above I agree that it poses certain challenges–you
end up needing to build a high-performance application even though all
your customers are 5-seat installations. I do agree that this is
ultimately probably an issue best solved in the database, but I’m not
sure that the approach posited here isn’t trading getting stabbed for
getting shot.

In my case (unlike Neil) I’m doing this explicitly for the performance
benefits I can attain.

Your warnings have been heard, I will take them into account. Much
appreciated.

Jim P.

Coming in a bit late here… This is an issue we have had for quite a
while, as we store financial data and it absolutely cannot get mixed
up. IMO this is one area where some logic should go in the database,
and the easiest solution is using a database that gives you the right
tools. You can absolutely keep client data separate and have it all
in one database and normalized by using functions and views, at least
with databases like postgresql and oracle. We make a few adjustments
here and there, such as not being able to use some of the AR methods
for inserts and updates, but a small library of custom methods is a
whole lot easier than having hundreds of databases.

A bigger issue is proper testing and good change management habits.
Most bugs I see in working production systems appear when some developer
gets the itch to upgrade something or fix an existing bug and pushes
it into production without adequate testing. The other leading cause
would be making too many changes over too short a time period. If
reliability and data integrity are at the top of your list, then the
fact is you just have to be more conservative with how often you
change stuff or add new features. You can have the best system in the
world for keeping your client data separate, but if your people have
bad habits it won’t matter.

Chris

We’re actually planning on doing something exactly like this. In our
case, each tenant represents an institution, with up to hundreds of
users. We’re not concerned about quick provisioning of new tenants -
signing one up and migrating them is a large, manual process
regardless. Having one db, one OS user, and one domain name per tenant
simplifies a lot.

  • Frees the code from having to track tenants. Otherwise, every row
    would need a tenant_id, and every find would have to scope to
    tenant_id
  • Built in, bullet proof data partitioning
  • Ability to move tenants to separate servers, or let them host their
    own

Implementation is straightforward:

  • All tenants run off the same codeline
  • The codeline is checked out once - to one place
  • Each tenant has their own environment - which is identical except for
    the db
  • and their own domain name, which is tenantsname.ourapp.com
  • deployment / new versions are run by script against all the dbs /
    mongrels
  • one mongrel per tenant - all running off the same code dir - but in
    different env, and different db

Now, the only thing which really concerns me is the fact that we’re
stuck with 1 Mongrel per tenant. With a lot of tenants, and each
mongrel using 20-50MB of memory, that could get ugly. It’s possible
that the swap file will handle all of this - swapping out Mongrels
belonging to tenants that aren’t online - but this won’t help much, as
during peak times nearly every tenant will be using the system.

One very simple solution to this would be to mod ActiveRecord to not
use persistent database connections. This could be something as simple
as an around_filter, establishing a connection to the appropriate db,
and tearing down afterwards. This would let all of our mongrels be
used for any tenant.
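The around_filter idea above can be sketched roughly like this. TENANT_SPECS, around_tenant and the FakeConnection stand-in are invented for illustration; in a real Rails app the two marked lines would be ActiveRecord::Base.establish_connection and remove_connection:

```ruby
# Invented per-tenant connection specs, keyed by tenant name.
TENANT_SPECS = {
  "acme"   => { database: "app_acme" },
  "globex" => { database: "app_globex" }
}

# Stand-in for a real database connection object.
class FakeConnection
  attr_reader :database
  def initialize(spec)
    @database = spec[:database]
  end
end

# Open a connection to the tenant's database for the duration of the
# block, then hang up, so any mongrel can serve any tenant.
def around_tenant(tenant)
  spec = TENANT_SPECS.fetch(tenant)
  conn = FakeConnection.new(spec)   # ActiveRecord::Base.establish_connection(spec) in a real app
  yield conn
ensure
  conn = nil                        # ActiveRecord::Base.remove_connection in a real app
end
```

The per-request connect/teardown cost is the trade-off discussed just below.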

Persistent db connections aren’t necessarily that helpful, anyway.
With MySQL, for instance, on a LAN, they’re hardly noticeable. I know
that for SQL Server, MS stopped recommending them as well, their
feeling being that if you have a lot of apps on 1 db, it’s better for
each one to hang up when they’re done. Better the overhead of
connect/teardown than of keeping numerous dormant connections.

Another possible concern I have is session collision - although I’m not
sure if this is even possible - I need to investigate the different
ways Rails handles sessions.

Last, we’ve already written a little code to help with some of the
unique db issues - enforcing that only one tenant ever uses one db.

Neil, if you or anyone else is interested in collaborating to help make
the scripts and tools needed to make this a reality, please speak up.
(Please keep posts to list, not private email.)

Neil W. wrote:

I see this as akin to an operating system. Let’s get processes working
and see how they handle things before we invent threads. It may be that
Moore’s Law rides to the rescue again.

I want to see how this works. Let’s build it, but let’s build the
simplest thing that will work first - total separation and a separated
tenancy provisioning system.

Agreed. Consider the project started. And with the motto “make the
simplest thing that could possibly work”.

I think the first task is to expand Capistrano to be able to tell it to
run one task for a list of environments. Migrate all the environments,
restart all the mongrels, take 'em all down.

SCM check outs remain the same - we’ll use one SCM branch for all the
instances.

I’m also working on a simple tool for cron / daemon jobs - again, one
cmd to start/stop them all for all of the environments.

I can understand the desire to try to get the Mongrel count down, but
the worry I have with reusing Mongrels is that the ObjectSpace is
potentially polluted with ActiveRecord data from a previous tenant. I
don’t want to add the complexity of database separation only to find
that the separation has broken down because I’m recycling ObjectSpaces
and there is a cyclic graph in my object hierarchy keeping old AR
instances out of the clutches of the garbage collector.

I see this as akin to an operating system. Let’s get processes working
and see how they handle things before we invent threads. It may be that
Moore’s Law rides to the rescue again.

S. Robert J. wrote:

Neil, if you or anyone else is interested in collaborating to help make
the scripts and tools needed to make this a reality, please speak up.
(Please keep posts to list, not private email.)

I want to see how this works. Let’s build it, but let’s build the
simplest thing that will work first - total separation and a separated
tenancy provisioning system.

NeilW

You can deal with a lot of your application security by just using
associations correctly.

A before filter sets a user object based on the value in session.

@user = User.find session[:user]

In ProjectController, the list method is something like this

def list
  @projects = @user.projects
end

There’s simply no need to worry about screwing up the relationships as
long as you track what user owns things. When you save the data, make
sure you save the owner of that record on every table and then let the
relationships work themselves out.

Worried about extra database hits? Then use eager loading where
appropriate.
Use a before_filter for the project controller that eager loads the
projects
for a user. Maybe even load more stuff. Or create methods on the user
object to do your loading.
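As a rough illustration of the association-based scoping described above, with eager loading done once per request: the classes and data here are invented in-memory stand-ins, not Rails itself (in Rails the lookup would be `User.find session[:user], :include => :projects`):

```ruby
# Invented stand-in model classes.
class Project
  attr_reader :name
  def initialize(name)
    @name = name
  end
end

class User
  attr_reader :id, :projects
  def initialize(id, projects)
    @id = id
    @projects = projects
  end
end

# Invented stand-in data store, keyed by user id.
USERS = { 1 => User.new(1, [Project.new("Apollo"), Project.new("Gemini")]) }

class ProjectsController
  attr_reader :projects

  # Would be wired up with `before_filter :load_projects` in Rails.
  def load_projects(session)
    # Eager load the user together with their projects, once:
    @user = USERS.fetch(session[:user])
    # Every later use of @projects is already scoped to the owner,
    # with no extra per-project query.
    @projects = @user.projects
  end
end
```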

Neil W. wrote:

Unfortunately the application I have in mind involves account data, and
I can’t afford a bug in an application exposing one customer’s data to
another. I need something more substantial than that. (And there are
other reasons - such as backup). However I still want to share physical
infrastructure.

My thoughts are that there should be a URL per customer driving their
own mongrels locked onto their own version of the database. However the
standard infrastructure support tools don’t support that way of doing
things.

This seems crazy to me.

Amazon manage to be very secure and they certainly don’t have one
database or appserver per client. I can’t imagine many (any?)
service-based webapps that would do this.

If application level security is good enough for, say, my bank or ebay,
is it really not enough for you ?

A.

p.s. I spotted some reference to 37s later in the thread; they’ve tended
to use a separate URL per client (client1.backpackit.com /
client2.backpackit.com), but all this does is tell the app what subset of
data to restrict to (i.e. an additional join clause). It’s not an app
instance per client.

Maybe the solution would be to use a virtualization system? Like the
one that is available under Linux (Xen)

  • initial virtual partition can be prepared by software in a minute
  • each virtual machine is insulated from the others
  • you don’t have to fear maintaining dozens of real servers
  • backup once backup all
  • lower costs
  • versioning mechanism of the virtual partitions (quite instant rollback
    in case of failure, just after a big maintenance task for example)

If you can’t afford DNS propagation, use one TCP port for each client on
the frontend, then forward them to each virtual server.

My 2 cents…

Alan C Francis wrote:

My thoughts are that there should be a URL per customer driving their
own mongrels locked onto their own version of the database. However the
standard infrastructure support tools don’t support that way of doing
things.

This seems crazy to me.

Amazon manage to be very secure and they certainly don’t have one
database or appserver per client. I can’t imagine many (any?)
service-based webapps that would do this.

I agree. I’ve seen two approaches:

  1. Perform scoping in-database by using views and triggers. A stored
    procedure is used to set up the views for the specific customer or user.

  2. Perform scoping in the application. We’ve been using around_filter in
    Rails to wrap entire controllers in a with_scope. However, reading
    recent threads on Rails-core, with_scope will go protected which will
    make this approach extremely impractical.

I’m no fan of option #1 because its behavior isn’t explicit or
traceable. From experience I know that even (e.g.) PostgreSQL itself
doesn’t like that black box – its query planner just fails to perform
necessary optimizations that would otherwise have been obvious.
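For what it’s worth, the scoping mechanics behind option #2 boil down to something like this bare-bones stand-in (Record, ROWS and the condition hash are invented; Rails’ real with_scope does considerably more, but the shape is the same):

```ruby
class Record
  # Invented in-memory rows belonging to two different customers.
  ROWS = [
    { id: 1, customer_id: 1, name: "alpha" },
    { id: 2, customer_id: 2, name: "beta"  }
  ]

  @scope = {}

  class << self
    # Merge conditions into the active scope for the duration of the
    # block, then restore the previous scope (this is what an
    # around_filter would wrap the whole controller action in).
    def with_scope(conditions)
      previous = @scope
      @scope = @scope.merge(conditions)
      yield
    ensure
      @scope = previous
    end

    # Every find sees the active scope automatically, so actions never
    # need to remember the customer_id condition themselves.
    def find_all
      ROWS.select { |row| @scope.all? { |key, value| row[key] == value } }
    end
  end
end
```

Inside `Record.with_scope(customer_id: 1) { ... }` every find_all only sees customer 1’s rows; outside the block the scope is gone.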

Seeing how my idea of going about option #2 is going to be deprecated in
Rails, I share your curiosity as to what is the optimal solution.
Starting every single action with a with_scope sure may be traceable but
its repetition seems greatly inefficient.

Very interested to hear your ideas!

  • Roderick

Cw K. wrote:

This is actually something we’re revisiting to see if there’s a better
way as we are looking to allow clients to define custom entitlement
schemes. My experience, at least in a B2B environment, is that
entitlement schemes always become more complex over time. Part of me
doubts whether there is a good generalized approach to this at the
framework level.

Good point that may not be far from the truth. After all, such schemes
are not the common case and so frameworks may not provide for them.

I too am in favor of doing it in-application. But truth be told, we have
a system with an equally complex authorization scheme (ACLs based on
role, division and subdivision) and we’re doing that rather successfully
in-database. It’s even passed the test of evolution as the schemes
indeed grew more complex.

  • Roderick

Roderick van Domburg wrote:

  2. Perform scoping in the application. We’ve been using around_filter in
    Rails to wrap entire controllers in a with_scope. However, reading
    recent threads on Rails-core, with_scope will go protected which will
    make this approach extremely impractical.

I’m on the record here as preferring the in-application approach for
reasons as already stated.

In our case, the world of possible actions is too complex to make a
simple filtering security model practical. In our case, we have not only
clients to worry about, but user groups and individual user permissions.
Determining the list of allowable actions for User A at Point B involves
a number of tests.

This is actually something we’re revisiting to see if there’s a better
way as we are looking to allow clients to define custom entitlement
schemes. My experience, at least in a B2B environment, is that
entitlement schemes always become more complex over time. Part of me
doubts whether there is a good generalized approach to this at the
framework level.

nuno wrote:

Maybe the solution would be to use a virtualization system? Like the
one that is available under Linux (Xen)

I see Xen as part of the solution, but not in the way that you imagine.

NeilW

Roderick van Domburg wrote:

Seeing how my idea of going about option #2 is going to be deprecated in
Rails, I share your curiosity as to what is the optimal solution.
Starting every single action with a with_scope sure may be traceable but
its repetition seems greatly inefficient.

Just keep on doing. You don’t have to agree with the core - you can
just send(:with_scope, params).

But, even better: it’s protected, not deprecated. Define a method
with_scope_for_user(user) in your model, mark it public, and have it
call with_scope. That’s much better anyway.
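A minimal sketch of that suggestion, using a stand-in with_scope so the visibility mechanics can be seen outside Rails (the current_scope reader and the stand-in body are invented purely so the example is observable; in a real model the body would call the genuine protected with_scope with find conditions):

```ruby
class Project
  class << self
    attr_reader :current_scope   # exposed only so the example is observable

    # Public entry point the rest of the application calls. Calling the
    # protected with_scope from inside the class is perfectly legal.
    def with_scope_for_user(user, &block)
      with_scope({ user_id: user }, &block)
    end

    protected

    # Stand-in for Rails' (protected) with_scope: stash the conditions
    # while the block runs, then clear them.
    def with_scope(conditions)
      @current_scope = conditions
      yield
    ensure
      @current_scope = nil
    end
  end
end
```

Outside the class, `Project.with_scope(...)` raises NoMethodError because the method is protected, but `Project.with_scope_for_user(user) { ... }` works fine.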

S. Robert J. wrote:

Agreed. Consider the project started. And with the motto “make the
simplest thing that could possibly work”.

I think the first task is to expand Capistrano to be able to tell it to
run one task for a list of environments. Migrate all the environments,
restart all the mongrels, take 'em all down.

That depends how you separate the tenants. If you make a tenant a Unix
user, then the job is (potentially) trivial:

for word in $(cat list_of_tenants); do cap -s user=$word -a update; done
for word in $(cat list_of_tenants); do cap -s user=$word -a restart; done

I’m also working on a simple tool for cron / daemon jobs - again, one
cmd to start/stop them all for all of the environments.

Again in theory if you make a tenant a Unix user, then the cron jobs
all run in the user’s crontab in the user space, and so do all the
daemons for that tenant. So restarting them just needs a dose of
‘killall’ and a script running as the correct user.

You can use the @reboot facility of cron to bring the Mongrels up for
a tenant when the machine starts, and a daily cron entry to restart
them to keep memory under control.

I like the idea of tenant = Unix user. It has a certain conceptual
charm to it, and if I can make it work it gives me a ton of leverage
from the base Unix tools.

Barking?

NeilW

Cw K. wrote:

Part of me

doubts whether there is a good generalized approach to this at the
framework level.

Does a tenant ever need to see another tenant’s data in a manner that
couldn’t be achieved simply by giving an individual a user id in both
tenant’s user list?

You see I still see the user list, group list, access control lists and
authentication/authorisation role system within an application space.
You have to do that and the structure is indeed different and evolving
for every application there is.

But the tenant can be moved to framework level, cos a tenant is just
a good old fashioned user at infrastructure level and half the job is
already done by the standard Unix user tools.

You’ve got to admit that

rake remote:exec ACTION=“invoke” COMMAND=“adduser new_tenant”
SUDO=“yes”
cap -s user=new_tenant -a cold_deploy
rake remote:exec ACTION=“invoke” COMMAND=“invoke-rc.d apache2 reload”
SUDO=“yes”

has a certain succinct charm to it. I wonder how close to this ideal I
can get and how much it costs in real terms?

NeilW

Two great articles discussing exactly this:

On 12/12/06, S. Robert J. [email protected] wrote:

Two great articles discussing exactly this:

One thing I would add to this is that even when using separate
databases or schemas, it pays to design your tables as if the data
were all in one database/schema.

Also as an FYI for those that are interested. We spent a good amount
of time working on different ways to use Rails in an environment where
user data was separated by schemas. One thing that’s worked fairly
well is the set_table_name method, which can be used to set the
schema.tablename at the start of each request. At a slight hit in
performance we actually do something like the following:

  • Start of request
  • set_table_name ‘schema.table’
  • Do stuff
  • set_table_name ‘none’
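The request lifecycle above might be sketched like so. Report, with_tenant_schema and the schema-per-tenant naming are assumptions for illustration; set_table_name itself is the real old-style ActiveRecord class method, here replaced by a stand-in that just records the value:

```ruby
# Stand-in model: old-style ActiveRecord offers set_table_name; this
# version merely remembers what it was set to.
class Report
  class << self
    attr_reader :table_name

    def set_table_name(name)
      @table_name = name
    end
  end
end

# Around each request: point the model at the requesting tenant's
# schema, do the work, then reset so nothing leaks into the next
# request served by the same process.
def with_tenant_schema(schema)
  Report.set_table_name("#{schema}.reports")
  yield
ensure
  Report.set_table_name("none")
end
```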

V. Interesting. Thanks for that.

BTW You’ll be glad to hear that the Multi-tenant system is progressing
(at snail’s gallop, but at least it’s moving forward). I have a
brittle proof of concept up on a Debian Etch Xen platform.

One of the interesting side effects of using Capistrano to deploy code
once per tenant is that file system sessions suddenly scale rather
well.

Since multi_tenant is built entirely as a set of Capistrano recipes
and plugins I’ll probably run any posts on the Capistrano group rather
than here - where it may get lost in the noise.

Stay tuned

NeilW