Reducing downtime

We have our RoR apps on cloud servers that can be blown away and rebuilt
with the aid of Chef. The only volatile data on the server is the
database, which we back up periodically and pull off to our backup
server.
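
For context, the backup is nothing exotic; just a nightly dump along
these lines (paths and schedule illustrative; we use PostgreSQL, as
comes up below):

# Nightly cron entry on each app server (names and paths made up).
# pg_dump writes a compressed custom-format dump, then scp pulls it
# off-box so the backup survives loss of the server itself.
0 2 * * * pg_dump -Fc myapp_production > /var/backups/myapp_production.dump && scp /var/backups/myapp_production.dump backup-server:/srv/backups/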

If we back up the database once a day, then should things go wrong and
we restore the database from backup we will be, on average, 12 hours
out of date. At worst (if the system crashes just before the backup is
made) we could be 24 hours out of date.

To reduce the time we lose when we restore a backup we could of course
just take more frequent backups. But this would start to become a
performance issue, not to mention that pulling the backups off the
server would start to eat into our bandwidth.

The final problem is that we are running >25 different applications,
each with its own database (some small, some huge). We tried
replication, but the server we would need to replicate to was a real
monster, bigger than we could afford, and we can’t afford to run >25
duplicate servers either.

Has anyone tackled this problem?

I was imagining an extension to ActiveRecord that created a journal /
log file of changes, such as:

UPDATE Event#1234 start_time="2013-07-03"
UPDATE Event#3453 status="completed", finished_at="2013-07-03 12:15:00"
DELETE Ticket#2222
CREATE Ticket#3453 status="opened", owner="fred"

This journal could be rerun once the database had been restored from
backup, to catch up with the changes that happened since the backup was
made.
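
As a rough sketch of the capture side (module, file name and line
format all made up; only the standard ActiveRecord callbacks are
assumed):

# A minimal sketch: append one line per write to a plain-text journal
# that lives outside the database.
module ChangeJournal
  extend ActiveSupport::Concern

  JOURNAL = File.open("/var/log/myapp/changes.log", "a")
  JOURNAL.sync = true # flush each line; a crash loses at most one entry

  included do
    after_create  { journal("CREATE") }
    after_update  { journal("UPDATE") }
    after_destroy { journal("DELETE") }
  end

  private

  def journal(action)
    attrs = case action
            when "DELETE" then ""
            when "CREATE"
              attributes.map { |k, v| "#{k}=#{v.inspect}" }.join(", ")
            else
              # UPDATE: only the columns that actually changed;
              # `changes` is still populated inside after_* callbacks
              # on Rails of this era (newer Rails uses saved_changes)
              changes.map { |k, (_old, new_val)|
                "#{k}=#{new_val.inspect}"
              }.join(", ")
            end
    JOURNAL.puts "#{action} #{self.class.name}##{id} #{attrs}".strip
  end
end

class Event < ActiveRecord::Base
  include ChangeJournal
end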

I am assuming that any idea I have someone else has already had and
probably implemented (unless it’s completely insane - in which case
some indication as to why it won’t work would be nice).

Any thoughts?

Peter H. wrote in post #1114323:

To reduce the time we lose when we restore a backup we could of course
just take more frequent backups. But this would start to become a
performance issue, not to mention that pulling the backups off the
server would start to eat into our bandwidth.

Any thoughts?

A robust solution is to use database replication and then run your
backups off the replica. This way you have at least two servers that
are kept synchronized constantly. Backups of the replica will also not
impact the performance of the application accessing the source
database.

http://dev.mysql.com/doc/refman/5.0/en/replication.html
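
For example (hostname illustrative), the regular dump can simply be
pointed at the replica instead of the primary:

# Consistent InnoDB dump taken from the replica, so the primary never
# sees the backup I/O.
mysqldump --single-transaction --all-databases -h db-replica.example.com > nightly.sql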

On 3 July 2013 20:25, Robert W. [email protected] wrote:

A robust solution is to use database replication and then run your
backups off the replica. This way you have at least two servers that
are kept synchronized constantly. Backups of the replica will also not
impact the performance of the application accessing the source
database.

We tried that. We use PostgreSQL and Londiste, but when it decided to
break it stayed broken. I spent weeks trying to get things to work
again once it failed. That was with only 2 databases; what would happen
with the >25 databases that we have?

Unless someone dumps a large amount of cash in my lap I have to improve
the situation we are in with what we have to hand, rather than throwing
money around.

On Thu, Jul 4, 2013 at 5:43 AM, Peter H.
[email protected] wrote:

We tried that. We use PostgreSQL and Londiste, but when it decided to break
it stayed broken. I spent weeks trying to get things to work again once it
failed. That was with only 2 databases; what would happen with the >25
databases that we have?

Unless someone dumps a large amount of cash in my lap I have to improve the
situation we are in with what we have to hand, rather than throwing money
around.

You really only have two options here if you ask me: learn to do live
snapshots so you can roll back, or learn to deploy early and often
with small changes that will not drastically break anything and can be
easily rolled back if something bad happens. The latter is what some
companies do; the former is a complicated mix of deciding how you want
to go about it and whether you are doing it right.
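
If the VMs use LVM, for instance, a live snapshot can be as simple as
the following (volume names illustrative, and this is only one way to
do it):

# Take a copy-on-write snapshot of the volume holding the database.
lvcreate --size 5G --snapshot --name db-snap /dev/vg0/db-data

# Later, merge the snapshot back to roll the volume back to that state.
lvconvert --merge /dev/vg0/db-snap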

On 4 July 2013 11:43, Peter H. [email protected]
wrote:

once it failed. That was with only 2 databases; what would happen with
the >25 databases that we have?

Unless someone dumps a large amount of cash in my lap I have to improve
the situation we are in with what we have to hand, rather than throwing
money around.

Sorry to hear you spent some fruitless time there. PostgreSQL supports
various kinds of replication - both built-in streaming replication and
Londiste work very well for many people, so those technology choices
are good ones.
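
For reference, the built-in streaming setup is mostly configuration; on
a 9.x standby it comes down to a recovery.conf along these lines
(connection details illustrative):

# recovery.conf on the standby (the primary also needs
# wal_level = hot_standby and max_wal_senders > 0)
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
# Optional: fall back to archived WAL if streaming lags
restore_command = 'cp /srv/wal_archive/%f "%p"'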

Probably trite to say time = money; I do recognise that the functional
relationship between time and money differs for everybody. In general
the open source ecosystem does have a place for companies that
specialise in building, maintaining and supporting system software to
assist in cases like this.


Simon R. http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

On 4 July 2013 12:19, Simon R. [email protected] wrote:

Sorry to hear you spent some fruitless time there. PostgreSQL supports
various kinds of replication - both built-in streaming replication and
Londiste work very well for many people, so those technology choices
are good ones.

When it was working, and it saved us on one occasion, it worked like a
charm. But then one database started to backlog and nothing I could do
would fix it, including removing and reinstalling Londiste. Given that
this was with only 2 databases, I wasn’t prepared to commit all our
other databases to it. A false sense of security is not security at
all.

Probably trite to say time = money; I do recognise that the functional
relationship between time and money differs for everybody. In general the
open source ecosystem does have a place for companies that specialise in
building, maintaining and supporting system software to assist in cases
like this.

Well I have no money; certainly not the sort of money that would be
needed to pay anyone else to solve this problem for me. Time I have, so
I guess that I will have to see what sort of solution I can come up
with.

On 4 July 2013 12:08, Jordon B. [email protected] wrote:

You really only have two options here if you ask me: learn to do live
snapshots so you can roll back, or learn to deploy early and often
with small changes that will not drastically break anything and can be
easily rolled back if something bad happens.

Deploying is not the problem. It’s things like the file system becoming
corrupted, which can happen at any time. In fact this is what prompted
this post in the first place: the file system on a VM became corrupt
and we had to wipe and rebuild.

“Live with it” is not acceptable either. There is something better than
the current situation. Even if it is not perfect, it can be better.

Logging the changes from ActiveRecord by using the before_* and after_*
hooks might provide an ad hoc transaction journal that could be
replayed against a backup to bring things up to date, at least for
databases that experience a low level of activity.
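
The replay side might look something like this (the parsing is naive
and assumes the quoted-string line format I sketched earlier; it is
definitely not production-ready):

# Re-apply journalled changes to a freshly restored database.
File.foreach("/var/log/myapp/changes.log") do |line|
  action, target, rest = line.chomp.split(" ", 3)
  model_name, id = target.split("#")
  model = model_name.constantize
  case action
  when "DELETE"
    model.where(id: id).delete_all
  when "CREATE", "UPDATE"
    attrs = Hash[rest.to_s.scan(/(\w+)="([^"]*)"/)]
    record = model.where(id: id).first || model.new
    record.id = id
    record.assign_attributes(attrs) # may need to bypass mass-assignment
    record.save(validate: false)    # protection, depending on the setup
  end
end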

On 4 July 2013 12:49, Peter H. [email protected]
wrote:

Well I have no money; certainly not the sort of money that would be
needed to pay anyone else to solve this problem for me. Time I have, so
I guess that I will have to see what sort of solution I can come up
with.

Those communities would welcome your contributions via bug reports
and/or documentation improvements.

Thanks very much.


Simon R. http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

On Thu, Jul 4, 2013 at 3:43 AM, Peter H.
[email protected] wrote:

We tried that. We use PostgreSQL and Londiste, but when it decided to
break it stayed broken.

I’ve never worked with PostgreSQL in production, but ‘Warm Standby’
(http://wiki.postgresql.org/wiki/Warm_Standby) sounds appropriate.
Did you try that and have issues?

And it appears to me that Warm Standby is basically copying over
binary transaction files, which presumably you could store on some
other system (or even e.g. S3) without actually running an alternate
DB server.
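
If that reading is right, the shipping side is just the standard WAL
archiving settings (paths illustrative):

# postgresql.conf on the primary (9.x era)
wal_level = archive
archive_mode = on
# Copy each completed WAL segment off-box; any command that only
# returns success after a durable copy will do (rsync, an S3 upload, ...)
archive_command = 'rsync -a %p backup-server:/srv/wal_archive/%f'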

Just thinking out loud :)

Scaling on a budget - so much fun, eh? Good luck!


Hassan S. ------------------------ [email protected]

twitter: @hassan