Potential concurrency issues with a global hash to reduce DB

Hi folks:

I’m building something akin to Google Analytics and currently I’m doing
real time database updates. However this isn’t really scaling anymore,
so I need to stop doing synchronous DB operations. In the short term,
I’m trying to reduce DB writes, so I’m thinking about a global hash (say
declared in environment.rb) that is accessible from my controllers and
models that I can write to in lieu of writing to the DB. Every so often
I can have a task write the updates that need to be written to the DB.
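
Roughly what I have in mind (just a sketch; the constant, model and
column names below are made up):

# declared once, e.g. at the bottom of environment.rb
STATS_BUFFER = Hash.new(0)

# in a controller or model, instead of issuing an UPDATE per request:
STATS_BUFFER[user_id] += 1

# every so often, a task flushes the buffer and resets it
# (note: no locking here, which is part of what I'm worried about)
STATS_BUFFER.each do |user_id, count|
  User.update_counters(user_id, :api_request_count => count)
end
STATS_BUFFER.clear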

Question:

  1. Does this sound reasonable?
  2. Will I run into any concurrency problems? Which ones?
  3. How does this compare with writing to the file system and importing
    later?

Thanks so much in advance, you ROCK!

If you’re not running Rails in threadsafe! mode you won’t have any
concurrency issues. A message queue could also be a solution to your
problem.
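
For instance, a very rough in-process illustration with Ruby’s standard
library Queue (a real setup would use an external broker or a job
library; the model and column names are placeholders):

require 'thread'

WRITE_QUEUE = Queue.new

# request path: enqueue the increment and return immediately
WRITE_QUEUE << user_id

# a single worker thread drains the queue and does the actual DB writes
Thread.new do
  while (id = WRITE_QUEUE.pop)
    User.update_counters(id, :api_request_count => 1)
  end
end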

Maurício Linhares
http://codeshooter.wordpress.com/ | http://twitter.com/mauriciojr

On Mon, Aug 17, 2009 at 8:56 PM, Frank

Frank P. wrote:

Hi folks:

I’m building something akin to Google Analytics and currently I’m doing
real time database updates. However this isn’t really scaling anymore,
so I need to stop doing synchronous DB operations. In the short term,
I’m trying to reduce DB writes, so I’m thinking about a global hash (say
declared in environment.rb) that is accessible from my controllers and
models that I can write to in lieu of writing to the DB. […]

  1. Does this sound reasonable?

No! Databases should scale as much as you need them to – that’s what
they are designed for. If your DB is not scaling, there is something
wrong, either in your DB setup or in the way the app is using the DB.
Find the problem and fix it rather than ripping out the DB.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Frank P. wrote:

Marnen: here are some more details. I need to be able to handle at least
10 API requests / second. Let me clarify what happens:

  1. User makes a RESTful API request
  2. I find a record in a database, return it as JSON
  3. I record the request counter for the user in the database (i.e. if a
    user makes 2 API calls, I record ‘2’).

#1 and #2 are really fast in SQL - they are SELECTs. #3 is really slow,
because it’s an UPDATE. In the real world, my database (MySQL) is NOT
scaling.

Well, there’s your first problem – you’re using mySQL. :) Try
PostgreSQL instead.

(That was admittedly snarky. Although PostgreSQL is certainly the
better choice, mySQL should still work.)

According to New Relic, #3 is taking most of the time.

What does that query look like? A simple UPDATE query should not be a
major performance hit, so there must be something else going on.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Marnen: here are some more details. I need to be able to handle at least
10 API requests / second. Let me clarify what happens:

  1. User makes a RESTful API request
  2. I find a record in a database, return it as JSON
  3. I record the request counter for the user in the database (i.e. if a
    user makes 2 API calls, I record ‘2’).

#1 and #2 are really fast in SQL - they are SELECTs. #3 is really slow,
because it’s an UPDATE. In the real world, my database (MySQL) is NOT
scaling. According to New Relic, #3 is taking most of the time.
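
In controller terms the action looks roughly like this (ApiRecord,
stats_id and the counter call are just illustrative):

def show
  record = ApiRecord.find(params[:id])                        # steps 1-2: fast SELECTs
  StatisticsApi.increment_counter(:count_request, stats_id)   # step 3: the slow UPDATE
  render :json => record
end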

Any suggestions?

Frank P. wrote:

The query is just an UPDATE with a COALESCE keyword in it. All it does
is increment counters:

UPDATE statistics_api SET count_request = COALESCE(count_request,
?) + ? WHERE (id = ?)

This one took 8,553 ms according to New Relic. The weird part is that
there is an index on ‘id’ (as well as a few other columns) on this
table. I just can’t figure out why the query’s taking so long …

8.5 seconds? For that?!? You might want to ask someone who knows more
about mySQL configuration than I do, but clearly your DB setup has
problems. If you run that query when your DB is not otherwise being hit
by your app, how long does it take?

BTW, what does the first placeholder represent? I’d assume it would
always represent 0, in which case you can hard-code it.
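
For what it’s worth, that COALESCE shape is what ActiveRecord’s counter
helpers generate. If the update comes from a call along these lines
(StatisticsApi and stats_id are stand-ins), the first placeholder would
indeed always be 0 and the second would be the increment:

StatisticsApi.update_counters(stats_id, :count_request => 1)
# generates: UPDATE statistics_api SET count_request = COALESCE(count_request, 0) + 1 WHERE (id = ...)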

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

This query is lightning fast when my server is NOT under load. Just dead
slow when I’m seeing a lot of load. You’re right, the first placeholder
is probably redundant…

Frank P. wrote:
This query is lightning fast when my server is NOT under load. Just
dead
slow when I’m seeing a lot of load. You’re right, the first
placeholder
is probably redundant…

That would likely indicate that the DB is pretty dern busy fulfilling
reads (though someone should analyse that to prove it). When doing
this kind of logging it often becomes necessary to have a second
database which is optimized for writes to take the continual updates
so that your read/write-infrequent data set is left alone. Those log
writes should also be fired off as an asynchronous task so that the
rest of the page/request is not slowed down waiting for the write to
occur.
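
One way to get the "second database" part without touching the rest of
the app is to point just the logging model at it (the :statistics_writer
key here is a made-up database.yml entry):

class StatisticsApi < ActiveRecord::Base
  # reads and writes for this model go to a separate, write-tuned DB
  establish_connection :statistics_writer
end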

I agree that yanking the DB in favor of in-memory workload doesn’t
seem appropriate yet.


def gw
writes_at ‘www.railsdev.ws’
end

Greg, what do you recommend to do async writes? I’m a newbie to async
tasks in Rails, but doesn’t something like delayed_job ALSO use the
database to keep track of jobs?

Frank P. wrote:

Greg, what do you recommend to do async writes? I’m a newbie to async
tasks in Rails, but doesn’t something like delayed_job ALSO use the
database to keep track of jobs?

It may be possible to use the database’s own clustering features to do
this and still let your app treat it as if it were one simple DB. If
not, would the masochism plugin help?

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

We use InnoDB. I did some more research on this, and it turns out that
my app was literally overloading the db (non-clustered) with UPDATEs.
Since all these updates do is increment counters, I cached the
increments in memory and flush them every once in a while. This
dramatically raised my app’s performance, without the need for a
complicated database setup.
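
The gist of it, with a mutex so concurrent requests can’t trample the
hash (a rough sketch, not the exact code):

require 'thread'

COUNTS = Hash.new(0)
COUNTS_LOCK = Mutex.new

# per request: just bump the in-memory counter
COUNTS_LOCK.synchronize { COUNTS[stats_id] += 1 }

# periodic flush (background thread or rake task): swap the buffer out,
# then issue one UPDATE per id instead of one per request
pending = COUNTS_LOCK.synchronize { copy = COUNTS.dup; COUNTS.clear; copy }
pending.each do |id, count|
  StatisticsApi.update_counters(id, :count_request => count)
end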

Wojciech P. wrote:

Which MySQL engine do you use for this table? MyISAM is faster for
read operations, but locks the whole table for an update or an insert
and this might affect the performance a lot. InnoDB engine uses row-
level locking. Note that PostgreSQL does too.

Have you thought about using document DBs like Tokyo Cabinet, MongoDB
or others? They can handle huge loads.

On Aug 18, 5:39 am, Frank P. [email protected]
