Hey, we run a website of fairly decent volume… up to nearly 4m
pageviews a day.
At the moment we run a single machine with nginx and MySQL, and two
worker machines with memcached and Tornado instances. The nginx server
is a reverse proxy to the workers and also serves static media.
The CPU load and memory usage on both of the worker boxes are well
within reasonable expectations.
What I am observing is that nginx gets to about 320 requests per second,
then requests start backing up, sometimes taking the server down; see
this image: http://dl.dropbox.com/u/367355/nginx.png
When the server doesn't go down, we see requests flatten out around
the 320 mark, and the number of "waiting" requests and the memory usage
of nginx spike considerably.
I've tried upping the number of workers in case all of them were blocking
long enough to cause a cascading effect (the Tornado db driver is not
async), but adding more didn't really improve things. I've also added
lots of async memcached access to avoid hitting the db too much.
I’ve included the configs below… thanks for any help you may have!
user www-data;
worker_processes 4;
worker_rlimit_nofile 32768;
error_log /dev/null crit;
pid /var/run/nginx.pid;

events {
    worker_connections 8192;
    use epoll;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    access_log /dev/null;

    sendfile on;
    keepalive_timeout 0;
    tcp_nodelay on;

    gzip on;
    gzip_types text/css text/plain text/javascript
               application/x-javascript application/json;
    gzip_comp_level 5;
    gzip_disable "msie6";

    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
Are you certain it's Nginx and not Tornado? You might try Tornado's
blocking-log threshold:

import tornado.ioloop

# issue a warning if the IOLoop blocks for over 200ms
tornado.ioloop.IOLoop.instance().set_blocking_log_threshold(0.2)
Also you don’t mention how many Tornado backends you have. If you
don’t have at least one Tornado backend per Nginx worker, you are
probably wasting your time trying to tune Nginx.
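For reference, that usually looks like an upstream block with one server
entry per Tornado instance; a minimal sketch (the hosts and ports here
are placeholders, not from your setup):

upstream tornado_backends {
    # one entry per Tornado instance on each worker box
    server 10.0.0.1:8001;
    server 10.0.0.1:8002;
    server 10.0.0.2:8001;
    server 10.0.0.2:8002;
}

server {
    listen 80;
    location / {
        proxy_pass http://tornado_backends;
    }
}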
As an aside, you might check out ngx_postgres or ngx_drizzle for async
db access from Tornado (they expose the database over HTTP, so you can
use Tornado's async httpclient).
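A minimal sketch of the ngx_postgres style, along the lines of its
README (the connection details and query are made up for illustration);
Tornado would then hit /db with its async httpclient:

upstream database {
    postgres_server 127.0.0.1 dbname=app user=app password=secret;
}

server {
    location /db {
        postgres_pass  database;
        postgres_query "SELECT id, name FROM items";
    }
}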
Cliff
Kevin, thanks for your reply… I've turned off keepalive because the app
is a mobile app with very simple js and css, so there is very little
reason to have keepalive. I'll try turning it back on to test, though.
Cheers!
Hey Cliff, thanks for the reply. I mentioned in the second post to this
thread that I have a total of 30 workers, 15 on each machine… there
are 4 CPUs on each machine… the extra processes are to pick up any
slack from blocking DB access.
I have indeed used the IOLoop blocking debug… coming here really is a
last resort for me! The IOLoop debugging showed some areas I could
improve: the DB access is obviously unavoidable, but there were also
some CPU-intensive spots, which I fixed. To avoid too many DB accesses
I'm using an async memcached driver. Now I'm in a situation where the
IOLoop debugging issues hardly any messages, the CPU usage is fairly
low, and I'm hardly touching the db!
That’s why I’m here … unfortunately I’ve covered the things you’ve
mentioned.
D
Those .js and .css files are good enough reason to have keepalive turned on.
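If you re-enable it for the test, it's just one directive in the http
block (65s is a common example value, not a tuned recommendation):

keepalive_timeout 65;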
Regarding your original issue: how much time does it take to generate a
single response from Tornado? I'm asking because you said you've got 30
blocking workers, which means that if a single response takes around
100ms, you can handle only about 300 req/s.
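Spelled out: max throughput = workers / time per response = 30 / 0.1 s
= 300 req/s, which is right around the ~320 req/s ceiling you're seeing.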
Most workers aren't blocking… they only block when they hit the db,
and we are doing lots of caching. At its peak MySQL is registering 140
requests per second. I've added more workers, which has had no effect on
the capacity going through nginx… so either the DB itself is causing
problems (unlikely, since it's such a simple schema with no joins) or
something is up with the nginx config.