Hey, we run a website of fairly decent volume… up to nearly 4m
pageviews a day.
At the moment we run a single machine with nginx and MySQL, and two
worker machines with memcached and Tornado instances. The nginx server
is a reverse proxy to the workers and also serves static media.
The CPU load and memory usage on both of the worker boxes are well
within reasonable expectations.
What I am observing is that nginx gets to about 320 requests per second,
then requests start backing up, sometimes taking the server down; see
this image: http://dl.dropbox.com/u/367355/nginx.png
When the server doesn't go down, we see requests flatten out around
the 320 mark, and the number of "waiting" requests and the memory usage
of nginx spike considerably.
I've tried upping the number of workers in case all of them were blocking
long enough to cause a cascading effect (the Tornado db driver is not
async), but adding more didn't really improve things. I've also added
lots of async memcached access to avoid hitting the db too much.
I’ve included the configs below… thanks for any help you may have!
user www-data;
worker_processes 4;
worker_rlimit_nofile 32768;
error_log /dev/null crit;
pid /var/run/nginx.pid;

events {
    worker_connections 8192;
    use epoll;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    access_log /dev/null;

    sendfile on;
    keepalive_timeout 0;
    tcp_nodelay on;

    gzip on;
    gzip_types text/css text/plain text/javascript
               application/x-javascript application/json;
    gzip_comp_level 5;
    gzip_disable "msie6";

    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
Are you certain it's Nginx and not Tornado? You might try Tornado's
blocking-log threshold:

import tornado.ioloop

# issue a warning if the IOLoop blocks for over 200ms
tornado.ioloop.IOLoop.instance().set_blocking_log_threshold(0.2)
Also you don’t mention how many Tornado backends you have. If you
don’t have at least one Tornado backend per Nginx worker, you are
probably wasting your time trying to tune Nginx.
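For reference, that usually looks like an upstream block with one server
entry per Tornado instance; a minimal sketch (the hosts and ports here
are placeholders, not from your setup):

upstream tornado_backends {
    # one entry per Tornado instance on each worker box
    server 10.0.0.1:8001;
    server 10.0.0.1:8002;
    server 10.0.0.2:8001;
    server 10.0.0.2:8002;
}

server {
    listen 80;
    location / {
        proxy_pass http://tornado_backends;
    }
}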
As an aside, you might check out ngx_postgres or ngx_drizzle for async
db access from Tornado (they expose the database over HTTP, so you can
use Tornado's async httpclient).
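A minimal sketch of the ngx_postgres style, along the lines of its
README (the connection details and query are made up for illustration);
Tornado would then hit /db with its async httpclient:

upstream database {
    postgres_server 127.0.0.1 dbname=app user=app password=secret;
}

server {
    location /db {
        postgres_pass  database;
        postgres_query "SELECT id, name FROM items";
    }
}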
Cliff
Kevin, thanks for your reply… I've turned off keepalive because the app
is a mobile app with very simple js and css, so there is very little
reason to have keepalive. I'll try turning it back on to test, though.
Cheers!
Hey Cliff, thanks for the reply. I mentioned in the second post to this
thread that I have a total of 30 workers, 15 on each machine… there
are 4 CPUs on each machine… the extra processes are to pick up any
slack from blocking DB access.
I have indeed used the IOLoop blocking debug… coming here really is a
last resort for me! The IOLoop debugging showed some areas I could
improve: the DB access is obviously unavoidable, but there were also
some CPU-intensive spots, which I fixed. To avoid too many DB accesses
I'm using an async memcached driver. Now I'm in a situation where the
IOLoop debugging issues hardly any messages, the CPU usage is fairly
low, and I'm hardly touching the db!
That’s why I’m here … unfortunately I’ve covered the things you’ve
mentioned.
D
Those .js and .css files are good enough reason to have keepalive turned on.
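If you re-enable it for the test, it's just one directive in the http
block (65s is a common example value, not a tuned recommendation):

keepalive_timeout 65;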
Regarding your original issue: how much time does it take to generate a
single response from Tornado? I'm asking because you said you've got 30
blocking workers, which means that if a single response takes around
100ms, you can handle only about 300 req/s.
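Spelled out: max throughput = workers / time per response = 30 / 0.1 s
= 300 req/s, which is right around the ~320 req/s ceiling you're seeing.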
Most workers aren't blocking… they only block when they hit the db,
and we are doing lots of caching. At its peak MySQL is registering 140
requests per second. I've added more workers, which has had no effect on
the capacity going through nginx… so either the DB itself is causing
problems (unlikely, since it's such a simple schema with no joins) or
something is up with the nginx config.