Nginx as Load Balancer Connection Issues

luislavena · January 6, 2012, 10:50pm

We have a box running nginx and two boxes running apache. The apache
boxes are configured as an upstream for nginx.

The nginx box has a public IP, and then it talks to the upstream apaches
using the private network (same switch). We are sustaining a couple
hundred requests/sec.

We’ve had several issues with the upstreams being counted out by nginx,
causing the “no live upstreams” message in the error log and end users
seeing 502 errors. When this happens the machines are barely being
used, with single-digit load averages on 16-core boxes.

Initially we were seeing a ton of “connect() failed (110: Connection
timed out)”, 1 every couple seconds. I added these to sysctl.conf and
that seemed to solve the problem:

net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 20480
net.core.netdev_max_backlog = 4096
net.ipv4.tcp_max_tw_buckets = 400000
net.core.somaxconn = 4096

Now things generally run fine, but every once in a while we get a huge
burst of “upstream prematurely closed connection while reading response
header from upstream” followed by a “no live upstreams”. Again, no
apparent load on the machines involved. These bursts only last a minute
or so. We also still get an occasional “connect() failed (110:
Connection timed out)” but they are far less frequent, perhaps 1 or 2
per hour.
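
For context on how an upstream gets counted out: nginx marks a server as unavailable after max_fails failed attempts within fail_timeout, and logs “no live upstreams” when every server in the group is unavailable at the same time. The fragment below is only a sketch of where those knobs live. The group name apache_backend, the private addresses, and the thresholds are hypothetical and do not come from the configuration in this thread.

```nginx
# Hypothetical upstream group; name, addresses, and thresholds are illustrative only.
upstream apache_backend {
    # A server is taken out of rotation for fail_timeout once max_fails
    # attempts have failed within fail_timeout.
    # nginx defaults are max_fails=1 and fail_timeout=10s.
    server 10.0.0.11:80 max_fails=3 fail_timeout=10s;
    server 10.0.0.12:80 max_fails=3 fail_timeout=10s;
}
```

With the default of max_fails=1, a single connect timeout is enough to sideline a server for ten seconds, so a short burst of errors across both backends could plausibly produce the pattern described above. Whether raising the threshold is appropriate depends on the site.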

Anyone have recommendations for tuning the networking side to improve
the situation here? These are some of the nginx.conf settings we have
in place, removed the ones that don’t seem related to the issue:

worker_processes 4;
worker_rlimit_nofile 30000;

events {
    worker_connections 4096;
    # multi_accept on;
    use epoll;
}

http {
    client_max_body_size 200m;

    proxy_read_timeout 600s;
    proxy_send_timeout 600s;
    proxy_connect_timeout 60s;

    proxy_buffer_size 128k;
    proxy_buffers 4 128k;

    keepalive_timeout  0;
    tcp_nodelay        on;
}

Happy to provide any other details. This is the “ulimit -a” on all
boxes:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 20
file size (blocks, -f) unlimited
pending signals (-i) 16382
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 300000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Posted at Nginx Forum:

gtuhl · January 24, 2012, 12:01am

On looking at this again recently, we made two adjustments that
eliminated the connection issues completely:

net.nf_conntrack_max = 262144
net.ipv4.ip_local_port_range = 1024 65000

After making those two changes things became quite stable. However, we
still have massive numbers of TIME_WAIT connections both on the nginx
machine and on the upstream apache machines.

The nginx machine is accepting roughly 1000 requests/s, and has 40,000
connections in TIME_WAIT.
The apache machines are each accepting roughly 250 requests/s, and have
15,000 connections in TIME_WAIT.

We tried setting net.ipv4.tcp_tw_reuse to 1 and restarting networking.
That did not cause any trouble, but also didn’t drop the TIME_WAIT
count. I have read that net.ipv4.tcp_tw_recycle is dangerous but we may
try that if others have had good experiences.

Is there a way to have these cleaned up more quickly? My concern is
that even with the expanded ip_local_port_range 40k is cutting it rather
close. Before we bumped ip_local_port_range the whole system was
falling down right as the TIME_WAIT count approached 32k. Is it normal
for nginx to cause this many TIME_WAIT connections? If we’re only doing
1k requests/s and nearly exhausting the available port range what would
sites with heavier volume do?

gtuhl · January 24, 2012, 7:00pm

net.ipv4.tcp_tw_recycle = 1

is what you're looking for

gtuhl · January 24, 2012, 7:23pm

Andrey Korolyov Wrote:

This may cause trouble if multiple clients are trying to reach the
server over the same NAT, so be careful. I have had a negative
experience even at ~10 HTTP reqs/min from a NAT machine.

This is what I had read everywhere as well, so I’ve been hesitant to try
it. We definitely have a lot of users that would be coming at our
servers from the same building/NAT.

Has anyone tried using “net.ipv4.tcp_tw_reuse = 1” in a larger
connection count environment before?

I have it enabled now, but it did not seem to have any impact on the
number of TIME_WAIT connections. Does it wait until it actually needs
to reuse one (due to port exhaustion) before doing so? Or should it be
keeping the number lower?

gtuhl · January 24, 2012, 7:14pm

On Tue, Jan 24, 2012 at 9:59 PM, ggrensteiner wrote:

net.ipv4.tcp_tw_recycle = 1

is what you're looking for

This may cause trouble if multiple clients are trying to reach the
server over the same NAT, so be careful. I have had a negative
experience even at ~10 HTTP reqs/min from a NAT machine.

gtuhl · January 26, 2012, 12:15am

Have you tried using HTTP 1.1 keepalive connections from nginx to
apache? They became available in 1.1.4 and will re-use sockets rather
than closing them and leaving them in TIME_WAIT.

Be sure to remember to turn on keepalive in your apache config as well.

http://nginx.org/en/docs/http/ngx_http_upstream_module.html
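
A minimal sketch of what that setup might look like, assuming nginx 1.1.4 or later. The upstream name apache_backend, the private addresses, and the pool size of 32 are placeholders rather than values from this thread, and the fragment belongs inside the http block of nginx.conf.

```nginx
# Hypothetical upstream group; name, addresses, and pool size are placeholders.
upstream apache_backend {
    server 10.0.0.11:80;
    server 10.0.0.12:80;

    # Maximum number of idle keepalive connections to upstream servers
    # kept in the cache of each worker process.
    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://apache_backend;

        # Upstream keepalive requires HTTP/1.1 to the backend, and the
        # "Connection: close" header nginx sends by default must be cleared.
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

The keepalive value caps idle connections per worker and does not limit the total number of connections to the backends, so a small number is usually a reasonable starting point. On the apache side, KeepAlive On has to be set for the connections to actually persist.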

gtuhl · March 20, 2012, 10:34pm

I’m thinking about giving the development version with the upstream
keepalive over http 1.1 a try.

Are people using that version in production? Is there a release
schedule/estimate anywhere that indicates when that feature might
trickle over to stable?

We’re using nginx heavily in a pretty vanilla load balancer role -
upstream of apache servers, ssl termination in nginx, that’s it in terms
of features we are using.

It’s worked fantastically well overall, we’re just flirting with an
ephemeral port limit on a few of our sites (have worked around by
setting up multiple A records pointed at multiple nginx pairs). If we
could get keepalive connections between nginx and the upstream apaches I
believe we would be in very good shape and could keep our configuration
simple moving forward.

gtuhl · January 26, 2012, 12:22am

Out of curiosity why would it keep it in TIME_WAIT if it is closing the
connection?

gtuhl · March 20, 2012, 10:42pm

On Tue, Mar 20, 2012 at 11:33 PM, gtuhl wrote:

I’m thinking about giving the development version with the upstream
keepalive over http 1.1 a try.

Are people using that version in production? Is there a release
schedule/estimate anywhere that indicates when that feature might
trickle over to stable?

According to their roadmap – in 6 days
http://trac.nginx.org/nginx/roadmap

gtuhl · March 20, 2012, 10:47pm

On Thu, Jan 26, 2012 at 7:21 AM, Rami E. wrote:

Out of curiosity why would it keep it in TIME_WAIT if it is closing the
connection?

+1. Also if the connection is closed, why is the upstream (apache) in
TIME_WAIT also?

gtuhl · March 21, 2012, 5:59pm

Alexandr G. Wrote:

According to their roadmap – in 6 days
http://trac.nginx.org/nginx/roadmap

This is excellent news. Also apologies for somehow missing this page,
was exactly what I was looking for.

gtuhl · May 1, 2012, 3:26am

Initial testing with 1.2.0 and 1.1 keepalive to upstreams has our
ephemeral port usage down from 38,000 to 220 on a canned test run. This
is a big deal, we can use nginx for reverse proxy on far busier sites
now.

Anyone put this under heavy usage in production yet?

New release seems to be working brilliantly, good work to all involved.

gtuhl · May 1, 2012, 7:50am

On May 1, 2012, at 5:26 , gtuhl wrote:

Initial testing with 1.2.0 and 1.1 keepalive to upstreams has our
ephemeral port usage down from 38,000 to 220 on a canned test run. This
is a big deal, we can use nginx for reverse proxy on far busier sites
now.

Anyone put this under heavy usage in production yet?

Yes.

Somewhere from 1.1.4 or so.

gtuhl · March 28, 2012, 4:27pm

Looks like that was for the 1.1.18 development release. Is this what
will become the 1.2.0 stable in a couple weeks? Seems I’ll need to wait
for that one to get http 1.1 keepalive upstreams in stable.
