Error handling of "Connection refused" conditions

I am using nginx to distribute http load across three upstream
application servers. It actually works really well in normal usage, but
when we restart one of the upstream servers for maintenance (such as for
an OS update), nginx tends to hang incoming requests for some time (30
seconds or more). When a node is brought offline, the nginx error logs
show a few messages like this:

upstream prematurely closed connection while reading response header
from upstream

Followed by many messages like this (these last basically as long as the
node is offline):

kevent() reported that connect() failed (61: Connection refused) while
connecting to upstream

This seems OK to me, since the upstream server did go away. What should
I expect to happen to incoming http requests when this occurs? Shouldn’t
nginx route requests to the remaining application servers, making the
outage invisible to users? If it’s expected that some requests hang when
this happens, which timeout could be adjusted to minimize that duration?

Thanks,

  • .Dustin


You should probably post the relevant parts of your configuration, since
there are quite a few parameters for tuning nginx toward better
responsiveness.

To name a few: proxy_connect_timeout, proxy_read_timeout and
proxy_send_timeout, which all default to 60 seconds (which could explain
your “hanging” requests). Lowering those lets ‘proxy_next_upstream’ (by
default: error and timeout) kick in sooner, so backend changes/restarts
are seamless and don’t really affect the end-users so much.
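
For example, something along these lines (the ‘backend’ name is a
placeholder for your own upstream block, and the values are only
illustrative; tune them to your network and application):

location / {
    proxy_pass            http://backend;
    # give up on an unreachable backend quickly (default: 60s)
    proxy_connect_timeout 3s;
    proxy_send_timeout    10s;
    proxy_read_timeout    30s;
    # error and timeout are already the default, shown here explicitly
    proxy_next_upstream   error timeout;
}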

rr

This is how an affected service is configured. We have our
proxy_read_timeout set high, because some requests can legitimately take
that long to return. All other timeouts are left at their defaults,
presumably 60 seconds.

upstream pool012 {
    server 172.16.1.223:80;
    server 172.16.1.224:80;
    server 172.16.1.225:80;
    server 172.16.1.226:80;
    server 172.16.1.227:80;
}

keepalive_timeout 65;

server {
    listen      172.16.6.103:80;
    server_name application012.domain.com;

    error_log   logs/application012.error.log;

    location / {
        proxy_pass          http://pool012;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_redirect      off;
        proxy_read_timeout  930s;
    }
}

Is it considered good practice to reduce proxy_connect_timeout to some
small value if responsiveness during an outage is desired?

  • .Dustin


It depends on the nature of the (backend) software and the network
infrastructure/latency: the closer the servers are (network-wise) and
the higher the throughput of the application, the lower you can/should
set it.

There is also a general observation/standard, Apdex (Application
Performance Index), that the average web user starts to become
“unsatisfied” if a request takes more than 1-2 seconds, so taking the
default 60 seconds (in case of some backend failure) to deliver the
content makes no sense: the user will sooner hit refresh, close the
browser (or go to another site), or panic than wait that long for
anything to load.

On the other hand, making nginx immediately close the connection and
switch to another backend isn’t always the best solution, especially if
the backends do some long-running computations and/or accept only a
limited number of connections. In that case quickly cycling through all
the backends won’t serve a good response at all, which matters if you
have an extra caching layer above (that needs to fetch a valid object:
html/image etc.) or some sort of transactions (like payment systems)
which simply take time, where each request is important.
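
For a location like that you might, for instance, disable failover
entirely so a request is never replayed against a second backend (the
/checkout/ path here is made up for illustration):

location /checkout/ {
    proxy_pass          http://pool012;
    # never retry on another backend: replaying a payment request
    # is worse than returning an error to the client
    proxy_next_upstream off;
}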

I usually set it to 2-3 seconds.
That way I can also tell whether the backends perform well enough, and
react if a new bottleneck arises: the application code or the
underlying DBs/filesystems starting to get too slow.
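
Applied to your configuration above, that would look something like this
(3 seconds is just a starting point; raise it if your network latency is
higher):

location / {
    proxy_pass          http://pool012;
    proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
    proxy_redirect      off;
    # long reads are legitimate here, so this stays high
    proxy_read_timeout  930s;
    # but give up on a dead/unreachable backend quickly, so
    # proxy_next_upstream can move on to the next server
    proxy_connect_timeout 3s;
}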

rr