Upstream max_fails, fail_timeout and proxy_read_timeout

We’re using nginx as a load balancer and we’re seeing some strange behaviour when one of our backend servers takes a long time to respond to a request. We have a configuration like this:

upstream handlehttp {
    ip_hash;
    server XXX max_fails=3 fail_timeout=30s;
    server YYY max_fails=3 fail_timeout=30s;
}

server {
    location / {
        try_files $uri @backend;
    }

    location @backend {
        proxy_pass http://handlehttp;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_read_timeout 300;
    }
}

What we thought we had configured was:
If one backend server fails more than 3 times within 30 seconds, it would be considered disabled and all requests sent to the other backend server (the original server getting requests again after 30 seconds).
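
For reference, this is how we read the documented semantics of those parameters; a minimal annotated sketch of the same upstream block (the addresses are placeholders):

upstream handlehttp {
    ip_hash;
    # max_fails=3: the server is marked unavailable after 3 unsuccessful
    # attempts within the fail_timeout window.
    # fail_timeout=30s: both the window in which failures are counted and
    # the time the server is then considered unavailable.
    server XXX max_fails=3 fail_timeout=30s;
    server YYY max_fails=3 fail_timeout=30s;
}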

What we’re actually seeing is that if a request takes 300+ seconds, the backend is immediately set as disabled and all further requests are sent to the other backend…
Are we missing something or is this the correct behaviour for nginx?

Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,232912,232912#msg-232912

Hello!

On Fri, Nov 16, 2012 at 09:15:01AM -0500, pliljenberg wrote:

> server {
>     [...]
>         proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
>         proxy_read_timeout 300;
>     }
> }

> What we thought we had configured was:
> If one backend server fails more than 3 times within 30 seconds, it would
> be considered disabled and all requests sent to the other backend server
> (the original server getting requests again after 30 seconds).

This is what’s expected. Note, though, that after the problem is detected, things are handled a bit differently; see below.

> What we’re actually seeing is that if a request takes 300+ seconds, the
> backend is immediately set as disabled and all further requests are sent to
> the other backend…
> Are we missing something or is this the correct behaviour for nginx?

Are you looking at the normally working backend server, or a
server which was already considered down?

Note that as of nginx 1.1.6 at least one request per worker has to succeed before “3 times within 30 seconds” starts to apply again:

*) Change: if a server in an upstream failed, only one request will be
   sent to it after fail_timeout; the server will be considered alive if
   it will successfully respond to the request.


Maxim D.
http://nginx.com/support.html

Thanks for the reply.

>> What we’re actually seeing is that if a request takes 300+ seconds, the
>> backend is immediately set as disabled and all further requests are sent to
>> the other backend…
>> Are we missing something or is this the correct behaviour for nginx?
>
> Are you looking at the normally working backend server, or a
> server which was already considered down?

One server X receives a request which takes 300+ seconds to complete. That request gets dropped by nginx due to the read timeout (as expected). When this happens, server X is disabled and all upcoming requests are sent to server Y instead.
My interpretation of the configuration was that server X would still get requests, since it only had 1 failure (not the 3 as configured) during the last 30 seconds?

Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,232912,232917#msg-232917

Hello!

On Fri, Nov 16, 2012 at 10:54:51AM -0500, pliljenberg wrote:

>> server which was already considered down?
>
> One server X receives a request which takes 300+ seconds to complete. That
> request gets dropped by nginx due to the read timeout (as expected).
> When this happens, server X is disabled and all upcoming requests are sent
> to server Y instead.
> My interpretation of the configuration was that server X would still get
> requests, since it only had 1 failure (not the 3 as configured) during the
> last 30 seconds?

The interesting part is what happens before “one server X receives a request…”. Is it working normally and handling other requests? Or was it already considered dead and the request in question is the one to check if it’s alive?

To illustrate, here is what happens with a normally working server (one server on port 9999 is dead, and one at 8080 is responding normally, fail_timeout=30s, max_fails=3, ip_hash, nginx just started):

2012/11/16 20:23:29 [debug] 35083#0: *1 connect to 127.0.0.1:9999, fd:17 #2
2012/11/16 20:23:29 [debug] 35083#0: *1 connect to 127.0.0.1:8080, fd:17 #3
2012/11/16 20:23:29 [debug] 35083#0: *5 connect to 127.0.0.1:9999, fd:17 #6
2012/11/16 20:23:29 [debug] 35083#0: *5 connect to 127.0.0.1:8080, fd:17 #7
2012/11/16 20:23:30 [debug] 35083#0: *9 connect to 127.0.0.1:9999, fd:17 #10
2012/11/16 20:23:30 [debug] 35083#0: *9 connect to 127.0.0.1:8080, fd:17 #11
2012/11/16 20:23:31 [debug] 35083#0: *13 connect to 127.0.0.1:8080, fd:17 #14
2012/11/16 20:23:31 [debug] 35083#0: *16 connect to 127.0.0.1:8080, fd:17 #17
2012/11/16 20:23:32 [debug] 35083#0: *19 connect to 127.0.0.1:8080, fd:17 #20
2012/11/16 20:23:33 [debug] 35083#0: *22 connect to 127.0.0.1:8080, fd:17 #23
2012/11/16 20:23:34 [debug] 35083#0: *25 connect to 127.0.0.1:8080, fd:17 #26
2012/11/16 20:23:34 [debug] 35083#0: *28 connect to 127.0.0.1:8080, fd:17 #29
2012/11/16 20:23:35 [debug] 35083#0: *31 connect to 127.0.0.1:8080, fd:17 #32

As you can see, the first 3 requests try to reach port 9999 - because of max_fails=3.

On the other hand, once fail_timeout=30s has passed, only one request tries to reach 9999:

2012/11/16 20:24:37 [debug] 35083#0: *34 connect to 127.0.0.1:9999, fd:16 #35
2012/11/16 20:24:37 [debug] 35083#0: *34 connect to 127.0.0.1:8080, fd:16 #36
2012/11/16 20:24:38 [debug] 35083#0: *38 connect to 127.0.0.1:8080, fd:16 #39
2012/11/16 20:24:39 [debug] 35083#0: *41 connect to 127.0.0.1:8080, fd:16 #42

That’s because the situations of a “normally working server” and a “dead server we are trying to use again” are a bit different.
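
A trace like the one above can be obtained by enabling debug logging; a minimal sketch (the log path is a placeholder, and it assumes nginx was built with --with-debug):

# in the main configuration context; requires a debug-enabled nginx build
error_log /var/log/nginx/debug.log debug;

events {
    # optionally limit debug output to connections from specific addresses
    debug_connection 127.0.0.1;
}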


Maxim D.
http://nginx.com/support.html

The requests to server X before that (for more than 30 sec) are ok; this is the first request generating a 500 response (from the timeout).
So up until this point all looks good - which is why I don’t understand why nginx considers the server inactive after the first failure :)

Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,232912,232919#msg-232919

> Normally timeouts result in a 504, and if you see a 500 this might
> indicate that the request in fact failed not due to a timeout, but
> e.g. due to a loop being detected. This in turn might mean that there
> was more than one request to server X which failed.
>
> Try looking into error_log to see what’s going on.

You’re correct - it’s a 504.
[16/Nov/2012:12:40:48 +0100] "POST /url HTTP/1.1" 403 454 Time: 300.030 Upstream-time: 300.004, 0.003 Upstream: XXX, YYY Upstream-status: 504, 403
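
For what it’s worth, a log_format along these lines would produce that kind of line (a sketch using standard nginx variables; the format name and exact layout used here are assumptions):

log_format upstreamlog '[$time_local] "$request" $status $body_bytes_sent '
                       'Time: $request_time Upstream-time: $upstream_response_time '
                       'Upstream: $upstream_addr Upstream-status: $upstream_status';

access_log /var/log/nginx/access.log upstreamlog;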

Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,232912,232930#msg-232930

Hello!

On Fri, Nov 16, 2012 at 11:42:54AM -0500, pliljenberg wrote:

> The requests to server X before that (for more than 30 sec) are ok; this is
> the first request generating a 500 response (from the timeout).
> So up until this point all looks good - which is why I don’t understand why
> nginx considers the server inactive after the first failure :)

500 response?

Normally timeouts result in a 504, and if you see a 500 this might
indicate that the request in fact failed not due to a timeout, but
e.g. due to a loop being detected. This in turn might mean that there
was more than one request to server X which failed.

Try looking into error_log to see what’s going on.
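
As an aside: if the intent is that a single failed or timed-out request should never take a backend out of rotation, the failure accounting can be disabled per server with max_fails=0. A minimal sketch (same placeholder addresses as above), not a recommendation for this particular setup:

upstream handlehttp {
    ip_hash;
    # max_fails=0 disables the accounting of failed attempts entirely,
    # so a timed-out request will not mark the server unavailable.
    server XXX max_fails=0;
    server YYY max_fails=0;
}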


Maxim D.
http://nginx.com/support.html