Upstream max_fails/fail_timeout logic?

Hello, I’ve set up an HTTP proxy to a couple of other servers and am
using max_fails and fail_timeout, in addition to a proxy_read_timeout,
to force failover in case of a read timeout. It seems to work fine, but
I have a few questions.
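
For reference, here’s a stripped-down sketch of the kind of setup I
mean (the addresses, names, and timeout values are placeholders, not
my real config):

upstream backends {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=10s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=10s;
}

server {
    listen 80;
    location / {
        proxy_pass http://backends;
        # A slow read counts as a failure, so nginx fails over.
        proxy_read_timeout 5s;
        proxy_next_upstream error timeout;
    }
}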

  1. I don’t totally understand the logic. I can tell that if a server
    times out max_fails times, it sits out for the rest of the fail_timeout
    period and then comes back into rotation. But after that it seems to
    need only a single failure (i.e. not a full set of max_fails) to be
    removed from consideration again. Yet if it then stays healthy for a
    while, it once again takes max_fails failures to be removed. How does
    this logic work exactly?

  2. Is the fact that an upstream server is taken down (in this temporary
    fashion) logged somewhere? I.e. some file where it just says “server hit
    max fails” or something?

  3. Extending 2), is there any way to “hook” into that server failure?
    I.e. if the server fails, is there a way with nginx to execute some sort
    of a program (either internal or external)?

Thanks for any help! I’ve been reading the documentation, but I get
lost at times, so if it’s written there and I’m just being an idiot,
please tell me to RTFM (with a link if possible please :) ).

Also, I forgot to mention: I’m using the community version on Linux Mint:

$ nginx -v
nginx version: nginx/1.4.6 (Ubuntu)

$ uname -a
Linux mint-carbon 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8
09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Cheers,
Thomas

Hello!

On Sat, Jan 30, 2016 at 05:31:14PM +0100, Thomas Nyberg wrote:

  1. …if it then stays healthy for a while, it once again takes max_fails
    failures to be removed. How does this logic work exactly?

After fail_timeout, one request will be passed to the server in
question. The server is considered again alive if the request
succeeds. If the request fails, nginx will wait for fail_timeout
again.

Note that this is actually consistent with the max_fails counting
logic as well: failures are counted not within a sliding window of
fail_timeout, but within a “session” that times out fail_timeout
after the last failure. That is, fail_timeout defines the minimal
interval between failures for nginx to forget about previous
failures.

E.g., with max_fails=5 fail_timeout=10s, if a server fails one
request every 5 seconds, it will be considered down after 5
failures, which will have happened over the previous 20 seconds.
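
For instance, an upstream block matching that example might look
like this (the backend addresses are placeholders):

upstream backends {
    # Skipped once 5 failures accumulate with no more than 10s
    # between consecutive failures; then skipped for 10s before
    # a single probe request is allowed through.
    server 10.0.0.1:8080 max_fails=5 fail_timeout=10s;
    server 10.0.0.2:8080 max_fails=5 fail_timeout=10s;
}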

  2. Is the fact that an upstream server is taken down (in this temporary
    fashion) logged somewhere? I.e. some file where it just says “server hit max
    fails” or something?

In recent versions (1.9.1+) the “upstream server temporarily
disabled” warning will be logged to the error log.
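
Assuming a typical error_log setup (the path below is a common
default, not universal), a log level of warn or a more verbose one
is enough to capture it:

error_log /var/log/nginx/error.log warn;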

  3. Extending 2), is there any way to “hook” into that server failure? I.e.
    if the server fails, is there a way with nginx to execute some sort of a
    program (either internal or external)?

No (except by monitoring logs).

Note well that the “down” state is kept per worker process (unless
you are using an upstream zone to share state between worker
processes), and this also complicates things.
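
For example, a sketch of sharing that state with a shared memory
zone (the “zone” directive is available in the open source version
since 1.9.0; the zone name and size here are illustrative):

upstream backends {
    # All worker processes share failure counts via this zone.
    zone backends_zone 64k;
    server 10.0.0.1:8080 max_fails=5 fail_timeout=10s;
    server 10.0.0.2:8080 max_fails=5 fail_timeout=10s;
}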

In general it’s a good idea to monitor backends separately, and not
to expect nginx to do anything if a backend fails.


Maxim D.
http://nginx.org/