Question about failure and fail-over

Rafa_F · July 18, 2013, 1:11pm

Hi all, I have a general question about server failure and failover
within an upstream group to ensure I understand it correctly.

Let's say I have the configuration:

proxy_next_upstream timeout;
proxy_connect_timeout 5;
…
# "backend" is a placeholder name; an upstream block must be named
upstream backend {
    server 127.0.0.1 max_fails=3 fail_timeout=10s;
    server 127.0.0.2 max_fails=3 fail_timeout=10s;
    server 127.0.0.3 max_fails=3 fail_timeout=10s;
}

And then the server 127.0.0.1 starts “hanging” indefinitely on
connection attempts.

a) Once 3 connection attempts time out after 5 seconds on 127.0.0.1, it
will be marked down. However, during that 5 second timeout, it is
possible that 30, or N, connections / requests may be in the process of
timing out as well, so you may end up with 30 internal connection
failures as a result of 127.0.0.1’s issue. Although they are all
retried on the next available upstream, 30 end-users notice a 5
second hang in their request as a result of waiting for the timeout to
occur.

b) After 10 seconds, if the server is still hanging, a) basically
repeats in the same manner.

Is this correct? If I add “keepalive 64;” into the upstream block,
does the above scenario change? If a server is marked down as a result
of no new connections being able to connect, are all persistent
connections destroyed as well?

Any insight on this understanding would be appreciated.

Cheers,
Branden

Branden_V · July 18, 2013, 4:28pm

Hello!

On Thu, Jul 18, 2013 at 07:10:27AM -0400, Branden V. wrote:

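a) Once 3 connection attempts time out after 5 seconds on 127.0.0.1, it
will be marked down. However, during that 5 second timeout, it is
possible that 30, or N, connections / requests may be in the process of
timing out as well, so you may end up with 30 internal connection
failures as a result of 127.0.0.1’s issue. Although they are all
retried on the next available upstream, 30 end-users notice a 5
second hang in their request as a result of waiting for the timeout to
occur.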

Yep. Use the least_conn balancer to mitigate this kind of backend
problem; see the ngx_http_upstream_module documentation.
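
As a rough sketch of what that suggestion could look like for the upstream group from the question: least_conn goes inside the upstream block, ahead of the server lines. The name "backend" is only a placeholder, since the original config did not name the group, and the server parameters are copied from the question.

```nginx
# Sketch only: "backend" is a placeholder name, not taken from the original post.
upstream backend {
    # Prefer the server with the fewest active connections, so a hanging
    # server that accumulates pending connections is picked less often.
    least_conn;

    server 127.0.0.1 max_fails=3 fail_timeout=10s;
    server 127.0.0.2 max_fails=3 fail_timeout=10s;
    server 127.0.0.3 max_fails=3 fail_timeout=10s;
}
```

This does not shorten the 5 second connect timeout for requests that do reach the hanging server; it should only reduce how many requests are sent there while connections pile up.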

Additionally, it’s usually a good idea to make sure your backends
return RST on listen queue overflow. On most Linux systems the
default seems to be to just drop SYN packets on listen queue
overflow, which will result in an unbounded number of connections
waiting for a timeout. Changing
/proc/sys/net/ipv4/tcp_abort_on_overflow might be a good idea, see
here for details:

http://man7.org/linux/man-pages/man7/tcp.7.html

b) After 10 seconds, if the server is still hanging, a) basically
repeats in the same manner.

No. As of 1.1.6+, only a single request will be routed to the
server after fail_timeout. The server will be considered up only
if it is able to respond to this request.

Is this correct? If I add “keepalive 64;” into the upstream block,
does the above scenario change? If a server is marked down as a result
of no new connections being able to connect, are all persistent
connections destroyed as well?

Balancing doesn’t know anything about cached connections. If a
server is marked down, no attempts to use cached connections to
the server will be made, and eventually all connections to the
server will be replaced with connections to other servers, as per
LRU algorithm.
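
For reference, a minimal sketch of how “keepalive 64;” might be wired up alongside the settings from the question. The upstream name "backend", the listen port, and the location are assumptions made for illustration. For HTTP proxying, the nginx documentation also recommends setting proxy_http_version to 1.1 and clearing the Connection header so that upstream connections can actually be kept alive.

```nginx
# Sketch only: "backend", the listen port and the location are placeholders.
upstream backend {
    server 127.0.0.1 max_fails=3 fail_timeout=10s;
    server 127.0.0.2 max_fails=3 fail_timeout=10s;
    server 127.0.0.3 max_fails=3 fail_timeout=10s;

    # Cache up to 64 idle connections to upstream servers per worker process.
    keepalive 64;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;

        # Needed for upstream keepalive connections with HTTP.
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Settings from the original question.
        proxy_next_upstream timeout;
        proxy_connect_timeout 5;
    }
}
```

The cache size is a per-worker limit on idle connections, not a cap on the total number of connections to the upstream servers.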

–
Maxim D.
http://nginx.org/en/donation.html