Forum: NGINX Question about failure and fail-over

Branden Visser (Guest)
on 2013-07-18 13:11
(Received via mailing list)
Hi all, I have a general question about server failure and failover
within an upstream group to ensure I understand it correctly.

Let's say I have the configuration:

proxy_next_upstream timeout;
proxy_connect_timeout 5;
...
upstream backend {  # "backend" is a placeholder name; the original post omitted it
  server 127.0.0.1 max_fails=3 fail_timeout=10s;
  server 127.0.0.2 max_fails=3 fail_timeout=10s;
  server 127.0.0.3 max_fails=3 fail_timeout=10s;
}

And then the server 127.0.0.1 starts "hanging" indefinitely on
connection attempts.

a) Once 3 connection attempts time out after 5 seconds on 127.0.0.1, it
will be marked down. However, during that 5 second timeout, it is
possible that 30, or N, connections / requests may be in the process of
timing out as well, so you may end up with 30 internal connection
failures as a result of 127.0.0.1's issue. Although they are all
retried on the next available upstream, 30 end users notice a 5
second hang in their request as a result of waiting for the timeout to
occur.

b) After 10 seconds, if the server is still hanging, a) basically
repeats in the same manner.

Is this correct? If I add "keepalive 64;" into the upstream block,
does the above scenario change? If a server is marked down as a result
of no new connections being able to connect, are all persistent
connections destroyed as well?

Any insight on this understanding would be appreciated.

Cheers,
Branden
Maxim Dounin (Guest)
on 2013-07-18 16:28
(Received via mailing list)
Hello!

On Thu, Jul 18, 2013 at 07:10:27AM -0400, Branden Visser wrote:

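> a) Once 3 connection attempts time out after 5 seconds on 127.0.0.1, it
> will be marked down. However, during that 5 second timeout, it is
> possible that 30, or N, connections / requests may be in the process of
> timing out as well, so you may end up with 30 internal connection
> failures as a result of 127.0.0.1's issue. Although they are all
> retried on the next available upstream, 30 end users notice a 5
> second hang in their request as a result of waiting for the timeout to
> occur.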

Yep.  Use the least_conn balancer to mitigate this kind of backend
problem, see http://nginx.org/r/least_conn.
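
For illustration, a minimal sketch of the upstream block from the
question with least_conn enabled.  The upstream name "backend" is a
placeholder (the original post did not name the group), and the block
is assumed to sit inside the http context:

```nginx
upstream backend {
    # Send each new request to the server with the fewest active
    # connections instead of plain round-robin.
    least_conn;

    server 127.0.0.1 max_fails=3 fail_timeout=10s;
    server 127.0.0.2 max_fails=3 fail_timeout=10s;
    server 127.0.0.3 max_fails=3 fail_timeout=10s;
}
```

The idea is that a hanging server accumulates active connections
while they wait for the timeout, so it is picked less often than the
healthy servers and fewer requests end up stuck behind it.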

Additionally, it's usually a good idea to make sure your backends
return RST on listen queue overflow.  On most Linux systems the
default seems to be to just drop SYN packets on listen queue
overflow, which will result in an unbounded number of connections
waiting for a timeout.  Changing
/proc/sys/net/ipv4/tcp_abort_on_overflow might be a good idea, see
here for details:

http://man7.org/linux/man-pages/man7/tcp.7.html

> b) After 10 seconds, if the server is still hanging, a) basically
> repeats in the same manner.

No.  As of 1.1.6+, only a single request will be routed to the
server after fail_timeout.  The server will be considered up only
if it is able to respond to this request.

> Is this correct? If I add "keepalive 64;" into the upstream block,
> does the above scenario change? If a server is marked down as a result
> of no new connections being able to connect, are all persistent
> connections destroyed as well?

Balancing doesn't know anything about cached connections.  If a
server is marked down, no attempts to use cached connections to
the server will be made, and eventually all connections to the
server will be replaced with connections to other servers, as per
LRU algorithm.
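
As a rough sketch of the keepalive setup being asked about, assuming
an HTTP backend, an upstream group named "backend" (a placeholder,
not from the original post), and everything placed inside the http
context:

```nginx
upstream backend {
    server 127.0.0.1 max_fails=3 fail_timeout=10s;
    server 127.0.0.2 max_fails=3 fail_timeout=10s;
    server 127.0.0.3 max_fails=3 fail_timeout=10s;

    # Cache up to 64 idle connections to the upstream servers
    # in each worker process.
    keepalive 64;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;

        # Needed for keepalive connections to HTTP upstreams:
        # speak HTTP/1.1 and clear the Connection header.
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Settings from the original question.
        proxy_next_upstream timeout;
        proxy_connect_timeout 5;
    }
}
```

The max_fails / fail_timeout parameters still drive which server is
selected; the keepalive cache only holds idle connections for reuse
once a server has been chosen.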

--
Maxim Dounin
http://nginx.org/en/donation.html