We're using nginx as a load balancer and we're seeing some strange behaviour when one of our backend servers takes a long time to respond to a request.
We have a configuration like this:
upstream handlehttp {
    ip_hash;
    server XXX max_fails=3 fail_timeout=30s;
    server YYY max_fails=3 fail_timeout=30s;
}
server {
    location / {
        try_files $uri @backend;
    }
    location @backend {
        proxy_pass http://handlehttp;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_read_timeout 300;
    }
}
What we thought we had configured was:
If one backend server fails more than 3 times within 30 seconds, it would be considered disabled and all requests would be sent to the other backend server (with the original server getting requests again after 30 seconds).
What we're actually seeing is that if a request takes 300+ seconds, the backend is immediately set as disabled and all further requests are sent to the other backend...
Are we missing something or is this the correct behaviour for nginx?
Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,232912,232912#msg-232912
on 2012-11-16 15:15
on 2012-11-16 16:30
Hello!

On Fri, Nov 16, 2012 at 09:15:01AM -0500, pliljenberg wrote:

> server {
> http_503;
> proxy_read_timeout 300;
> }
> }
>
> What we thought we had configured was:
> If one backend server fails more than 3 times within 30 seconds it would
> be considered disabled and all requests sent to the other backend server
> (the original server getting request after 30 seconds again).

This is what's expected. Note though, that after the problem was detected things are handled a bit differently, see below.

> What we're actually seeing is that if a request takes 300+ seconds, the
> backend is immediately set as disabled and all further requests are sent to
> the other backend...
> Are we missing something or is this the correct behaviour for nginx?

Are you looking at the normally working backend server, or a server which was already considered down?

Note that after nginx 1.1.6 at least one request per worker has to succeed before "3 times within 30 seconds" will start to apply again:

*) Change: if a server in an upstream failed, only one request will be sent to it after fail_timeout; the server will be considered alive if it will successfully respond to the request.

--
Maxim Dounin
http://nginx.com/support.html
on 2012-11-16 16:55
Thanks for the reply.

>> What we're actually seeing is that if a request takes 300+ seconds, the
>> backend is immediately set as disabled and all further requests are sent to
>> the other backend...
>> Are we missing something or is this the correct behaviour for nginx?

> Are you looking at the normally working backend server, or a
> server which was already considered down?

One server, X, receives a request which takes 300+ seconds to complete. That request gets dropped by nginx due to the read timeout (as expected).
When this happens, server X is disabled and all upcoming requests are sent to server Y instead.
My interpretation of the configuration was that server X would still get requests, since it only had 1 failure (and not 3 as configured) during the last 30 seconds?

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,232912,232917#msg-232917
on 2012-11-16 17:33
Hello!

On Fri, Nov 16, 2012 at 10:54:51AM -0500, pliljenberg wrote:

> >server which was already considered down?
>
> One server X receives a request which takes 300+ seconds to complete . That
> request gets dropped by nginx due to the read timeout (as expected).
> When this happens the server X is disabled and all upcoming request are sent
> to server Y instead.
> My interpretation of the configuration was that the server X would still get
> requests since it only had 1 failure (and it 3 as configured) during the
> last 30 seconds?

The interesting part is what happens _before_ "one server X receives a request...". Is it working normally and handling other requests? Or was it already considered dead, and the request in question is the one sent to check whether it's alive?

To illustrate, here is what happens with a normally working server (one server on port 9999 is dead, and one at 8080 is responding normally, fail_timeout=30s, max_fails=3, ip_hash, just started nginx):

2012/11/16 20:23:29 [debug] 35083#0: *1 connect to 127.0.0.1:9999, fd:17 #2
2012/11/16 20:23:29 [debug] 35083#0: *1 connect to 127.0.0.1:8080, fd:17 #3
2012/11/16 20:23:29 [debug] 35083#0: *5 connect to 127.0.0.1:9999, fd:17 #6
2012/11/16 20:23:29 [debug] 35083#0: *5 connect to 127.0.0.1:8080, fd:17 #7
2012/11/16 20:23:30 [debug] 35083#0: *9 connect to 127.0.0.1:9999, fd:17 #10
2012/11/16 20:23:30 [debug] 35083#0: *9 connect to 127.0.0.1:8080, fd:17 #11
2012/11/16 20:23:31 [debug] 35083#0: *13 connect to 127.0.0.1:8080, fd:17 #14
2012/11/16 20:23:31 [debug] 35083#0: *16 connect to 127.0.0.1:8080, fd:17 #17
2012/11/16 20:23:32 [debug] 35083#0: *19 connect to 127.0.0.1:8080, fd:17 #20
2012/11/16 20:23:33 [debug] 35083#0: *22 connect to 127.0.0.1:8080, fd:17 #23
2012/11/16 20:23:34 [debug] 35083#0: *25 connect to 127.0.0.1:8080, fd:17 #26
2012/11/16 20:23:34 [debug] 35083#0: *28 connect to 127.0.0.1:8080, fd:17 #29
2012/11/16 20:23:35 [debug] 35083#0: *31 connect to 127.0.0.1:8080, fd:17 #32

As you can see, the first 3 requests try to reach port 9999 - because of max_fails=3. On the other hand, once fail_timeout=30s passes, only one request tries to reach 9999:

2012/11/16 20:24:37 [debug] 35083#0: *34 connect to 127.0.0.1:9999, fd:16 #35
2012/11/16 20:24:37 [debug] 35083#0: *34 connect to 127.0.0.1:8080, fd:16 #36
2012/11/16 20:24:38 [debug] 35083#0: *38 connect to 127.0.0.1:8080, fd:16 #39
2012/11/16 20:24:39 [debug] 35083#0: *41 connect to 127.0.0.1:8080, fd:16 #42

That's because the situations of "normal working server" and "dead server we are trying to use again" are a bit different.

--
Maxim Dounin
http://nginx.com/support.html
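For readers who want to reproduce log excerpts like the ones above, the described test setup (one dead server on port 9999, one live server on port 8080, ip_hash, max_fails=3, fail_timeout=30s) might look roughly like the sketch below. The upstream name test_backends, the listen port 8000, and the log path are assumptions for illustration and do not appear in the thread. The "connect to" lines are debug-level messages, so this assumes an nginx binary built with --with-debug.

```nginx
# Sketch of the described test setup; names, port and path are hypothetical.
# Debug-level logging requires nginx built with --with-debug.
error_log /var/log/nginx/error.log debug;

events {}

http {
    upstream test_backends {
        ip_hash;
        # Nothing listens on 9999, so every connection attempt fails.
        server 127.0.0.1:9999 max_fails=3 fail_timeout=30s;
        # A normally responding backend.
        server 127.0.0.1:8080 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 8000;  # hypothetical front-end port

        location / {
            proxy_pass http://test_backends;
        }
    }
}
```

Which server ip_hash picks first depends on the client address, so the dead server is only tried first when the client happens to hash to it.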
on 2012-11-16 17:43
The requests before (for more than 30 sec) to server X are OK; this is the first request generating a 500 response (from the timeout).
So up until this point all looks good - which is why I don't understand why nginx considers the server inactive after the first fail :)

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,232912,232919#msg-232919
on 2012-11-16 18:12
Hello!

On Fri, Nov 16, 2012 at 11:42:54AM -0500, pliljenberg wrote:

> The requests before (for more than 30sec) to the server X are ok, this is
> the first request generating a 500 response (from the timeout).
> So up until this point all looks good - which is why I don't understand why
> nginx considers the server inactive after the first fail :)

500 response? Normally timeouts result in 504, and if you see 500 this might indicate that the request in fact failed not due to a timeout, but e.g. due to a loop being detected. This in turn might mean that there was more than one request to server X which failed.

Try looking into error_log to see what's going on.

--
Maxim Dounin
http://nginx.com/support.html
on 2012-11-16 19:52
> Normally timeouts results in 504, and if you see 500 this might
> indicate that in fact request failed not due to a timeout, but
> e.g. due too loop detected. This in turn might mean that there
> were more than one request to the server X which failed.
>
> Try looking into error_log to see what's going on.

You're correct - it's a 504.

[16/Nov/2012:12:40:48 +0100] "POST /url HTTP/1.1" 403 454 Time: 300.030 Upstream-time: 300.004, 0.003 Upstream: XXX, YYY Upstream-status: 504, 403

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,232912,232930#msg-232930
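The access log line above lists two comma-separated values for the upstream fields, which is how nginx records a request that was tried on more than one upstream server: here XXX timed out with 504 after about 300 seconds, and the request was then passed to YYY, which answered 403. The poster's actual log_format is not shown in the thread; a format producing a similar line might look like the sketch below, placed in the http block. The format name upstream_debug, the log path, and the choice of $body_bytes_sent for the 454 field are assumptions.

```nginx
# Sketch only: the format name, log path and byte-count variable are assumptions.
log_format upstream_debug '[$time_local] "$request" $status $body_bytes_sent '
                          'Time: $request_time '
                          'Upstream-time: $upstream_response_time '
                          'Upstream: $upstream_addr '
                          'Upstream-status: $upstream_status';

access_log /var/log/nginx/access.log upstream_debug;
```

Logging $upstream_addr and $upstream_status side by side makes it possible to see which backend produced which status when proxy_next_upstream passes a request on.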