Hello, I’ve found that when some upstream servers in my upstream group
are unavailable, applying even a little load on the server (i.e., just
myself browsing around quickly, 2-3 req/s max) results in the
following errors even for upstream servers that are available and
healthy:
2013/04/04 22:02:21 [error] 4211#0: *2898 writev() failed (134:
Transport endpoint is not connected) while sending request to
upstream, client: 184.94.54.70, server: , request: "GET /api/ui/skin
HTTP/1.1", upstream: "http://10.112.5.119:2001/api/ui/skin", host:
"mysite.org", referrer: "http://mysite.org/search"
In this particular example I have 4 servers in the upstream group, 3
of which are shut down (all except 10.112.5.119). If I comment out the
3 downed upstream servers, I cannot reproduce this error.
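For context, the upstream block looks roughly like this (only
10.112.5.119:2001 appears in the log above; the other three addresses
are placeholders for the servers that are currently shut down):

    upstream tenantworkers {
        server 10.112.5.119:2001;
        server 10.112.5.120:2001;   # placeholder address, server shut down
        server 10.112.5.121:2001;   # placeholder address, server shut down
        server 10.112.5.122:2001;   # placeholder address, server shut down
    }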
Running SmartOS (Joyent cloud)
$ nginx -v
nginx version: nginx/1.3.14
These are things I tried, to no avail (a rough sketch of the relevant
config lines follows this list):
- I used to have keepalive 64 on the upstream; I removed it
- Nginx used to run as a non-privileged user; I switched it to root
  (prctl reports that privileged users should have 65,000 nofiles
  allowed)
- I used to have worker_processes set to 5; I increased it to 16
- The upstream server configuration used to not have max_fails or
  fail_timeout; I added those, trying to limit the number of times
  nginx tried to access the downed upstream servers
- I used to leave proxy_connect_timeout unspecified, so it defaulted
  to 60s; I tried setting it to 1s
- I tried commenting out all the rate-limiting directives
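Roughly, those attempts correspond to config changes along these lines
(the max_fails/fail_timeout values and the location block are only
illustrative, not my exact settings):

    worker_processes 16;

    upstream tenantworkers {
        # keepalive 64;    # removed
        server 10.112.5.119:2001 max_fails=3 fail_timeout=10s;   # illustrative values
        # ... the three downed servers carry the same parameters
    }

    location /api/ {
        proxy_pass http://tenantworkers;
        proxy_connect_timeout 1s;    # was unspecified (60s default)
    }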
The URLs I’m hitting in my tests all go to the "tenantworkers"
upstream.
Any ideas? I would suspect a resource limit issue or a problem with
the back-end server, but it just doesn’t make sense that everything is
fine once I comment out the downed upstreams. My concern is that the
system will crumble under real load when even one upstream becomes
unavailable.
Thanks,
Branden