Socket connection failures on 1.6.1~precise

I’m trying to track down an issue that shows up only when I run nginx
1.6.1-1~precise. My nodes running 1.6.0-1~precise do not have this
problem, but freshly created servers are getting floods of these socket
connection errors a couple of times a day:

    connect() to unix:/tmp/unicorn.sock failed (11: Resource temporarily
    unavailable) while connecting to upstream

The setup is nginx proxying requests to a unicorn socket backing a Ruby
app. As stated above, the error is NOT present on nodes running
1.6.0-1~precise, but any newly created node gets the newer
1.6.1-1~precise package installed and inevitably hits this error.
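
For context, the proxy side is the standard unicorn-behind-nginx
arrangement. A trimmed-down sketch of that kind of config (socket path
taken from the error above, everything else simplified and not the
literal production file):

    upstream unicorn {
        # unicorn listens on a unix domain socket shared with nginx
        server unix:/tmp/unicorn.sock fail_timeout=0;
    }

    server {
        listen 80;
        location / {
            # hand requests to the unicorn workers over the socket above
            proxy_pass http://unicorn;
        }
    }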

The settings on the nodes running 1.6.0 appear identical to those on
the newly created 1.6.1 nodes in terms of sysctl, nginx, and unicorn
configuration, and all package versions are the same except for nginx.
When I downgraded one of the newly created nodes to nginx 1.6.0 using
the nginx PPA (the “NGINX Stable” PPA from the Nginx team), the error
went away.

Does anyone have advice, a direction to look in, or experience with a
similar issue that might help me track this down?

Hello!

On Tue, Sep 02, 2014 at 11:00:10AM -0500, Jon Clayton wrote:

> the error is NOT present on nodes running 1.6.0-1~precise, but any
> newly created node gets the newer 1.6.1-1~precise package installed
> and inevitably hits this error.
>
> Does anyone have advice, a direction to look in, or experience with a
> similar issue that might help me track this down?

Just some information:

  • In nginx itself, the difference between 1.6.0 and 1.6.1 is fairly
    minimal. The only change affecting http is a single line added to
    the 400 Bad Request handling code
    (see http://hg.nginx.org/nginx/rev/b8188afb3bbb).

  • The message suggests that the backend’s backlog is full. This can
    easily happen on load spikes and/or if a backend is overloaded, and
    is usually unrelated to nginx itself; see the sketch after this
    list for how that backlog is sized.
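
To make “backlog” concrete: with unicorn the accept queue depth comes
from the :backlog option on the listen directive, which defaults to
1024. A minimal sketch, assuming a typical unicorn config file (path
and number are illustrative, not taken from this setup):

    # config/unicorn.rb (hypothetical path)
    # ask for a deeper accept queue than the 1024 default so brief spikes
    # queue up instead of failing nginx's connect() with EAGAIN (error 11)
    listen "/tmp/unicorn.sock", :backlog => 2048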


Maxim D.
http://nginx.org/

I did see that the changelog notes very few changes, and a diff of the
two versions shows only what you mentioned in the 400 Bad Request
handling code. I’m not necessarily saying nginx itself is the problem,
but it does seem like something changed enough to make the backend’s
backlog fill more rapidly.

That could be completely off base; I’ve been trying to find a way to
pin down exactly which backlog is filling up. But my test of
downgrading nginx back to 1.6.0 from the nginx PPA also pointed at a
change in nginx, since the errors did not return after the downgrade.
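
One thing I have been checking while trying to pin this down (assuming
these are stock precise kernels; the value in the last command is
illustrative):

    # kernel-wide cap on any listen() backlog, unix sockets included;
    # the old default of 128 silently clamps whatever unicorn asks for
    sysctl net.core.somaxconn
    # raising the cap lets a deeper unicorn :backlog actually take effect
    sudo sysctl -w net.core.somaxconn=2048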

It’s very possible that I’m barking up the wrong tree, but the fact
that changing nothing except the nginx version, from 1.6.1 back down to
1.6.0, eliminated the errors seems suspicious. I’ll keep digging, but
I’m open to any other suggestions.

Just closing the loop on this: what appeared to be happening was that
on newly created nodes the nginx master process was not starting up
with the custom ulimit set in /etc/security/limits.d/. The workers were
all fine, since worker_rlimit_nofile is set in nginx.conf, but a
separate issue was preventing the master process from inheriting the
custom ulimit.
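
A quick way to see the mismatch, for anyone hitting something similar
(default Debian/Ubuntu paths assumed; adjust for your layout):

    # file-descriptor limit the running master actually inherited at startup
    grep 'open files' /proc/$(cat /var/run/nginx.pid)/limits
    # the per-worker limit nginx sets itself; it does not apply to the master
    grep worker_rlimit_nofile /etc/nginx/nginx.conf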

Truth be told, I never quite nailed down an exact root cause beyond
ensuring that the nginx master process came up with the custom ulimit.
That would seem to indicate something was spiking the number of open
files held by the master process, but I can look into that separately.