Nginx + fair load balancer patch looping

An Nginx instance suddenly started spewing the following to its error
log at a rate of about 1GB/minute, and using a bit more CPU than
usual:

2008/02/29 10:33:47 [error] 16875#0: *126309461 upstream prematurely
closed connection while reading response header from upstream […]

Aside from the excessive logging, everything else seemed normal. Our
next_upstream setting is:

proxy_next_upstream error invalid_header;
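
For context, that directive sits in a setup roughly like the following
(the upstream name, ports and location block are illustrative, not our
actual config):

upstream backends {                # illustrative name
    fair;                          # fair load balancer patch
    server 127.0.0.1:11003;        # example backend ports
    server 127.0.0.1:11004;
}

server {
    location / {
        proxy_pass http://backends;
        # pass the request to the next backend on errors or invalid headers
        proxy_next_upstream error invalid_header;
    }
}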

Restarting Nginx fixed the problem.

Could this be the fair load balancer going haywire?

Alexander.

On Fri, Feb 29, 2008 at 10:39:24AM +0100, Alexander S. wrote:

proxy_next_upstream error invalid_header;

Restarting Nginx fixed the problem.

Could this be the fair load balancer going haywire?

nginx logs this when the upstream has closed the connection before
sending anything, or before it has sent the full HTTP response header.

This may be an nginx bug, an upstream failure, or a kernel bug.

Could you truss the failed process next time?

On Feb 29, 2008, at 11:39 AM, Alexander S. wrote:

proxy_next_upstream error invalid_header;

Restarting Nginx fixed the problem.

Could this be the fair load balancer going haywire?

Alexander.

At Engine Y. we are seeing this happen as well, usually when a
backend goes down or a ‘clock skew’ message is logged. It gets logged
hundreds of thousands of times for each request.

Joshua Sierles
Engine Y.

On Fri, Feb 29, 2008 at 12:51:33PM +0200, Joshua Sierles wrote:

Aside from the excessive logging, everything else seemed normal. Our

At Engine Y. we are seeing this happen as well, usually when a
backend goes down or a ‘clock skew’ message is logged. It gets logged
hundreds of thousands of times for each request.

Oh, I had missed the “fair load balancer patch” part.
No, I’m not able to debug that bug.

On Fri, Feb 29, 2008 at 12:51:33PM +0200, Joshua Sierles wrote:

On Feb 29, 2008, at 11:39 AM, Alexander S. wrote:

An Nginx instance suddenly started spewing the following to its error
log at a rate of about 1GB/minute, and using a bit more CPU than
usual:

Could this be the fair load balancer going haywire?

At Engine Y. we are seeing this happen as well, usually when a
backend goes down or a ‘clock skew’ message is logged. It gets logged
hundreds of thousands of times for each request.

Alexander, Joshua,

Please send me anything you know about this problem, especially how to
reproduce it. I’d appreciate if you sent me your config files (privately
and possibly anonymised, if you prefer).

Best regards,
Grzegorz N.

On Fri, Mar 14, 2008 at 12:29:41AM +0100, Alexander S. wrote:

Unfortunately I have no idea how to reproduce it. If the problem rears
its head again, is there anything I should do in terms of probing the
running Nginx process? Would a few minutes of running strace help, for
example?

If it doesn’t kill your I/O, it might be useful (as an insight into WTF
the fair balancer is choosing). I think that the most useful would be
capturing debug_http logs (only those with [upstream_fair] should be
enough), but that generates a truly massive amount of data.
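
For reference, capturing those means running an nginx built with
--with-debug and raising the error_log level, roughly like this (the
log path is only an example):

error_log /var/log/nginx/error-debug.log debug;

You can then grep that file for the [upstream_fair] lines afterwards.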

A ‘strace -e connect’ should contain some useful data too, without
(hopefully) saturating the disk write bandwidth.

You could try the latest snapshot (or just apply
http://git.localdomain.pl/?p=nginx.git;a=commitdiff;h=e75e0a9c8cbad9b8f43848becbcc4d425cba3eae
by hand, it’s a one-liner). It should fix the problem, at least
regarding dead backends. I really have no idea how a clock skew comes
into play here.

Best regards,
Grzegorz N.

On 3/14/08, Grzegorz N. [email protected] wrote:

If it doesn’t kill your I/O, it might be useful (as an insight into WTF
the fair balancer is choosing). I think that the most useful would be
capturing debug_http logs (only those with [upstream_fair] should be
enough), but that generates a truly massive amount of data.

Well, it happened again. What is interesting – and I am pretty sure
this is what happened before – is that the errant process is actually
a worker shutting down:

root      4746  0.0  0.0   18316    1356 ?  Ss  Mar25    0:00 nginx: master process /usr/sbin/nginx
www-data  4301 56.6 34.4 6026976 2818648 ?  R   Mar27 1127:20  \_ nginx: worker process is shutting down
www-data 17604  0.7  0.0   18984    2132 ?  S   06:27    4:10  \_ nginx: worker process
www-data 17605  0.7  0.0   19048    2228 ?  S   06:27    4:12  \_ nginx: worker process
www-data 17606  0.7  0.0   18824    2012 ?  S   06:27    3:58  \_ nginx: worker process
www-data 17607  0.7  0.0   18788    1960 ?  S   06:27    4:10  \_ nginx: worker process

I don’t know why it’s shutting down, though. It could be the log
rotation job that has poked it.

Running strace -e connect on this process yields an infinite sequence
of the following two lines:

connect(3, {sa_family=AF_INET, sin_port=htons(11003), sin_addr=inet_addr("…")}, 16) = -1 EINPROGRESS (Operation now in progress)
connect(4, {sa_family=AF_INET, sin_port=htons(11003), sin_addr=inet_addr("…")}, 16) = -1 EINPROGRESS (Operation now in progress)

where the address being connected to is one of the back ends. I also
ran a full strace, and I can send you the output privately if you
like.

I have not tried the latest snapshot yet. We are still running the one
from February 12th or so.

Alexander.

On 2/29/08, Grzegorz N. [email protected] wrote:

At Engine Y. we are seeing this happen as well, usually when a
backend goes down or a ‘clock skew’ message is logged. It gets logged
hundreds of thousands of times for each request.

Alexander, Joshua,

Please send me anything you know about this problem, especially how to
reproduce it. I’d appreciate if you sent me your config files (privately
and possibly anonymised, if you prefer).

Unfortunately I have no idea how to reproduce it. If the problem rears
its head again, is there anything I should do in terms of probing the
running Nginx process? Would a few minutes of running strace help, for
example?

Alexander.

Hi,

On Fri, Mar 28, 2008 at 03:39:54PM +0100, Alexander S. wrote:

On 3/14/08, Grzegorz N. [email protected] wrote:

If it doesn’t kill your I/O, it might be useful (as an insight into WTF
the fair balancer is choosing). I think that the most useful would be
capturing debug_http logs (only those with [upstream_fair] should be
enough), but that generates a truly massive amount of data.

where the address being connected to is one of the back ends. I also
ran a full strace, and I can send you the output privately if you
like.

Yes, please do.

I have not tried the latest snapshot yet. We are still running the one
from February 12th or so.

Please upgrade then; hopefully the problem is fixed now (at least I
haven’t heard of any new reports).

Best regards,
Grzegorz N.

On Fri, Mar 28, 2008 at 04:28:01PM +0100, Alexander S. wrote:

Please upgrade then; hopefully the problem is fixed now (at least I
haven’t heard of any new reports).

I will, thanks.

Looking at the strace (desperately trying to connect to a malfunctioning
backend – it closes the connection without sending a response) it
certainly looks like the bug I fixed in the last commit.

Here’s the patch, for reference:

Best regards,
Grzegorz N.

On 3/28/08, Grzegorz N. [email protected] wrote:

Here’s the patch, for reference:
Only choose peers without failures in the fair balancer fast path (gnosek/nginx-upstream-fair@93d5f64 on GitHub)

Thanks for the fast response. I will tell our sysadmins to upgrade. :)

Alexander.

On 3/28/08, Grzegorz N. [email protected] wrote:

Looking at the strace (desperately trying to connect to a malfunctioning
backend – it closes the connection without sending a response) it
certainly looks like the bug I fixed in the last commit.

Unfortunately, we are experiencing the problem even with the latest
snapshot. What can we do?

Alexander.

On 3/28/08, Grzegorz N. [email protected] wrote:

I have not tried the latest snapshot yet. We are still running the one
from February 12th or so.

Please upgrade then; hopefully the problem is fixed now (at least I
haven’t heard of any new reports).

I will, thanks.

Alexander.

On Wed, Apr 16, 2008 at 11:17:09PM +0200, Alexander S. wrote:

On 3/28/08, Grzegorz N. [email protected] wrote:

Looking at the strace (desperately trying to connect to a malfunctioning
backend – it closes the connection without sending a response) it
certainly looks like the bug I fixed in the last commit.

Unfortunately, we are experiencing the problem even with the latest
snapshot. What can we do?

That’s a good question, unfortunately ;)

I could use the debug_http output but that’s an insane amount of data.
One thing to try would be increasing the priority of all upstream_fair
messages, e.g. by running

sed -i -e 's/ngx_log_debug[0-9]*(NGX_LOG_DEBUG_HTTP/ngx_log_error(NGX_LOG_EMERG/g' \
    ngx_http_upstream_fair_module.c

(not tested, but that’s the basic idea).

I can cook a patch if you aren’t comfortable messing with C code. This
will cause all debug messages to get logged at emerg level, so you’ll be
able to collect them without enabling debug_http.

The upstream_fair module is pretty chatty too (several lines per
request) so the logs can grow but hopefully the amount of data will be
manageable.

Best regards,
Grzegorz N.

On Thu, Apr 17, 2008 at 04:17:05PM +0200, Grzegorz N. wrote:

ngx_http_upstream_fair_module.c

You could also collect a core dump of the failing process and send it to
me off-list together with a copy of your nginx binary:

gdb -p <pid>
gcore

Best regards,
Grzegorz N.

Alexander S. wrote:

Well, it happened again. What is interesting – and I am pretty sure
this is what happened before – is that the errant process is actually
a worker shutting down:

I ran into this problem also; same error about upstream mongrels and
rapidly increasing log file size. nginx was using as much CPU as it
could.

In our case, the particular upstream mongrel that nginx was trying to
contact had died. A few minutes after restarting the mongrel, the
problem stopped.

-Andy

Hi,

On Thu, Jun 05, 2008 at 08:33:36 +0200, Andy Verprauskus wrote:

I ran into this problem also; same error about upstream mongrels and
rapidly increasing log file size. nginx was using as much CPU as it
could.

In our case, the particular upstream mongrel that nginx was trying to
contact had died. A few minutes after restarting the mongrel, the
problem stopped.

Could you please contact me off-list (root&localdomain,pl)? I’d like to
know more about it because a backend simply dying should not confuse the
load balancer (it actually did at one time though, so please make sure
you’re running the latest snapshot).

Also, as a totally wild guess (both for you and Alexander), could you
try increasing the shared memory segment size? I.e. try putting

upstream_fair_shm_size 64k;

inside your http { } section. If the fair balancer runs out of shared
memory, it logs a message to that effect but it may have easily been
lost in the torrent of later messages.
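
For clarity, that would sit in the configuration roughly like this (the
upstream block below is purely illustrative; keep your existing one):

http {
    # fair balancer directive; 64k is only a starting point, raise it if needed
    upstream_fair_shm_size 64k;

    upstream backends {            # illustrative, use your real backends
        fair;
        server 127.0.0.1:11003;
        server 127.0.0.1:11004;
    }

    # ... the rest of your http configuration ...
}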

Best regards,
Grzegorz N.