Question about http_stub_status

Hi,

we have been using the “Stub Status Module” for a couple of days now, and
there is a strange effect in the values reported by this module:

For the first few hours the values for “Active connections” and “Waiting”
were reasonably low (a few hundred).
After a while, both of these values suddenly increased to ~5000, then a
few hours later to ~20000, then to ~30000.
It doesn’t matter whether we rotate our logfiles (with SIGHUP): the
values stay high, at around ~30000.

What do these values actually stand for? What is happening there? BTW:
netstat et al. report a few hundred to a few thousand concurrent (i.e.
ESTABLISHED) connections (plus some in FIN_WAIT etc.).

Maybe someone can explain this behaviour?

Thanks in advance,

Marcus

Hello!

On Fri, Aug 22, 2008 at 01:51:12PM +0200, Marcus Bianchy wrote:

> we have been using the “Stub Status Module” for a couple of days now, and
> there is a strange effect in the values reported by this module:
>
> For the first few hours the values for “Active connections” and “Waiting”
> were reasonably low (a few hundred).
> After a while, both of these values suddenly increased to ~5000, then a
> few hours later to ~20000, then to ~30000.
> It doesn’t matter whether we rotate our logfiles (with SIGHUP): the
> values stay high, at around ~30000.

JFYI: you should rotate logfiles with SIGUSR1, not with SIGHUP. On
SIGHUP nginx reloads its configuration, while on SIGUSR1 it does exactly
the one thing needed to rotate logfiles: it reopens them.
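
A minimal rotation sequence then looks roughly like this (the pid file
and log paths below are just examples, adjust them to your setup):

    # rename the logs; nginx keeps writing to the renamed files
    mv /var/log/nginx/access.log /var/log/nginx/access.log.0
    mv /var/log/nginx/error.log  /var/log/nginx/error.log.0
    # ask the master process to reopen the log files
    kill -USR1 `cat /var/run/nginx.pid`
    sleep 1    # give the workers a moment to reopen the files
    gzip /var/log/nginx/access.log.0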

> What do these values actually stand for? What is happening there? BTW:
> netstat et al. report a few hundred to a few thousand concurrent (i.e.
> ESTABLISHED) connections (plus some in FIN_WAIT etc.).
>
> Maybe someone can explain this behaviour?

Sudden jumps in the stub_status numbers usually correspond to a dead or
killed nginx worker. Take a look at the logs.
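
For reference, the stub_status page reports something like the following
(the numbers are purely illustrative): “Active connections” counts all
current client connections, “Reading”/“Writing” are connections whose
request or response is currently being processed, and “Waiting” are idle
keep-alive connections.

    Active connections: 291
    server accepts handled requests
     16630948 16630948 31070465
    Reading: 6 Writing: 179 Waiting: 106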

Maxim D.

On Fri, Aug 22, 2008 at 01:51:12PM +0200, Marcus Bianchy wrote:

> we have been using the “Stub Status Module” for a couple of days now, and
> there is a strange effect in the values reported by this module:
>
> For the first few hours the values for “Active connections” and “Waiting”
> were reasonably low (a few hundred).
> After a while, both of these values suddenly increased to ~5000, then a
> few hours later to ~20000, then to ~30000.
> It doesn’t matter whether we rotate our logfiles (with SIGHUP): the
> values stay high, at around ~30000.

nginx has a special log rotation signal: SIGUSR1.

> What do these values actually stand for? What is happening there? BTW:
> netstat et al. report a few hundred to a few thousand concurrent (i.e.
> ESTABLISHED) connections (plus some in FIN_WAIT etc.).
>
> Maybe someone can explain this behaviour?

This means that either someone killed the nginx workers using
SIGTERM/INT/KILL, or the workers exited abnormally. Could you run

grep alert error_log

?

On Sat, Aug 23, 2008 at 01:18:07PM +0200, Marcus Bianchy wrote:

> 2008/08/22 10:09:42 [notice] 28631#0: signal 17 (SIGCHLD) received
> 2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
> 2008/08/22 12:59:36 [notice] 28631#0: signal 29 (SIGIO) received
>
> The logrotate runs at 04:00 in the morning, which would explain the
> SIGCHLD/SIGFPE at 04:02:18. But the real problem is the signals at
> around 1 pm; neither the access.log nor the error.log gives any hint
> about what produces this behaviour. And guess what: yesterday at 1 pm
> the values for active/waiting connections increased to ~30000/35000.
>
> Maybe it’s a good idea to enable core dumps, so we can reproduce exactly
> what causes these signals?

It seems that you have “max_fails=0” in some upstream block.
Maxim’s recent patch fixes the bug, or you may try nginx-0.7.12.
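
For illustration, the kind of upstream block meant here would look
something like the following (the addresses are made up); the bug is
triggered when such a peer is freed and the round-robin code divides by
max_fails, which is zero:

    upstream backend {
        # made-up addresses; the relevant part is max_fails=0, which the
        # affected round-robin code later uses as a divisor when freeing
        # the peer, crashing the worker with SIGFPE
        server 10.0.0.1:8080 max_fails=0;
        server 10.0.0.2:8080 max_fails=0;
    }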

Hi Igor,

thank you for the fast answer.

Igor S. wrote:

> nginx has a special log rotation signal: SIGUSR1.

OK, I will fix that. Until now I just called our /etc/init.d script
(which sends a SIGHUP) after logrotate. It’s a better idea to send a
USR1 from logrotate directly to the master process. Or, even better:
change the init script to do that.
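
Something like this in the logrotate config should do it (the pid file
and log paths are just how I’d guess our setup looks, still to be
verified):

    /var/log/nginx/*.log {
        daily
        rotate 14
        compress
        delaycompress
        sharedscripts
        postrotate
            # tell the nginx master process to reopen its log files
            [ -f /var/run/nginx.pid ] && kill -USR1 `cat /var/run/nginx.pid`
        endscript
    }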

> This means that either someone killed the nginx workers using
> SIGTERM/INT/KILL, or the workers exited abnormally. Could you run
>
> grep alert error_log

Well, I can say that nobody on our team sends such signals around…
We have been observing strange signal 8 (SIGFPE) errors lately.
A typical “grep/zgrep signal” over our error.logs shows something
like this:

############ snip ############
2008/08/22 10:09:42 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 10:09:42 [alert] 28631#0: worker process 27809 exited on signal 8
2008/08/22 10:09:42 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 27810 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 32013 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:11 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:11 [alert] 28631#0: worker process 27811 exited on signal 8
2008/08/22 12:58:11 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:20 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:20 [alert] 28631#0: worker process 785 exited on signal 8
2008/08/22 12:58:20 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:59:36 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:59:36 [alert] 28631#0: worker process 1342 exited on signal 8
2008/08/22 12:59:36 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 13:00:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 13:00:06 [alert] 28631#0: worker process 1343 exited on signal 8
2008/08/22 13:00:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/23 04:02:18 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/23 04:02:18 [alert] 28631#0: worker process 1344 exited on signal 8
2008/08/23 04:02:18 [notice] 28631#0: signal 29 (SIGIO) received
############ snip ############

The logrotate runs at 04:00 in the morning, which would explain the
SIGCHLD/SIGFPE at 04:02:18. But the real problem is the signals at
around 1 pm; neither the access.log nor the error.log gives any hint
about what produces this behaviour. And guess what: yesterday at 1 pm
the values for active/waiting connections increased to ~30000/35000.

Maybe it’s a good idea to enable core dumps, so we can reproduce exactly
what causes these signals?
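
If we do, something like this in nginx.conf should be enough (the
directory is just an example; the worker user needs write permission
there):

    # nginx.conf, main context
    worker_rlimit_core  500m;
    working_directory   /var/tmp/nginx-cores/;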

Marcus Bianchy


On Tue, Aug 26, 2008 at 08:53:47PM +0200, Marcus Bianchy wrote:

> #0  0x08079df3 in ngx_http_upstream_free_round_robin_peer (pc=0x9da9cd8,
>     data=0x9daa3a0, state=4) at src/http/ngx_http_upstream_round_robin.c:648
> 648         peer->current_weight -= peer->weight / peer->max_fails;
>
> I will discuss further steps with my colleagues tomorrow. Maybe we will
> try Maxim’s patch, or nginx 0.7.13.

Yes, this is the same bug.

Igor S. wrote:

> It seems that you have “max_fails=0” in some upstream block.
> Maxim’s recent patch fixes the bug, or you may try nginx-0.7.12.

Hi Igor,

thanks for the update.

Meanwhile I have two core dumps from the FPE. And guess what I found:

#0  0x08079df3 in ngx_http_upstream_free_round_robin_peer (pc=0x9da9cd8,
    data=0x9daa3a0, state=4) at src/http/ngx_http_upstream_round_robin.c:648
648         peer->current_weight -= peer->weight / peer->max_fails;
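
So with max_fails=0 that line performs an integer division by zero,
which the CPU reports as SIGFPE and which kills the worker. A tiny
standalone C example (not nginx code, just an illustration with made-up
values) reproduces the effect:

    #include <stdio.h>

    /* Illustration only: an integer division by zero, as in
     * weight / max_fails with max_fails == 0, is delivered as SIGFPE
     * on Linux/x86, terminating the process like the workers above. */
    int main(void)
    {
        volatile int weight = 1;     /* stand-in for peer->weight    */
        volatile int max_fails = 0;  /* stand-in for peer->max_fails */

        printf("%d\n", weight / max_fails);  /* raises SIGFPE */
        return 0;
    }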

I will discuss further steps with my colleagues tomorrow. Maybe we will
try Maxim’s patch, or nginx 0.7.13.

Marcus Bianchy
