Nginx -s reload fails / workers in process-state H

Hello,

using nginx-0.8.40 on Linux 2.6.32 I’ve encountered a problem with
nginx reloading, which leaves [nginx] processes in process state “H”
behind. As soon as the first of these processes is left behind, nginx
is unable to reload the workers and is unable to stop gracefully.
(happens with 0.8.36 and 0.8.37 too)

Sending signals to the H-state processes won’t do anything, leaving no
other choice but to SIGKILL them.

This behavior is very seldom and I’ve tried to track it down while
having nginx logging with log level debug. It finally happened after
reloading nginx the 68th time.

The system was completely idle during the tests.

Note that I’ve first encountered this while using cron to reload the
nginx configuration every 5 minutes. I’m not entirely sure if this is
triggered by some kind of race-condition, where the last nginx reload
was not completely finished and the next one is issued on top.
I doubt it though: The debug log I’ve created shows the full process
list after a reload. AFAIK nginx changes some proctitles while in
reload-state, which I could not find anymore just seconds after issuing
the reload signal. In addition, the system was idle (no requests…), so
the old workers should fade in no time.

The command used by cron to reload nginx is:

nginx -t && nginx -s reload

Steps to reproduce it (after stopping Cron and nginx):

Step 1: Start up nginx as usual, eg. /etc/init.d/nginx start

Step 2: Run the following script to issue the reload signal

every 20 seconds:

------------------------------- snip ---------------------------------
#!/bin/bash

log=/var/log/nginx/error_log

echo “XXXXXXXXXX ------- SHOWING INITIAL NGINX PROCESSES -------
XXXXXXXXXX” >> “$log”
ps auxf 2>&1 | grep nginx >> “$log”
echo “XXXXXXXXXX ------- END OF INITIAL NGINX PROCESSES -------
XXXXXXXXXX” >> “$log”

i=1
while sleep 10
do
echo “XXXXXXXXXX ------- RELOAD $i ------- XXXXXXXXXX” >> “$log”

nginx -t && nginx -s reload
sleep 10

echo "XXXXXXXXXX ------- RELOAD $i complete - SHOWING NGINX

PROCESSES ------- XXXXXXXXXX" >> “$log”
ps auxf 2>&1 | grep nginx >> “$log”
echo “XXXXXXXXXX ------- END OF NGINX PROCESSES ------- XXXXXXXXXX”

“$log”

((i++))
done
------------------------------- snap --------------------------------

As far as I understand, the reload should be complete when the process
list is appended to the error_log 10 seconds after reloading nginx
and it does not contain any proctitles like “nginx: reloading
workers…”.

Step 3: watch -n1 ps auxf … and wait until a nginx process in

state “H” appears.

Step 4: CTRL-C the debug script, stop nginx and SIGKILL all “H”-state

processes.

Step 5: Examine the error_log.

This debug log is about 21M in size, gzipped down to 1M. I’ve uploaded
it here to not pollute the mailing list:

http://biz.baze.de/files/nginx-debug.log.gz

The first appearance of an “H”-state process is after reload 68 on line
number 300630. From there on, the nginx reload is broken and more of
these processes accumulate.

To find the next marker, just search for “XXXXXXXXXX”. Search for
" H " (leading and trailing space) to find the next process list with
such a process.

For the record, here is the process list after reload 68, with broken
process (pid 11488).

XXXXXXXXXX ------- RELOAD 68 complete - SHOWING NGINX PROCESSES -------
XXXXXXXXXX
root 30864 0.0 0.0 12780 1344 pts/9 S+ 13:27 0:00
_ /bin/bash ./nginx-debug.sh
root 11894 0.0 0.0 9800 860 pts/9 S+ 13:50
0:00 _ grep nginx
root 11895 0.0 0.0 12780 496 pts/9 S+ 13:50
0:00 _ /bin/bash ./nginx-debug.sh
root 30776 0.1 0.0 101668 11332 ? Ss 13:27 0:02 nginx:
master process /usr/sbin/nginx -c /etc/nginx/n…
nginx 11488 0.0 0.0 0 0 ? H 13:49 0:00 _
[nginx]
nginx 11835 0.0 0.0 101668 11876 ? S 13:50 0:00 _
nginx: worker process
nginx 11836 0.0 0.0 101668 12096 ? S 13:50 0:00 _
nginx: worker process
nginx 11837 0.0 0.0 101668 11008 ? S 13:50 0:00 _
nginx: cache manager process
nginx 11838 0.0 0.0 101668 10988 ? S 13:50 0:00 _
nginx: cache loader process
XXXXXXXXXX ------- END OF NGINX PROCESSES ------- XXXXXXXXXX

Here is the configuration:

------------------------------- snip --------------------------------
user nginx nginx;
daemon on;

worker_processes 2;

events {
worker_connections 5000;
use epoll;
}

[ debug | info | notice | warn | error | crit ]

error_log /var/log/nginx/error_log debug;

http {
include /etc/nginx/mime.types;
default_type application/octet-stream;

# Server config
server_tokens           on;
ignore_invalid_headers  on;

sendfile         on;
tcp_nopush       on;
tcp_nodelay      on;

index index.html index.htm;

# Autoindex
autoindex off;
autoindex_exact_size off;
autoindex_localtime on;

# GZIP configuration
gzip            on;
gzip_types      text/plain application/x-javascript text/xml
text/css; gzip_disable    "MSIE [1-6]\.(?!.*SV1)";

# Rate Limiting
limit_rate_after 20;
limit_rate 2m;

# File Cache
open_file_cache max=2000 inactive=60s;
open_file_cache_valid    60s;
open_file_cache_errors   on;

# Proxy Setup
proxy_redirect              off;
server_name_in_redirect     off;
proxy_max_temp_file_size    16m;
proxy_intercept_errors      off;
client_max_body_size        10m;
client_body_buffer_size     128k;

# Proxy Cache
proxy_buffering         on;
proxy_buffers           32 4k;

proxy_cache_path        /var/cache/nginx
                        levels=2:2:2
                        keys_zone=cache:64m
                        inactive=600
                        max_size=4096m;

proxy_cache_key         "$host$request_uri";
proxy_cache_valid       200 301 302 404  1m;

# Proxy Headers
proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;

# TIMEOUTS
client_header_timeout   30;
client_body_timeout     30;
send_timeout            30;

proxy_connect_timeout   10;
proxy_send_timeout      60;
proxy_read_timeout      60;

keepalive_timeout       0;

# DEFAULT SERVER
server {
    listen 80 default;
    server_name _;

    index index.html;
    root  /var/www;

    open_file_cache off;

}

include /etc/nginx/upstreams-php.d/*;
include /etc/nginx/upstreams-wsgi.d/*;
include /etc/nginx/vhosts.d/*;

}
------------------------------- snap --------------------------------

The included files are upstream{} and server{} blocks with valid
syntax. They do not enable/configure any additional module.

$ nginx -V
nginx version: nginx/0.8.40
configure arguments: --prefix=/usr --sbin-path=/usr/sbin/nginx
–conf-path=/etc/nginx/nginx.conf
–error-log-path=/var/log/nginx/error_log --pid-path=/var/run/nginx.pid
–lock-path=/var/lock/nginx.lock --user=nginx --group=nginx
–with-cc-opt=-I/usr/include --with-ld-opt=-L/usr/lib
–http-log-path=/var/log/nginx/access_log
–http-client-body-temp-path=/var/tmp/nginx/client
–http-proxy-temp-path=/var/tmp/nginx/proxy
–http-fastcgi-temp-path=/var/tmp/nginx/fastcgi
–with-http_realip_module --without-http_charset_module
–without-http_ssi_module --without-http_userid_module
–without-http_geo_module --without-http_map_module
–without-http_split_clients_module --without-http_referer_module
–without-http_fastcgi_module --without-http_memcached_module
–without-http_empty_gif_module --without-http_browser_module
–without-http_uwsgi_module --with-debug

Let me know if you need anything else.

Best regards,
John

On Wed, Jun 09, 2010 at 03:46:41PM +0200, John Feuerstein wrote:

I doubt it though: The debug log I’ve created shows the full process
Steps to reproduce it (after stopping Cron and nginx):

done

Step 4: CTRL-C the debug script, stop nginx and SIGKILL all “H”-state

number 300630. From there on, the nginx reload is broken and more of
XXXXXXXXXX
nginx 11835 0.0 0.0 101668 11876 ? S 13:50 0:00 _
nginx: worker process
nginx 11836 0.0 0.0 101668 12096 ? S 13:50 0:00 _
nginx: worker process
nginx 11837 0.0 0.0 101668 11008 ? S 13:50 0:00 _
nginx: cache manager process
nginx 11838 0.0 0.0 101668 10988 ? S 13:50 0:00 _
nginx: cache loader process
XXXXXXXXXX ------- END OF NGINX PROCESSES ------- XXXXXXXXXX

Thank you for the detailed report.
It seems that “H” statted is a rebranded “Z” state.
I see the following in the log:

Two workers exit:

2010/06/09 13:50:03 [notice] 11487#0: exit
2010/06/09 13:50:03 [notice] 11488#0: exit

A master received the only SIGCHLD signal:

2010/06/09 13:50:03 [notice] 30776#0: signal 17 (SIGCHLD) received
2010/06/09 13:50:03 [notice] 30776#0: worker process 11487 exited with
code 0
2010/06/09 13:50:03 [debug] 30776#0: wake up, sigio 0
2010/06/09 13:50:03 [debug] 30776#0: reap children
2010/06/09 13:50:03 [debug] 30776#0: child: 0 11835 e:0 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 1 11836 e:0 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 2 11837 e:0 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 3 11838 e:0 t:0 d:0 r:0 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 4 11487 e:1 t:1 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487
to:11835
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487
to:11836
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487
to:11837
2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487
to:11838

The master is even able to send a message without error to already
finished 11488:

2010/06/09 13:50:03 [debug] 30776#0: pass close channel s:4 pid:11487
to:11488
2010/06/09 13:50:03 [debug] 30776#0: child: 5 11488 e:1 t:0 d:0 r:1 j:0
2010/06/09 13:50:03 [debug] 30776#0: child: 6 -1 e:1 t:1 d:0 r:1 j:0

Then the master calls sigsuspend():

2010/06/09 13:50:03 [debug] 30776#0: sigsuspend

and waits forever: it did not receive 11488’s SIGCHLD to call waitpid()
and to escape 11488 from zombie state, it did not receive any SIGHUP
further.

I believe something is broken in Linux 2.6.32 signal delivery.


Igor S.
http://sysoev.ru/en/