Nginx 100% CPU, possibly an nginx bug

The function ngx_http_upstream_get_round_robin_peer in nginx can loop for a
long, unpredictable time when pc->tries is 0.

So, under what conditions does pc->tries become zero?

On its first execution, ngx_http_upstream_get_round_robin_peer can in some
cases return NGX_OK with pc->tries equal to 1.
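
One way this state can arise (an assumed illustration for the 4-server
upstream used here, not taken from a trace): the retry budget starts at the
number of configured peers, and each failed attempt within the same request
consumes one try, so the fourth selection can succeed with exactly one try
left.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed illustration only: 4 configured upstream servers, so the
         * retry budget starts at 4; each failed attempt within the same
         * request consumes one try. */
        unsigned npeers = 4;
        unsigned tries = npeers;
        unsigned attempt;

        for (attempt = 1; attempt <= 3; attempt++) {
            tries--;                       /* attempts 1-3 fail */
        }

        /* attempt 4 succeeds: selection returns NGX_OK with one try left */
        printf("NGX_OK with pc->tries == %u\n", tries);   /* prints 1 */
        return 0;
    }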

In ngx_http_upstream_process_header, the nginx process may enter the
following branch:

1524     if (n == NGX_ERROR || n == 0) {
1525         ngx_http_upstream_next(r, u, NGX_HTTP_UPSTREAM_FT_ERROR);
1526         return;
1527     }

It then enters the ngx_http_upstream_next function.

In ngx_http_upstream_next, the nginx process calls
ngx_http_upstream_free_round_robin_peer:

2796     if (ft_type != NGX_HTTP_UPSTREAM_FT_NOLIVE) {
2797         u->peer.free(&u->peer, u->peer.data, state);
2798     }

We are now inside ngx_http_upstream_free_round_robin_peer.

Pay attention to the following code in that function:

    if (pc->tries) {
        pc->tries--;
    }

After this executes, pc->tries drops from 1 to 0.
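
Just to make the arithmetic explicit, here is a tiny stand-alone sketch of
that guarded decrement (an assumed simplification, not the nginx struct):
entering free with tries == 1 leaves exactly 0 for the retry that follows.

    #include <assert.h>
    #include <stdio.h>

    /* Assumed simplification of the guarded decrement quoted above:
     * tries saturates at zero instead of wrapping. */
    static void free_peer(unsigned *tries)
    {
        if (*tries) {
            (*tries)--;
        }
    }

    int main(void)
    {
        unsigned tries = 1;   /* what get_round_robin_peer left us with */

        free_peer(&tries);    /* error path: the peer is released once */
        assert(tries == 0);   /* the retry that follows starts with 0 left */

        printf("tries after free: %u\n", tries);
        return 0;
    }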

The process then returns to ngx_http_upstream_next:

2892 ngx_http_upstream_connect(r, u);
2893 }

Here we enter ngx_http_upstream_connect.

This function calls:

1081 rc = ngx_event_connect_peer(&u->peer);

Execution finally reaches ngx_http_upstream_get_round_robin_peer again, and
with pc->tries already 0 the process loops there for an unpredictably long
time.
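
To put the whole chain together, here is a compact, self-contained
simulation of the sequence described above (all names are invented for
illustration; this is a simplified model, not the nginx source): the first
selection succeeds with one try left, the upstream read fails, the free()
callback consumes the last try, and the retry re-enters selection with a
zero budget, where a loop that only gives up by decrementing tries to zero
can no longer make progress.

    #include <stdint.h>
    #include <stdio.h>

    #define NPEERS 4

    /* Invented names, simplified model; not the nginx source. */
    typedef struct {
        unsigned   tries;            /* remaining attempts, like pc->tries */
        uintptr_t  tried;            /* bitmask of peers already attempted */
        unsigned   current;          /* round-robin cursor */
        unsigned   failed[NPEERS];   /* 1 = peer recently failed */
    } sim_peer_conn_t;

    /* A selection loop that exits only when it finds an untried, healthy
     * peer, or when decrementing `tries` reaches zero.  Entered with
     * tries == 0 and every peer tried or failed, neither exit can fire,
     * so it spins; the iteration cap exists only so this demo terminates. */
    static int sim_get_peer(sim_peer_conn_t *pc)
    {
        unsigned long spins = 0;

        for ( ;; ) {
            uintptr_t m = (uintptr_t) 1 << pc->current;

            if (!(pc->tried & m) && !pc->failed[pc->current]) {
                pc->tried |= m;
                return 0;                     /* NGX_OK */
            }

            if (pc->tries) {
                if (--pc->tries == 0) {
                    return -1;                /* give up: no live upstreams */
                }
            }
            /* tries was already 0: the give-up branch above never fires. */

            pc->current = (pc->current + 1) % NPEERS;

            if (++spins == 10000000UL) {      /* cap so the demo finishes */
                printf("still spinning after %lu iterations\n", spins);
                return -2;
            }
        }
    }

    /* The guarded decrement performed by the free() callback on failure. */
    static void sim_free_peer(sim_peer_conn_t *pc)
    {
        if (pc->tries) {
            pc->tries--;
        }
    }

    int main(void)
    {
        /* Peer 0 is still usable, the other three have recently failed. */
        sim_peer_conn_t pc = { .tries = 1, .failed = { 0, 1, 1, 1 } };

        sim_get_peer(&pc);     /* first selection: NGX_OK, tries still 1 */
        /* ... upstream read fails (n == NGX_ERROR || n == 0) ... */
        sim_free_peer(&pc);    /* free_round_robin_peer: tries 1 -> 0 */
        /* upstream_next -> upstream_connect -> peer selection again: */
        printf("second selection returned %d\n", sim_get_peer(&pc));
        return 0;
    }

Compiled and run, the second selection neither picks a peer nor gives up;
the iteration cap is the only thing that stops it.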

Posted at Nginx Forum:

We found that all nginx worker processes were at nearly 100% CPU.

We used tcpcopy to find this problem.

Posted at Nginx Forum:

We used gcore to generate a core file.

The backtrace is as follows:

#0 0x000000000046ae0b in ngx_http_upstream_get_round_robin_peer
(pc=0x1b822678, data=0x1b826e10)
at src/http/ngx_http_upstream_round_robin.c:482
#1 0x00000000004a6a60 in ngx_http_upstream_get_keepalive_peer
(pc=0x1b822678, data=0x1b826de0)
at
…/ngx_http_upstream_keepalive-c6396fef9295/ngx_http_upstream_keepalive_module.c:227
#2 0x0000000000429fab in ngx_event_connect_peer (pc=0x1b822678) at
src/event/ngx_event_connect.c:24
#3 0x000000000046219c in ngx_http_upstream_connect (r=0x1b821230,
u=0x1b822668) at src/http/ngx_http_upstream.c:1103
#4 0x0000000000465f61 in ngx_http_upstream_next (r=0x1b821230,
u=0x1b822668, ft_type=2)
at src/http/ngx_http_upstream.c:2900
#5 0x00000000004631c0 in ngx_http_upstream_process_header
(r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1547
#6 0x0000000000462d8c in ngx_http_upstream_send_request (r=0x1b821230,
u=0x1b822668) at src/http/ngx_http_upstream.c:1414
#7 0x00000000004625de in ngx_http_upstream_connect (r=0x1b821230,
u=0x1b822668) at src/http/ngx_http_upstream.c:1204
#8 0x000000000046129d in ngx_http_upstream_init_request (r=0x1b821230)
at src/http/ngx_http_upstream.c:631
#9 0x0000000000460a6d in ngx_http_upstream_init (r=0x1b821230) at
src/http/ngx_http_upstream.c:432
#10 0x000000000045782b in ngx_http_read_client_request_body
(r=0x1b821230, post_handler=0x460989 <ngx_http_upstream_init>)
at src/http/ngx_http_request_body.c:154
#11 0x00000000004a0875 in ngx_hmux_handler (r=0x1b821230) at
…/hmux/ngx_hmux_module.c:673
#12 0x000000000043ffd1 in ngx_http_core_content_phase (r=0x1b821230,
ph=0x1b5d4f88) at src/http/ngx_http_core_module.c:1365
#13 0x000000000043ebf5 in ngx_http_core_run_phases (r=0x1b821230) at
src/http/ngx_http_core_module.c:861
#14 0x000000000043eb71 in ngx_http_handler (r=0x1b821230) at
src/http/ngx_http_core_module.c:844
#15 0x000000000044c9a1 in ngx_http_process_request (r=0x1b821230) at
src/http/ngx_http_request.c:1665
#16 0x000000000044b34d in ngx_http_process_request_headers
(rev=0x2b3f10a3f4f0) at src/http/ngx_http_request.c:1111
#17 0x000000000044ab26 in ngx_http_process_request_line
(rev=0x2b3f10a3f4f0) at src/http/ngx_http_request.c:911
#18 0x0000000000449d7d in ngx_http_init_request (rev=0x2b3f10a3f4f0) at
src/http/ngx_http_request.c:518
#19 0x0000000000434ea7 in ngx_epoll_process_events (cycle=0x1b5ae090,
timer=1352, flags=1)
at src/event/modules/ngx_epoll_module.c:678
#20 0x0000000000426480 in ngx_process_events_and_timers
(cycle=0x1b5ae090) at src/event/ngx_event.c:245
#21 0x0000000000432d0c in ngx_worker_process_cycle (cycle=0x1b5ae090,
data=0x0) at src/os/unix/ngx_process_cycle.c:801
#22 0x000000000042faad in ngx_spawn_process (cycle=0x1b5ae090,
proc=0x432b60 <ngx_worker_process_cycle>, data=0x0,
name=0x57450e "worker process", respawn=-3) at
src/os/unix/ngx_process.c:196
#23 0x0000000000431c12 in ngx_start_worker_processes (cycle=0x1b5ae090,
n=8, type=-3) at src/os/unix/ngx_process_cycle.c:360
#24 0x00000000004312fe in ngx_master_process_cycle (cycle=0x1b5ae090) at
src/os/unix/ngx_process_cycle.c:136
#25 0x0000000000403c8c in main (argc=1, argv=0x7fffba823048) at
src/core/nginx.c:405
(gdb) info lo
rrp = 0x1b826e10
now = 1321860194
m = 8
rc = 461502464
i = 18446744061305903636

n = 0
c = 0x40
peer = 0x1b5cf2c0
peers = 0x1

(gdb) p *rrp
$5 = {peers = 0x1b5cf230, current = 3, tried = 0x1b826e28, data = 4}
(gdb) p rrp->peers->peer[0]
$6 = {sockaddr = 0x1b5c63b8, socklen = 16, name = {len = 19, data =
0x1b5c63c8 "xxx.xxx.xxx.xxx:36901server"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860193,
max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}
(gdb) p rrp->peers->peer[1]
$7 = {sockaddr = 0x1b5c6418, socklen = 16, name = {len = 19, data =
0x1b5c6428 "xxx.xxx.xxx.xxx:36902server"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860194,
max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}
(gdb) p rrp->peers->peer[2]
$8 = {sockaddr = 0x1b5c6478, socklen = 16, name = {len = 19, data =
0x1b5c6488 "xxx.xxx.xxx.xxx:36903server"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860194,
max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}
(gdb) p rrp->peers->peer[3]
$9 = {sockaddr = 0x1b5c64d8, socklen = 16, name = {len = 19, data =
0x1b5c64e8 "xxx.xxx.xxx.xxx:36904keepalive"},
current_weight = 0, weight = 1, fails = 1, accessed = 1321860193,
max_fails = 1, fail_timeout = 10, down = 0,
ssl_session = 0x0}

Posted at Nginx Forum:

nginx-0.8.53 has the same problem

Posted at Nginx Forum:

Hello!

On Tue, Nov 22, 2011 at 05:38:11AM -0500, wangbin579 wrote:

> 1524 if (n == NGX_ERROR || n == 0) {
> 2796 if (ft_type != NGX_HTTP_UPSTREAM_FT_NOLIVE) {
>
> After executing the above, pc->tries changes from 1 to 0.
>
> then the process returns to ngx_http_upstream_next
>
> 2892 ngx_http_upstream_connect(r, u);
> 2893 }

ngx_http_upstream_next() will finalize the request if tries == 0.
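Paraphrased (not a verbatim quote of the source), that check amounts to:

    /* paraphrase of the guard in ngx_http_upstream_next() */
    if (u->peer.tries == 0) {
        ngx_http_upstream_finalize_request(r, u, status);
        return;
    }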

You are probably hitting the problem with keepalive backend
connections in 1.0.x (where they aren't officially supported). It
is fixed by this changeset, which introduces official keepalive
backend connection support in 1.1.4:

http://trac.nginx.org/nginx/changeset/4116/nginx

Maxim D.

[root@208 sbin]# /home/nginx/sbin/nginx -V
nginx: nginx version: nginx/1.0.10
nginx: built by gcc 4.1.2 20080704 (Red Hat 4.1.2-48)
nginx: TLS SNI support enabled
nginx: configure arguments: --prefix=/home/nginx --user=nginx
--group=nginx --without-select_module --without-poll_module
--with-http_ssl_module --with-http_realip_module
--with-http_gzip_static_module --with-http_stub_status_module
--without-http_fastcgi_module --without-http_memcached_module
--without-http_empty_gif_module --with-pcre=…/pcre-7.9
--with-openssl=…/openssl-0.9.8r --add-module=…/hmux
--add-module=…/ngx_http_upstream_keepalive-c6396fef9295 --with-debug

Part of the nginx conf file:

events {
    worker_connections 8192;
    use epoll;
    epoll_events 4096;
    accept_mutex off;
}


upstream yyy {
    server xxx.xxx.xxx.xxx:36901;
    server xxx.xxx.xxx.xxx:36902;
    server xxx.xxx.xxx.xxx:36903;
    server xxx.xxx.xxx.xxx:36904;

    keepalive 1024;
}

Posted at Nginx Forum: