Cache manager process consumes a lot of CPU at reload and has to be restarted

Hello,

I had an issue last night with a 0.8.54 nginx instance. Around 1 AM, a HUP
signal was sent to it for log rotation. This nginx instance is set up as a
reverse proxy to an Apache/PHP backend and does some caching. Here is the
proxy_cache related configuration I've set up:

proxy_cache_path /CACHE levels=1:2:2 keys_zone=cache:256m inactive=1d max_size=20g;
proxy_cache cache;
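
For context, proxy_cache_path sits in the http block and proxy_cache in a
server or location block, roughly like this (the backend address below is
a placeholder, not the actual backend):

http {
    proxy_cache_path /CACHE levels=1:2:2 keys_zone=cache:256m inactive=1d
                     max_size=20g;

    server {
        listen 80;

        location / {
            proxy_pass  http://127.0.0.1:8080;   # Apache/PHP backend (placeholder)
            proxy_cache cache;
        }
    }
}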

Last night, the cache manager process restarted fine, but it then consumed
all the CPU available on the server. Moreover, sar indicates that it did not
seek, read or write on the disks (disk usage remained almost null). Our
operations team had to restart nginx 40 minutes later, which solved the
problem.

Has anyone already experienced such an issue? Do you know why this could
have happened?

Thank you
++ Jerome

PS: here is the related error_log:

2011/04/01 01:04:55 [notice] 25206#0: signal 1 (SIGHUP) received,
reconfiguring
2011/04/01 01:04:55 [notice] 25206#0: reconfiguring
2011/04/01 01:04:55 [notice] 25206#0: using the "epoll" event method
2011/04/01 01:04:55 [notice] 25206#0: start worker processes
2011/04/01 01:04:55 [notice] 25206#0: start worker process 10450
2011/04/01 01:04:55 [notice] 25206#0: start worker process 10451
2011/04/01 01:04:55 [notice] 25206#0: start worker process 10452
2011/04/01 01:04:55 [notice] 25206#0: start worker process 10453
2011/04/01 01:04:55 [notice] 25206#0: start worker process 10454
2011/04/01 01:04:55 [notice] 25206#0: start worker process 10455
2011/04/01 01:04:55 [notice] 25206#0: start worker process 10456
2011/04/01 01:04:55 [notice] 25206#0: start worker process 10457
2011/04/01 01:04:55 [notice] 25206#0: start worker process 10458
2011/04/01 01:04:55 [notice] 25206#0: start worker process 10459
2011/04/01 01:04:55 [notice] 25206#0: start cache manager process 10460
2011/04/01 01:04:55 [notice] 25206#0: signal 17 (SIGCHLD) received
2011/04/01 01:04:55 [notice] 25206#0: worker process 7164 exited with code 0
2011/04/01 01:04:55 [notice] 25206#0: worker process 7166 exited with code 0
2011/04/01 01:04:55 [notice] 25206#0: worker process 30018 exited with code 0
2011/04/01 01:04:55 [notice] 25206#0: signal 17 (SIGCHLD) received
2011/04/01 01:04:55 [notice] 25206#0: worker process 7168 exited with code 0
2011/04/01 01:04:55 [notice] 25206#0: worker process 7170 exited with code 0
2011/04/01 01:04:55 [notice] 25206#0: worker process 7171 exited with code 0
2011/04/01 01:04:56 [notice] 25206#0: signal 29 (SIGIO) received
2011/04/01 01:04:56 [notice] 25206#0: signal 29 (SIGIO) received
2011/04/01 01:05:26 [notice] 25206#0: signal 17 (SIGCHLD) received
2011/04/01 01:05:26 [notice] 25206#0: worker process 7165 exited with code 0
2011/04/01 01:05:26 [notice] 25206#0: signal 29 (SIGIO) received
2011/04/01 01:45:55 [notice] 25206#0: signal 15 (SIGTERM) received,
exiting
2011/04/01 01:45:55 [notice] 25206#0: signal 14 (SIGALRM) received
2011/04/01 01:45:55 [notice] 25206#0: signal 14 (SIGALRM) received

Hello!

On Fri, Apr 01, 2011 at 11:42:44AM +0200, Jérôme Loyet wrote:

Last night, the cache manager process restarted fine, but it then consumed
all the CPU available on the server. Moreover, sar indicates that it did not
seek, read or write on the disks (disk usage remained almost null). Our
operations team had to restart nginx 40 minutes later, which solved the
problem.

Has anyone already experienced such an issue? Do you know why this could
have happened?

A similar issue was discussed yesterday on the Russian mailing list,
here:

http://nginx.org/pipermail/nginx-ru/2011-March/040396.html

The cache manager was spinning in ngx_spinlock(), waiting for the cache
lock to be released. The cache lock was never released, as it was held by
a worker process which caught SIGSEGV while holding it.

Maxim D.

I had a similar issue 2 hours before, when 2 worker processes segfaulted.
As I'm using a custom module, I didn't connect the two events.

2011/4/1 Maxim D. [email protected]

2011/4/1 Jérôme L. [email protected]:

Hi,

I had a similar issue 2 hours before, when 2 worker processes segfaulted.
As I'm using a custom module, I didn't connect the two events.

I am seeing this issue as well. Based on my initial tests, the issue
seems to manifest itself easily with the following approach:

  1. Configure a simple reverse proxy setup with caching enabled
  2. Warm up the cache with some data
  3. Shut down the upstreams so requests from clients start to time out
    (the upstream must not answer at the IP level)
  4. Send SIGHUP to the master process a few times.

After a few retries, the nginx cache manager will go into an infinite loop
and start consuming 100% CPU. If the SIGHUP exercise is repeated, you will
also end up with more than one cache manager process. Then, you will start
seeing stuff like this in your error log:

2011/04/04 13:08:28 [crit] 5036#0: unlink() "/cache/static/3/4/4b/3563f09ec1e13b6f479efb8db6ea94b4" failed (2: No such file or directory)
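
For reference, a stripped-down configuration matching steps 1-3 (inside the
http block; the upstream address, cache path and zone name are placeholders)
would be something like:

proxy_cache_path /cache/static levels=1:1:2 keys_zone=static:10m inactive=1d max_size=1g;

server {
    listen 80;

    location / {
        proxy_pass        http://10.0.0.1:8080;  # upstream that is later shut down/firewalled
        proxy_cache       static;
        proxy_cache_valid 200 10m;
    }
}

Step 4 is then just sending SIGHUP to the master a few times in a row,
e.g. kill -HUP `cat /path/to/nginx.pid`.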

I will try to get a gdb trace of the runaway process.

How can I help find the root cause of this problem? It's pretty vital to be
able to reliably reload the configuration.

Thanks,
Srebrenko

I too notice this on one of my servers occasionally.

2011/4/6 Srebrenko Šehić [email protected]

Hello!

On Wed, Apr 06, 2011 at 05:55:58PM +0200, Srebrenko Šehić wrote:

nginx: cache manager process (nginx)

kdump -R | head -50

28055 nginx 1302104178.942729 EMUL “native”
28055 nginx 0.000051 RET gettimeofday 0
28055 nginx 0.000031 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000011 RET gettimeofday 0

No, this isn't in a spinlock. While looping in a spinlock nginx only
runs and calls sched_yield(), nothing more. This is another problem.

[…]

I also found out that setting “proxy_cache_path … max_size=10k” (a
ridiculously small number) while there is traffic on the virtualhost
triggers the problem even without nginx receiving a SIGHUP.

I was able to reproduce the problem. With max_size=0 it starts
hogging cpu right on startup.

Try the attached patch. It’s against 0.9.7, but probably will
apply cleanly or with minor modifications to 0.8.* as well.

Maxim D.

On Wed, Apr 6, 2011 at 8:17 PM, Maxim D. [email protected] wrote:

Hi,

I was able to reproduce the problem. With max_size=0 it starts
hogging cpu right on startup.

Try the attached patch. It’s against 0.9.7, but probably will
apply cleanly or with minor modifications to 0.8.* as well.

Thanks a bunch Maxim! I applied the patch to my 0.8.50 installation.
Problem solved. No more cache manager hogging the CPU.

Igor, can we get this committed upstream?

Thanks,
Srebrenko

Unfortunately, it still happens with the 1.0.2 release as well.

I have seen it over 10 times on two different servers since upgrading to
1.0.0, then 1.0.1 and 1.0.2.

Not sure how to debug this, as it is a random issue, yet it causes downtime
until our auto-healing kills all nginx processes and waits for the load to
stabilize before starting nginx again.

Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,187322,199347#msg-199347

2011/4/5 Srebrenko Šehić [email protected]:

I will try to get a gdb trace of the runaway process.

Some more info about this problem. I had the cache manager in a spinlock
using 100% CPU. I did a ktrace (this is on OpenBSD 4.8/amd64 running
0.8.50) and here is the output (edited for brevity):

ps auxww | grep 28055 | grep -v grep

_proxy 28055 99.1 1.9 646012 39096 ?? R/0 2:28PM 75:07.06 nginx: cache manager process (nginx)

kdump -R | head -50

28055 nginx 1302104178.942729 EMUL “native”
28055 nginx 0.000051 RET gettimeofday 0
28055 nginx 0.000031 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000011 RET gettimeofday 0
28055 nginx 0.000204 CALL getpid()
28055 nginx 0.000008 RET getpid 28055/0x6d97
28055 nginx 0.000026 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000010 RET gettimeofday 0
28055 nginx 0.000027 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000020 RET gettimeofday 0
28055 nginx 0.000040 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000009 RET gettimeofday 0
28055 nginx 0.000027 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000006 RET gettimeofday 0
28055 nginx 0.000027 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000006 RET gettimeofday 0
28055 nginx 0.000015 CALL getpid()
28055 nginx 0.000004 RET getpid 28055/0x6d97
28055 nginx 0.000019 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000006 RET gettimeofday 0
28055 nginx 0.000026 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000007 RET gettimeofday 0
28055 nginx 0.000026 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000007 RET gettimeofday 0
28055 nginx 0.000026 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000007 RET gettimeofday 0
28055 nginx 0.000026 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000006 RET gettimeofday 0
28055 nginx 0.000015 CALL getpid()
28055 nginx 0.000011 RET getpid 28055/0x6d97
28055 nginx 0.000018 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000006 RET gettimeofday 0
28055 nginx 0.000027 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000006 RET gettimeofday 0
28055 nginx 0.000027 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000006 RET gettimeofday 0
28055 nginx 0.000026 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000006 RET gettimeofday 0
28055 nginx 0.000027 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000006 RET gettimeofday 0
28055 nginx 0.000016 CALL getpid()
28055 nginx 0.000003 RET getpid 28055/0x6d97
28055 nginx 0.000018 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000006 RET gettimeofday 0
28055 nginx 0.000026 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000007 RET gettimeofday 0
28055 nginx 0.000026 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000007 RET gettimeofday 0
28055 nginx 0.000026 CALL gettimeofday(0x7f7ffffea2a0,0)
28055 nginx 0.000006 RET gettimeofday 0

As you can see, nginx is stuck calling gettimeofday() and getpid() in an
endless loop. Filtering out those two syscalls from the kdump output gives
this:

kdump -R | grep -v gettime | grep -v getpid

28055 nginx 1302104178.942729 EMUL "native"
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000014 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000014 RET __sysctl 0
28055 nginx 0.000009 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000012 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000011 RET __sysctl 0
28055 nginx 0.000007 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000017 RET __sysctl 0
28055 nginx 0.000018 CALL sched_yield()
28055 nginx 0.000013 RET sched_yield 0
28055 nginx 0.000007 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000015 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000014 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000022 RET __sysctl 0
28055 nginx 0.000007 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000015 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000013 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000013 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000017 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000013 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000014 RET __sysctl 0
28055 nginx 0.000007 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000014 RET __sysctl 0
28055 nginx 0.000007 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000015 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000013 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000013 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000014 RET __sysctl 0
28055 nginx 0.000007 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000013 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000014 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000014 RET __sysctl 0
28055 nginx 0.000008 CALL __sysctl(1.37,0x7f7ffffea130,0x7f7ffffea118,0,0)
28055 nginx 0.000013 RET __sysctl 0

Something is “rotten” in the cache manager.

I also found out that setting “proxy_cache_path … max_size=10k” (a
ridiculously small number) while there is traffic on the virtualhost
triggers the problem even without nginx receiving a SIGHUP.
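
That is, a single line like the following is enough (the zone name and
levels are placeholders; only the tiny max_size matters):

proxy_cache_path /cache/static levels=1:1:2 keys_zone=static:10m max_size=10k;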

Here is a gdb trace (no debugging symbols - I will try to get that):

(gdb) attach 28055
Attaching to program: /usr/local/sbin/nginx, process 28055


0x00000002072246fa in getpid () from /usr/lib/libc.so.56.0
(gdb) bt
#0 0x00000002072246fa in getpid () from /usr/lib/libc.so.56.0
#1 0x000000020728ce99 in arc4random_stir () from /usr/lib/libc.so.56.0
#2 0x000000020728cef9 in arc4random_buf () from /usr/lib/libc.so.56.0
#3 0x000000020726a795 in _citrus_utf8_ctype_wcsrtombs () from /usr/lib/libc.so.56.0
#4 0x000000020726c12d in free () from /usr/lib/libc.so.56.0
#5 0x000000020726cba3 in malloc () from /usr/lib/libc.so.56.0
#6 0x0000000000423c3e in ngx_alloc ()
#7 0x000000000044cde2 in ngx_http_file_cache_set_header ()
#8 0x000000000044d080 in ngx_http_file_cache_set_header ()
#9 0x0000000000427329 in ngx_single_process_cycle ()
#10 0x00000000004219aa in ngx_event_expire_timers ()
#11 0x000000000042189d in ngx_process_events_and_timers ()
#12 0x0000000000427f57 in ngx_master_process_cycle ()
#13 0x0000000000425cff in ngx_spawn_process ()
#14 0x0000000000426833 in ngx_single_process_cycle ()
#15 0x00000000004276c6 in ngx_master_process_cycle ()
#16 0x000000000040e616 in main ()
(gdb) bt
#0 0x00000002072246fa in getpid () from /usr/lib/libc.so.56.0
#1 0x000000020728ce99 in arc4random_stir () from /usr/lib/libc.so.56.0
#2 0x000000020728cef9 in arc4random_buf () from /usr/lib/libc.so.56.0
#3 0x000000020726a795 in _citrus_utf8_ctype_wcsrtombs () from /usr/lib/libc.so.56.0
#4 0x000000020726c12d in free () from /usr/lib/libc.so.56.0
#5 0x000000020726cba3 in malloc () from /usr/lib/libc.so.56.0
#6 0x0000000000423c3e in ngx_alloc ()
#7 0x000000000044cde2 in ngx_http_file_cache_set_header ()
#8 0x000000000044d080 in ngx_http_file_cache_set_header ()
#9 0x0000000000427329 in ngx_single_process_cycle ()
#10 0x00000000004219aa in ngx_event_expire_timers ()
#11 0x000000000042189d in ngx_process_events_and_timers ()
#12 0x0000000000427f57 in ngx_master_process_cycle ()
#13 0x0000000000425cff in ngx_spawn_process ()
#14 0x0000000000426833 in ngx_single_process_cycle ()
#15 0x00000000004276c6 in ngx_master_process_cycle ()
#16 0x000000000040e616 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x000000020722cc4a in gettimeofday () from /usr/lib/libc.so.56.0
(gdb) bt
#0 0x000000020722cc4a in gettimeofday () from /usr/lib/libc.so.56.0
#1 0x0000000000417982 in ngx_time_update ()
#2 0x000000000044c73a in ngx_http_file_cache_set_header ()
#3 0x000000000044d08a in ngx_http_file_cache_set_header ()
#4 0x0000000000427329 in ngx_single_process_cycle ()
#5 0x00000000004219aa in ngx_event_expire_timers ()
#6 0x000000000042189d in ngx_process_events_and_timers ()
#7 0x0000000000427f57 in ngx_master_process_cycle ()
#8 0x0000000000425cff in ngx_spawn_process ()
#9 0x0000000000426833 in ngx_single_process_cycle ()
#10 0x00000000004276c6 in ngx_master_process_cycle ()
#11 0x000000000040e616 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x000000020722cc4a in gettimeofday () from /usr/lib/libc.so.56.0
(gdb) bt
#0 0x000000020722cc4a in gettimeofday () from /usr/lib/libc.so.56.0
#1 0x0000000000417982 in ngx_time_update ()
#2 0x000000000044c73a in ngx_http_file_cache_set_header ()
#3 0x000000000044d08a in ngx_http_file_cache_set_header ()
#4 0x0000000000427329 in ngx_single_process_cycle ()
#5 0x00000000004219aa in ngx_event_expire_timers ()
#6 0x000000000042189d in ngx_process_events_and_timers ()
#7 0x0000000000427f57 in ngx_master_process_cycle ()
#8 0x0000000000425cff in ngx_spawn_process ()
#9 0x0000000000426833 in ngx_single_process_cycle ()
#10 0x00000000004276c6 in ngx_master_process_cycle ()
#11 0x000000000040e616 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x000000020722cc4a in gettimeofday () from /usr/lib/libc.so.56.0
(gdb) bt
#0 0x000000020722cc4a in gettimeofday () from /usr/lib/libc.so.56.0
#1 0x0000000000417982 in ngx_time_update ()
#2 0x000000000044c73a in ngx_http_file_cache_set_header ()
#3 0x000000000044d08a in ngx_http_file_cache_set_header ()
#4 0x0000000000427329 in ngx_single_process_cycle ()
#5 0x00000000004219aa in ngx_event_expire_timers ()
#6 0x000000000042189d in ngx_process_events_and_timers ()
#7 0x0000000000427f57 in ngx_master_process_cycle ()
#8 0x0000000000425cff in ngx_spawn_process ()
#9 0x0000000000426833 in ngx_single_process_cycle ()
#10 0x00000000004276c6 in ngx_master_process_cycle ()
#11 0x000000000040e616 in main ()

Any input is really appreciated.

Thanks,
Srebrenko

One possible reason could be some race condition when the server is
still running in the reload state and it receives another reload command:

  • this can be caused by provisioning in the Aegir system for Drupal, as it
    sometimes reloads nginx twice during one action, like a platform/site
    migration, to make sure all vhosts are reloaded correctly.

Posted at Nginx Forum:
http://forum.nginx.org/read.php?2,187322,199366#msg-199366

Hello!

On Thu, May 19, 2011 at 03:51:11PM -0400, omega8cc wrote:

Unfortunately, it still happens with the 1.0.2 release as well.

I have seen it over 10 times on two different servers since upgrading to
1.0.0, then 1.0.1 and 1.0.2.

Not sure how to debug this, as it is a random issue, yet it causes downtime
until our auto-healing kills all nginx processes and waits for the load to
stabilize before starting nginx again.

There were two issues discussed in this thread which caused the CPU
hog in the cache manager:

  1. An earlier SIGSEGV in a worker process while holding the cache zone
    lock, with a resulting ngx_spinlock() loop in the cache manager. The
    fix is to trace and fix this earlier SIGSEGV (looking through the
    error log should help, btw; see the grep example below).

  2. The cache manager was looping if it wasn't able to make any
    progress. This was fixed in 1.0.0 (the patch posted in this thread
    was committed).
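
For example, a crashed worker shows up in the master's error log as
"worker process NNNN exited on signal 11", so something like this should
find it (the log path is just an example):

grep "exited on signal" /var/log/nginx/error.log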

I suspect your problem is (1), though it may be some other problem
as well. Please provide at least a backtrace of the runaway process to
make further debugging possible. This may be done with

gdb /path/to/nginx
(gdb) attach <pid>
(gdb) bt

(and please make sure your binary isn't stripped, else the backtrace
will contain only '??' instead of function names)

Running something like ktrace/strace may also be helpful.
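
For example (the pid is a placeholder; ktrace/kdump on OpenBSD as used
earlier in this thread, strace on Linux):

# OpenBSD
ktrace -p <pid>
kdump -R | head -50

# Linux
strace -tt -p <pid> -o /tmp/nginx-cache-manager.strace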

Maxim D.
