Cache manager occasionally stops deleting cached files

Hello,

I’m having an issue where the nginx (1.8) cache manager suddenly just
stops deleting content, so the disk soon fills up until I restart nginx
by hand. After it is restarted, it works normally for a couple of days,
but then it happens again. The cache has some 30-40k files, nothing
huge. The relevant config lines are:

proxy_cache_path    /home/cache/ levels=2:2 keys_zone=cache:25m inactive=7d max_size=2705g use_temp_path=on;
proxy_temp_path     /dev/shm/temp; # reduces parallel writes on the disk
proxy_cache_lock on;
proxy_cache_lock_age 10s;
proxy_cache_lock_timeout 30s;
proxy_ignore_client_abort on;

The server gets roughly 100 rps. Normally the cache manager deletes a
couple of files every few seconds; however, when it gets stuck, this is
all it does for 20-30 minutes or more, i.e. there are 0 unlinks (until I
restart it and it rereads the on-disk cache):

epoll_wait(14, {}, 512, 1000) = 0
epoll_wait(14, {}, 512, 1000) = 0
epoll_wait(14, {}, 512, 1000) = 0
epoll_wait(14, {}, 512, 1000) = 0
gettid() = 11303
write(24, "2016/02/18 08:22:02 [alert] 11303#11303: ignore long locked inactive cache entry 380d3f178017bcd92877ee322b006bbb, count:1\n", 123) = 123
gettid() = 11303
write(24, "2016/02/18 08:22:02 [alert] 11303#11303: ignore long locked inactive cache entry 7b9239693906e791375a214c7e36af8e, count:24\n", 124) = 124
epoll_wait(14, {}, 512, 1000) = 0

I assume the mentioned error is due to relatively frequent nginx
restarts and is benign. There’s nothing else in the error log (except
for occasional upstream timeouts). I’m aware this likely isn’t enough
info to debug the issue, but do you at least have some ideas on what
might be causing it and where to look? My wild guess is that the cache
manager waits for some lock to be released, but it never gets released,
so it just waits indefinitely.

Thanks,
Vedran


Hello!

On Thu, Feb 18, 2016 at 11:20:55AM -0500, vedranf wrote:

My wild guess is that the cache manager waits for some lock to be
released, but it never gets released, so it just waits indefinitely.

The error logged means that nginx is going to remove an inactive
cache entry, but it is still locked by some requests. Unless the
inactive time is very low (not your case), it indicates a problem
somewhere else.

Such locked entries can’t be removed from cache. Additionally,
once there are enough such locked entries, nginx won’t be able to
purge cache based on max_size. That is, it’s expected that nginx
will have problems with removing entries from cache if you see
such messages.
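
To get a rough idea of how many such entries have accumulated, counting
the alert messages is usually enough.  A quick sketch (adjust the path
to wherever your error_log points):

grep -c "ignore long locked inactive cache entry" /var/log/nginx/error.log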

The most trivial reason for such messages is abnormally killed nginx
processes. That is, if some processes die due to bugs, or are killed
by an unwary administrator or an incorrect script, the problem
will appear sooner or later.

To further debug things, try the following:

  • restart nginx and record pids of all nginx processes;

  • once the problem starts to appear again, check whether the same
    processes are still running;

  • if some processes differ from the ones recorded, dig further to
    find out why.

Some trivial things like looking into logs for “worker process
exited …” messages and checking if the problem persists without
3rd party modules compiled in (see “nginx -V”) may also help.
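
For example, a minimal sketch of such a check (the snapshot files and
log path below are just examples):

# right after restart: record the master and worker pids
ps -C nginx -o pid=,ppid=,cmd= > /tmp/nginx.pids.start

# when the problem reappears: compare against the snapshot
ps -C nginx -o pid=,ppid=,cmd= > /tmp/nginx.pids.now
diff /tmp/nginx.pids.start /tmp/nginx.pids.now

# look for crashed workers and check compiled-in 3rd party modules
grep "worker process exited" /var/log/nginx/error.log
nginx -V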


Maxim D.
http://nginx.org/

Maxim D. Wrote:

Hello!

Hello and thanks for the reply!

That is, it’s expected that nginx will have problems with removing
entries from cache if you see such messages.

The most trivial reason for such messages is abnormally killed nginx
processes. That is, if some processes die due to bugs, or are killed
by an unwary administrator or an incorrect script, the problem
will appear sooner or later.

I see. I do have 1000-2000 such errors in the log per day, definitely
more than a couple of months ago. I remember the server crashed in the
past, but not recently.

Some trivial things like looking into logs for “worker process
exited …” messages and checking if the problem persists without
3rd party modules compiled in (see “nginx -V”) may also help.

Thanks, I’ll dig deeper. I do have 3rd party modules and there are
occasional messages such as “worker process exited on signal 11”, but
they are rare; I’ll try to figure out what causes them, but it’ll take
time. However, now that this has already happened, is it possible to
somehow unlock all entries and start clean, but without removing all
cached content? Or alternatively, can I delete the locked files
manually as a workaround?

Regards,
Vedran


Hello!

On Fri, Feb 19, 2016 at 08:00:54AM -0500, vedranf wrote:

[…]

Thanks, I’ll dig deeper. I do have 3rd party modules and there are
occasional messages such as “worker process exited on signal 11”, but
they are rare; I’ll try to figure out what causes them, but it’ll take time.

So the problem with cache is clear - one worker crash is usually
enough to stop nginx from removing cache items based on max_size
sooner or later. You have to further debug crashes to fix things.
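
If it helps, a rough way to debug such crashes (the directory and
binary path below are assumptions, adjust to your system) is to let
worker processes dump core and then look at the backtrace:

# nginx.conf, main context
worker_rlimit_core  500m;
working_directory   /var/tmp/nginx/;  # must exist and be writable by the worker user

# after the next "worker process exited on signal 11"
# (the core file name depends on kernel.core_pattern):
gdb /usr/sbin/nginx /var/tmp/nginx/core
(gdb) backtrace full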

However, now that this has already happened, is it possible to somehow
unlock all entries and start clean, but without removing all cached
content? Or alternatively, can I delete the locked files manually as a
workaround?

You need nginx to reload its cache metadata (as stored in
keys_zone). This can be done either with a restart or a binary
upgrade to the same nginx binary, see:

http://nginx.org/en/docs/control.html#upgrade

There is no need to delete cache files.
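
For completeness, the binary upgrade described at the link above boils
down to two signals sent to the master process (a sketch; the pid file
location is an assumption):

# start a new master with the same binary; the old pid file is renamed to nginx.pid.oldbin
kill -USR2 $(cat /run/nginx.pid)

# once the new workers are serving traffic, gracefully shut down the old master
kill -QUIT $(cat /run/nginx.pid.oldbin)

Unlike a plain restart, this keeps serving requests while the cache
loader rebuilds the keys_zone metadata from the files on disk.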


Maxim D.
http://nginx.org/

vedranf Wrote:

I’m having an issue where the nginx (1.8) cache manager suddenly just
stops deleting content, so the disk soon fills up until I restart nginx
by hand.

[…]

My wild guess is that the cache manager waits for some lock to be
released, but it never gets released, so it just waits indefinitely.

Thanks,
Vedran

We have the same problem, but I’m not sure that it is caused by
frequent nginx restarts.

As far as I know, the problem has existed since version 1.6 (maybe even
earlier; 1.4.6 from the Ubuntu repo is not affected) up to the current
version (1.9.9).

I’ve collected related forum posts (they should help analyze the problem):
https://forum.nginx.org/read.php?21,258292,258292#msg-258292

Also, I think it’s somehow related to a writing-connections leak (see
the image link):

https://s3.eu-central-1.amazonaws.com/drive-public-eu/nginx/betelgeuse_nginx_connections.PNG

Here is our standard nginx configuration (before January 28) with a
7-day inactive time:

proxy_cache_path /mnt/cache1/nginx levels=2:2 keys_zone=a.d-1_cache:2143M inactive=7d max_size=643G loader_sleep=1ms;

Every ~8 days (when writing connections reach the ~10k mark) the cache
starts growing and fills the disk. The drops in writing connections on
the graph are nginx restarts.

On January 28 I changed the inactive time to 8h. After writing
connections hit the ~10k mark, nginx starts filling the logs with
“ignore long locked inactive cache entry” messages (1-2 messages per
minute on average).

As you can see, writing connections grow continuously. (When we had to
power off the machine, they had reached ~60k.)

For counting nginx connections we use the standard http_stub_status_module.

I think that nginx’s “reference counter” could be broken, because the
total number of established TCP connections remains the same all the time.
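
For reference, a minimal stub_status setup that exposes these counters
looks roughly like this (the location name, address, and polling loop
are just examples, not our exact configuration):

# nginx.conf
location /nginx_status {
    stub_status;        # "stub_status on;" on versions before 1.7.5
    allow 127.0.0.1;
    deny all;
}

# shell: sample the "Writing" counter once a minute
while true; do
    w=$(curl -s http://127.0.0.1/nginx_status | awk '/Reading/ {print $4}')
    echo "$(date) writing=$w"
    sleep 60
done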


Hello again.

Maxim D. Wrote:

This particular case was traced to segmentation faults, likely
caused by 3rd party modules.

[…]

Writing connections will grow due to segmentation faults as well,
so you likely have the same problem. See the basic
recommendations in my initial answer in this thread.

I’ve made a custom nginx build using the latest version (1.9.13)
without 3rd party modules:

nginx -V
nginx version: nginx/1.9.13
built by gcc 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.1)
built with OpenSSL 1.0.1f 6 Jan 2014
TLS SNI support enabled
configure arguments: --with-cc-opt='-g -O2 -fPIE -fstack-protector
--param=ssp-buffer-size=4 -Wformat -Werror=format-security
-D_FORTIFY_SOURCE=2' --with-ld-opt='-Wl,-Bsymbolic-functions -fPIE -pie
-Wl,-z,relro -Wl,-z,now' --prefix=/usr/share/nginx
--conf-path=/etc/nginx/nginx.conf
--http-log-path=/var/log/nginx/access.log
--error-log-path=/var/log/nginx/error.log
--lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid
--http-client-body-temp-path=/var/lib/nginx/body
--http-fastcgi-temp-path=/var/lib/nginx/fastcgi
--http-proxy-temp-path=/var/lib/nginx/proxy
--http-scgi-temp-path=/var/lib/nginx/scgi
--http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-pcre-jit
--with-ipv6 --with-http_ssl_module --with-http_stub_status_module
--with-http_realip_module --with-http_auth_request_module
--with-http_addition_module --with-http_dav_module
--with-http_gunzip_module --with-http_gzip_static_module
--with-http_v2_module --with-http_sub_module --with-http_xslt_module
--with-stream --with-stream_ssl_module --with-mail
--with-mail_ssl_module --with-threads

Nothing changed: connections continuously grow. The cache manager still
works fine and the disk is not full yet, but I think it’s a matter of
2-3 days.

The PIDs haven’t changed since the start, and the log doesn’t contain
any “worker process exited …” messages.


Hello!

On Thu, Feb 25, 2016 at 05:46:32AM -0500, vergil wrote:

vedranf Wrote:

Hello,

I’m having an issue where the nginx (1.8) cache manager suddenly just
stops deleting content, so the disk soon fills up until I restart nginx
by hand. After it is restarted, it works normally for a couple of days,
but then it happens again. The cache has some 30-40k files, nothing
huge. The relevant config lines are:

[…]

We have the same problem, but I’m not sure that it is caused by
frequent nginx restarts.

This particular case was traced to segmentation faults, likely
caused by 3rd party modules.

[…]

Also, I think it’s somehow related to a writing-connections leak (see
the image link):

https://s3.eu-central-1.amazonaws.com/drive-public-eu/nginx/betelgeuse_nginx_connections.PNG

[…]

As you can see, writing connections grow continuously. (When we had to
power off the machine, they had reached ~60k.)

For counting nginx connections we use the standard http_stub_status_module.

I think that nginx’s “reference counter” could be broken, because the
total number of established TCP connections remains the same all the time.

Writing connections will grow due to segmentation faults as well,
so you likely have the same problem. See the basic
recommendations in my initial answer in this thread.


Maxim D.
http://nginx.org/