On Wed, Dec 23, 2015 at 11:57:53AM +0300, wrote:
> What happens with requests after this time? Do they hit the backend and
> update the cache? Do they use the stale version?
As long as an entry is locked and not accessed for the inactive
time, it will cause the alert in question to be logged by the
cache manager. Nothing else happens - that is, the entry remains
locked, and if it is marked as currently being updated, requests
for this entry will return stale content with the UPDATING status,
as per proxy_cache_use_stale.
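To illustrate, serving stale content while an entry is being refreshed is
enabled with the “updating” parameter of proxy_cache_use_stale. A minimal
sketch (the cache path, zone name, and upstream are made up for
illustration):

```nginx
# Hypothetical cache zone; adjust path/name/sizes to your setup.
proxy_cache_path /var/cache/nginx keys_zone=cache_one:10m inactive=60m;

server {
    location / {
        proxy_pass http://backend;
        proxy_cache cache_one;

        # Return stale entries (with $upstream_cache_status set to
        # UPDATING) while the entry is being refreshed:
        proxy_cache_use_stale updating error timeout;
    }
}
```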
This alert may in theory happen in a normal situation, if a
backend response takes longer than the inactive time set. But in
general it indicates that there is a problem.
(Note well that “locked” here should not be confused with
request-level locks as in proxy_cache_lock.)
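For contrast, the request-level lock is a separate mechanism, enabled per
location with proxy_cache_lock; a sketch (directive values are
illustrative):

```nginx
location / {
    proxy_pass http://backend;
    proxy_cache cache_one;

    # Allow only one request at a time to populate a missing cache
    # element; other requests for the same key wait, up to
    # proxy_cache_lock_timeout:
    proxy_cache_lock on;
    proxy_cache_lock_timeout 5s;
}
```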
> I’m going to check “long locked” messages in the log to see how many
> were for the “/” location.
> The hash should be the same if we didn’t change the cache key, right?
Any such message indicates that there is a problem. If the “/”
location is requested often, it likely won’t be mentioned in the
“ignore long locked” alerts.

The most likely reason is that a worker died or was killed while
holding a lock on a cache node (i.e., while a request was waiting
for a new response from a backend).
> Shouldn’t there be a record in the error log? The error log level is
> at “warn”.
There should be something like “worker process exited” messages,
though at least some of them may be on the “notice” level (e.g.,
if you force an nginx worker to quit by sending SIGTERM to it
directly).
>> Trivial things to consider:
>> - check logs for segmentation faults;
> no seg faults in logs
Any other alerts, crit or emerg messages? Any alerts about
socket leaks on configuration reloads (“open socket left…”), or
sockets in the CLOSED state? How long do relevant requests to the
upstream server take (i.e., make sure that the behaviour observed
isn’t a result of an upstream server just being very slow)?
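The log checks above can be scripted; a hedged sketch (the log file name
and the sample log lines are fabricated for illustration - point LOG at
your actual error_log path and skip the sample-file step):

```shell
#!/bin/sh
# Scan an nginx error log for the symptoms discussed above.
LOG="${LOG:-error.log.sample}"

# Fabricated sample lines so the commands can be tried standalone;
# remove this when pointing LOG at a real error log.
cat > "$LOG" <<'EOF'
2015/12/23 01:30:00 [alert] 1234#0: ignore long locked inactive cache entry
2015/12/23 01:30:05 [alert] 1234#0: open socket left in connection 5
2015/12/23 01:30:10 [notice] 1#0: worker process exited on signal 15
EOF

grep -cE '\[(alert|crit|emerg)\]' "$LOG"   # count serious messages
grep -c 'open socket left' "$LOG"          # socket leaks on reload
grep -c 'worker process exited' "$LOG"     # worker exits (check the signal)
```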
>> - if you are using 3rd party modules / patches, try without them;
> Built from the FreeBSD port; updated the gist with nginx -V output.
The only suspicious thing I see there is “ssi on” and
“proxy_cache_lock” at the same time. There were some fixes
related to proxy_cache_lock with subrequests in more recent
versions.
>> - make sure you don’t kill worker processes yourself or using some
>> automation scripts (in particular, don’t try to terminate old
>> worker processes after a configuration reload).
> One recent appearance of the problem was at 1:30 AM, and I checked
> the logs for crazy midnight deploys - nothing.
> Also, we don’t use anything custom to restart nginx, just regular
> service commands.
A regular kill will be enough to screw things up. On the other
hand, it may be an nginx bug as well.
Note, though, that the version you are using is quite old.
Debugging anything but 1.9.x hardly makes sense. You may want to
try upgrading to the latest nginx-devel port.
Note well that if the problem happens, it’s a good idea to either
completely restart nginx or do the binary upgrade procedure
(“service nginx upgrade”) to make sure the shared memory
structures are intact.