I have an MRI 2.1.3 process in production which is completely
unresponsive to signals and appears to be dead. Debugging it seems to
indicate that all threads are blocked on something, but I can't tell
what.
Here’s the GitHub issue with detailed info. Any tips on how to
determine which threads are deadlocked and/or what lock/mutex they are
waiting on?
GitHub issue, opened 23 Mar 2015, closed 25 Mar 2015 (UTC):
A customer is reporting Sidekiq is "stuck": it no longer pulls jobs or responds to signals.
When running `sudo strace -p <PID>`, the output is simply:
```
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call)
futex(0x7f07a814bf64, FUTEX_WAIT_PRIVATE, 64228, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call)
futex(0x7f07a814bf64, FUTEX_WAIT_PRIVATE, 64228, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a) = -1 EINTR (Interrupted system call)
futex(0x7f07a814bf64, FUTEX_WAIT_PRIVATE, 64228, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
```
Note **ERESTARTSYS**. I've never seen that errno before.
When I run gdb, all threads appear blocked:
```
(gdb) info threads
Id Target Id Frame
37 Thread 0x7f07a7680700 (LWP 32477) "ruby-timer-thr" 0x00007f07a66c8d13 in poll () from /lib/x86_64-linux-gnu/libc.so.6
36 Thread 0x7f07a2a57700 (LWP 32478) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
35 Thread 0x7f07a2755700 (LWP 32479) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
34 Thread 0x7f07a24d3700 (LWP 32480) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
33 Thread 0x7f07a22d1700 (LWP 32481) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
32 Thread 0x7f07a74f6700 (LWP 32483) "SignalSender" 0x00007f07a7255490 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
31 Thread 0x7f079ce51700 (LWP 32484) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
30 Thread 0x7f079cd50700 (LWP 32487) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
29 Thread 0x7f079ca48700 (LWP 32496) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
28 Thread 0x7f079c947700 (LWP 32497) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
27 Thread 0x7f079c846700 (LWP 32498) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
26 Thread 0x7f079c745700 (LWP 32499) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
25 Thread 0x7f079c644700 (LWP 32500) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
24 Thread 0x7f079c543700 (LWP 32501) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
23 Thread 0x7f079c442700 (LWP 32502) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
22 Thread 0x7f079c341700 (LWP 32503) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
21 Thread 0x7f079c240700 (LWP 32504) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
20 Thread 0x7f079c13f700 (LWP 32505) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
19 Thread 0x7f0797fff700 (LWP 32506) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
18 Thread 0x7f0797efe700 (LWP 32507) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
17 Thread 0x7f0797dfd700 (LWP 32508) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
16 Thread 0x7f0797cfc700 (LWP 32509) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
15 Thread 0x7f0797bfb700 (LWP 32510) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
14 Thread 0x7f0797afa700 (LWP 32511) "ruby" 0x00007f07a7255d5c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
13 Thread 0x7f07979f9700 (LWP 32512) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
12 Thread 0x7f07978f8700 (LWP 32513) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
11 Thread 0x7f07977f7700 (LWP 32514) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
10 Thread 0x7f07976f6700 (LWP 32515) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
9 Thread 0x7f07975f5700 (LWP 32516) "ruby" 0x00007f07a66c8d13 in poll () from /lib/x86_64-linux-gnu/libc.so.6
8 Thread 0x7f07974f4700 (LWP 32517) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
7 Thread 0x7f07973f3700 (LWP 32518) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
6 Thread 0x7f07972f2700 (LWP 32519) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
5 Thread 0x7f07971f1700 (LWP 32520) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
4 Thread 0x7f07970f0700 (LWP 32521) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
3 Thread 0x7f0796e6f700 (LWP 32522) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
2 Thread 0x7f0796a6e700 (LWP 1452) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
* 1 Thread 0x7f07a7676700 (LWP 32475) "ruby" 0x00007f07a7253344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb)
```
Note there are a couple of odd threads; let's take a look at Thread 14:
```
(gdb) thread 14
[Switching to thread 14 (Thread 0x7f0797afa700 (LWP 32511))]
#0 0x00007f07a7255d5c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007f07a7255d5c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f07a72513c4 in _L_lock_997 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f07a7251227 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f07a129f03d in v8::internal::ThreadManager::Lock() ()
from /srv/cm/shared/bundle/ruby/2.1.0/extensions/x86_64-linux/2.1.0-static/therubyracer-0.12.1/v8/init.so
#4 0x00007f07a129f64f in v8::Locker::Initialize(v8::Isolate*) ()
from /srv/cm/shared/bundle/ruby/2.1.0/extensions/x86_64-linux/2.1.0-static/therubyracer-0.12.1/v8/init.so
#5 0x00007f07a1099711 in rr::Locker::setupLockAndCall (state=0x7f07961e46d4, code=139671018516760) at locker.cc:47
#6 0x00007f07a10996cf in rr::Locker::doLock (argc=0, argv=0x7f079a328108, self=139670864563880) at locker.cc:39
#7 0x00007f07a77e1e00 in vm_call_cfunc_with_frame (th=0x7f07b3571970, reg_cfp=0x7f079a345e70, ci=<optimized out>) at vm_insnhelper.c:1489
#8 0x00007f07a77e7383 in vm_exec_core (th=0x7f07b3571970, initial=initial@entry=0) at insns.def:999
#9 0x00007f07a77eac4a in vm_exec (th=th@entry=0x7f07b3571970) at vm.c:1363
#10 0x00007f07a77ec88b in invoke_block_from_c (th=0x7f07b3571970, block=0x7f079a346020, self=139671036950840, argc=<optimized out>, argv=<optimized out>,
blockptr=<optimized out>, cref=0x0, defined_class=139670875465600) at vm.c:782
#11 0x00007f07a77f3e99 in vm_yield (argv=0x7f07961e4e28, argc=1, th=<optimized out>) at vm.c:821
#12 rb_yield_0 (argv=0x7f07961e4e28, argc=1) at vm_eval.c:938
```
It's stuck inside therubyracer. It's worrisome that the call does not appear to release the GVL. If TRR takes a lock and waits forever, does that lock up MRI?
How about Thread 9?
```
(gdb) thread 9
[Switching to thread 9 (Thread 0x7f07975f5700 (LWP 32516))]
#0 0x00007f07a66c8d13 in poll () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f07a66c8d13 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f07a051dcd0 in ?? () from /usr/lib/libpq.so.5
#2 0x00007f07a051ddc0 in ?? () from /usr/lib/libpq.so.5
#3 0x00007f07a051bf59 in PQgetResult () from /usr/lib/libpq.so.5
#4 0x00007f07a051c1f8 in ?? () from /usr/lib/libpq.so.5
#5 0x00007f07a0743799 in gvl_PQexecPrepared_skeleton (data=0x7f0796f62440) at gvl_wrappers.c:9
#6 0x00007f07a78059c2 in call_without_gvl (func=func@entry=0x7f07a0743770 <gvl_PQexecPrepared_skeleton>, data1=data1@entry=0x7f0796f62440,
ubf=<optimized out>, ubf@entry=0xffffffffffffffff, data2=<optimized out>, data2@entry=0x0, fail_if_interrupted=0) at thread.c:1266
#7 0x00007f07a7805ca5 in rb_thread_call_without_gvl (func=func@entry=0x7f07a0743770 <gvl_PQexecPrepared_skeleton>, data1=data1@entry=0x7f0796f62440,
ubf=ubf@entry=0xffffffffffffffff, data2=data2@entry=0x0) at thread.c:1376
```
Blocked inside the postgres driver. Looks normal.
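To make the contrast concrete, here is a minimal sketch (illustrative names only, not actual therubyracer or pg code) of the two extension patterns in those backtraces: blocking while still holding the GVL versus giving it up first with `rb_thread_call_without_gvl`.
```c
#include <ruby.h>
#include <ruby/thread.h>
#include <unistd.h>

/* Hypothetical stand-in for a blocking call into a native library
 * (e.g. waiting on an external lock that may never be released). */
static void *
blocking_call(void *arg)
{
    (void)arg;
    pause();            /* blocks indefinitely, like the contended V8 lock */
    return NULL;
}

/* The pattern the thread 14 backtrace suggests: block while still holding
 * the GVL. Every other Ruby thread stalls until this call returns. */
static VALUE
lock_holding_gvl(VALUE self)
{
    blocking_call(NULL);
    return Qnil;
}

/* The pattern in thread 9 (pg; note ubf=0xffffffffffffffff, i.e.
 * RUBY_UBF_IO, in its backtrace): release the GVL around the blocking
 * call so the rest of the VM keeps running while this thread waits. */
static VALUE
lock_releasing_gvl(VALUE self)
{
    rb_thread_call_without_gvl(blocking_call, NULL, RUBY_UBF_IO, NULL);
    return Qnil;
}

/* Illustrative extension entry point exposing both variants. */
void
Init_gvl_sketch(void)
{
    VALUE mod = rb_define_module("GvlSketch");
    rb_define_module_function(mod, "lock_holding_gvl",   lock_holding_gvl,   0);
    rb_define_module_function(mod, "lock_releasing_gvl", lock_releasing_gvl, 0);
}
```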
Mike
Disclaimer: I have never worked with Sidekiq.
Mike P. wrote in post #1170793:
I have an MRI 2.1.3 process in production which is completely
unresponsive to signals and appears to be dead. Debugging it seems to
indicate that all threads are blocked on something, but I can't tell
what.
If there is no work in any queue and nobody is currently trying to add
work to a queue, that state (all threads blocked) would be expected.
Here’s the GitHub issue with detailed info. Any tips on how to
determine which threads are deadlocked and/or what lock/mutex they are
waiting on?
It may be more interesting to capture the strace or debugger output
while an attempt is made to add another work item.
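One way to narrow that down, as a sketch: dump every thread's native
backtrace in one pass, and for a thread parked in pthread_mutex_lock try
to recover the owner from glibc's mutex structure (this assumes the
mutex address can be found, e.g. from frame arguments with debug info,
or from strace's futex() address for a plain mutex, whose futex word is
the first field of pthread_mutex_t):
```
# Dump all native backtraces non-interactively:
sudo gdb -p <PID> -batch -ex 'set pagination off' -ex 'thread apply all bt' > backtraces.txt

# With a contended mutex's address in hand (placeholder below), the LWP of
# the thread currently holding it is stored in the glibc mutex structure:
#   (gdb) print ((pthread_mutex_t *) 0x<mutex-address>)->__data.__owner
# Match that LWP against the LWP column of "info threads" to see which
# thread everyone else is waiting on.
```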
I notice method names like “rr::Locker::setupLockAndCall”; apparently
this is from therubyracer. Maybe there are some unhealthy interactions
between it and Sidekiq.
Additional hint: with thread or process pools of limited size, a subtle
form of deadlock can be created that is not caused by flawed application
logic. I first encountered it with Oracle Shared Server (see [1] for
details), which creates a limited-size pool of worker processes. Here is
a deadlock you can create by switching from Dedicated Server (the
default) to Shared Server: connection A starts a transaction and locks a
resource X. Other connections also start transactions and want to do
something that needs X unlocked; these transactions are executed and
block. Now, if there are at least as many such tasks as the pool limit,
the system will starve, because A will never get a chance to complete
its transaction. This will not happen with Dedicated Server (or with
unlimited thread / process pools), as A will be able to complete. (A
minimal sketch of this pattern in code follows the reference below.)
[1]
http://docs.oracle.com/cd/B28359_01/server.111/b28310/manproc001.htm#ADMIN11166
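For illustration, here is a minimal self-contained sketch of that
starvation pattern (hypothetical names; a plain pthread pool stands in
for the server's worker pool): every worker ends up waiting on the
resource, and the one job that would release it never gets a worker.
```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define POOL_SIZE 2                 /* fixed-size worker pool              */
#define NUM_JOBS  (POOL_SIZE + 1)

static sem_t resource_x;            /* 0 = currently locked by "A"         */
static sem_t jobs_ready;            /* counts queued jobs                  */
typedef void (*job_fn)(void);
static job_fn job_queue[NUM_JOBS];
static int next_job;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void needs_x(void)    { sem_wait(&resource_x); } /* blocks until X is released    */
static void releases_x(void) { sem_post(&resource_x); } /* "A" completing its transaction */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&jobs_ready);
        pthread_mutex_lock(&queue_lock);
        job_fn job = job_queue[next_job++];
        pthread_mutex_unlock(&queue_lock);
        job();
    }
    return NULL;
}

int main(void)
{
    sem_init(&resource_x, 0, 0);
    sem_init(&jobs_ready, 0, 0);

    /* Enough jobs needing X to occupy the whole pool ...          */
    for (int i = 0; i < POOL_SIZE; i++) job_queue[i] = needs_x;
    /* ... and the job that would release X queued behind them.    */
    job_queue[POOL_SIZE] = releases_x;

    pthread_t pool[POOL_SIZE];
    for (int i = 0; i < POOL_SIZE; i++)
        pthread_create(&pool[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_JOBS; i++)
        sem_post(&jobs_ready);

    sleep(1);   /* give the pool a moment to wedge itself */
    printf("all %d workers are blocked on X; the releasing job never runs\n",
           POOL_SIZE);
    return 0;
}
```
With an unlimited pool (or even one extra worker) the releasing job runs
and everything completes; the starvation exists only because the pool is
already full of tasks waiting on X.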