Issue #7693 has been reported by apoikos (Apollon Oikonomopoulos). ---------------------------------------- Bug #7693: Thread deadlock (gvl.wait_yield is not (re-)initialized) https://bugs.ruby-lang.org/issues/7693 Author: apoikos (Apollon Oikonomopoulos) Status: Open Priority: Normal Assignee: Category: core Target version: ruby -v: ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux] We are running ruby 1.9.3p194 on Debian and are experiencing a deadlock in the following scenario: We have a bunch of resque[1] workers equiped with the New Relic agent[2]. The New Relic agent runs on a separate thread and simply collects process metrics from /proc/self and submits them over the network. The workers fork() a child for every job they process. Sporadically, one or two of the forked children will hang, all of them in the following state: (gdb) info threads 3 Thread 32111 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 2 Thread 32594 0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82 * 1 Thread 32126 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211 Thread 1 is the New Relic agent thread, which runs as expected (i.e. wakes up every minute and does work). The weird thing is the state of thread 3 (main thread) in this case: (gdb) thread apply 3 bt Thread 3 (Thread 32111): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 #1 0x00007feac8af11a9 in native_cond_wait (cond=0x108cfd4, mutex=0x80) at thread_pthread.c:305 #2 0x00007feac8af47f9 in gvl_yield (limits_us=<value optimized out>) at thread_pthread.c:109 #3 rb_thread_schedule_limits (limits_us=<value optimized out>) at thread.c:1025 #4 0x00007feac8af48d0 in rb_threadptr_execute_interrupts_common (th=0x108d520) at thread.c:1327 #5 0x00007feac8ae63ed in vm_exec_core (th=0x108d520, initial=128) at insns.def:1184 #6 0x00007feac8ae7029 in vm_exec (th=<value optimized out>) at vm.c:1220 #7 0x00007feac8ae8f03 in invoke_block_from_c (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:624 #8 rb_vm_invoke_proc (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:670 #9 0x00007feac89fdd1a in proc_call (argc=1, argv=0x27e3d70, procval=<value optimized out>) at proc.c:562 #10 0x00007feac8ae9433 in vm_call0 (th=0x108d520, recv=<value optimized out>, id=<value optimized out>, argc=1, argv=0x27e3d70, me=0x1191f60) at vm_eval.c:79 #11 0x00007feac8aea2d3 in rb_eval_cmd (cmd=<value optimized out>, arg=<value optimized out>, level=<value optimized out>) at vm_eval.c:1196 #12 0x00007feac8af48e2 in rb_threadptr_execute_interrupts_common (th=0x108d520) at thread.c:1290 #13 0x00007feac8af4c1c in rb_threadptr_execute_interrupts (th=0x108d520, func=0x7feac8af5980 <ubf_select>, arg=0x108d520, old=0x7fffb5541f28) at thread.c:1335 #14 set_unblock_function (th=0x108d520, func=0x7feac8af5980 <ubf_select>, arg=0x108d520, old=0x7fffb5541f28) at thread.c:269 #15 0x00007feac8af7e84 in rb_thread_blocking_region (func=0x7feac8a5d070 <rb_waitpid_blocking>, data1=0x7fffb5541f80, ubf=0x80, data2=0x1) at thread.c:1129 #16 0x00007feac8a5d174 in rb_waitpid (pid=32593, st=0x7fffb5541fdc, flags=0) at process.c:655 #17 0x00007feac8a5d3c4 in rb_syswait (pid=32593) at process.c:3018 #18 0x00007feac8a1b365 in rb_io_close (io=<value optimized out>) at io.c:3838 #19 0x00007feac8a1f0b1 in rb_f_backquote (obj=<value optimized out>, str=41828360) at io.c:7519 ... Frame #2 reveals that it is stuck waiting in gvl_yield, in the following native_cond_wait (thread_pthread.c:109): /* An another thread is processing GVL yield. */ if (UNLIKELY(vm->gvl.wait_yield)) { while (vm->gvl.wait_yield) native_cond_wait(&vm->gvl.switch_wait_cond, &vm->gvl.lock); goto acquire; } Now, by reading the code of gvl_yield(), this would normally only be possible if we had another thread in native_cond_wait(), which is *not* the case. For completeness, here are the backtraces of the other threads: Thread 1 (New Relic agent) #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211 #1 0x00007feac8af2f8e in native_cond_timedwait (cond=0xca1a580, mutex=0xca1a5e8, ts=<value optimized out>) at thread_pthread.c:323 #2 0x00007feac8af45e0 in native_sleep (th=0xca1a4f0, timeout_tv=0x7feabdfe4900) at thread_pthread.c:908 #3 0x00007feac8af4a82 in sleep_timeval (th=0xca1a4f0, tv=...) at thread.c:895 #4 0x00007feac8a5a842 in rb_f_sleep (argc=1, argv=0x7feabde64108) at process.c:3472 #5 0x00007feac8adfc3b in vm_call_cfunc (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=<value optimized out>, flag=<value optimized out>, id=140646186633392, me=0x1185560, recv=72328920) at vm_insnhelper.c:404 #6 vm_call_method (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=<value optimized out>, flag=<value optimized out>, id=140646186633392, me=0x1185560, recv=72328920) at vm_insnhelper.c:534 #7 0x00007feac8ae2462 in vm_exec_core (th=0xca1a4f0, initial=<value optimized out>) at insns.def:1015 #8 0x00007feac8ae7029 in vm_exec (th=<value optimized out>) at vm.c:1220 #9 0x00007feac8ae8f03 in invoke_block_from_c (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:624 #10 rb_vm_invoke_proc (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:670 #11 0x00007feac8af6898 in thread_start_func_2 (th=<value optimized out>, stack_start=0x7feabdfe6000) at thread.c:453 #12 0x00007feac8af68f0 in thread_start_func_1 (th_ptr=0xca1a4f0) at thread_pthread.c:651 #13 0x00007feac87878ca in start_thread (arg=<value optimized out>) at pthread_create.c:300 #14 0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #15 0x0000000000000000 in ?? () Thread 2 (Timer): #0 0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82 #1 0x00007feac8af3c92 in thread_timer (p=<value optimized out>) at thread_pthread.c:1156 #2 0x00007feac87878ca in start_thread (arg=<value optimized out>) at pthread_create.c:300 #3 0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #4 0x0000000000000000 in ?? () We also see the following GVL state: (gdb) frame #3 rb_thread_schedule_limits (limits_us=<value optimized out>) at thread.c:1025 (gdb) p th->vm->gvl.wait_yield $1 = 1 (gdb) p th->vm->gvl.need_yield $2 = 0 (gdb) p th->vm->gvl.waiting $3 = 0 Searching for gvl.wait_yield, we find out that it is *never* initialized in gvl_init(), along with the other GVL flags. Since it is only fork()ed children from multi-threaded parents that sporadically deadlock, and since gvl_init() is also called at fork time, we suspect that the child ends up with an inconsistent GVL state when a fork() happens while a thread is waiting in gvl_yield() (I'm not sure if this is possible though). The child then inherits a gvl.wait_yield = 1 which is not reset and leads to a deadlock the first time a gvl_yield() is processed in the new process. Setting wait_yield to 0 in gvl_init() (see attached patch) seems to fix the issue for us, I'm not sure it's the proper fix though. Regards, Apollon P.S.: The same issue should affect trunk as well, judging from the code. [1] https://github.com/defunkt/resque [2] https://github.com/newrelic/rpm
on 2013-01-14 10:05
on 2013-01-15 01:23
Issue #7693 has been updated by kosaki (Motohiro KOSAKI). Status changed from Open to Assigned Assignee set to kosaki (Motohiro KOSAKI) ---------------------------------------- Bug #7693: Thread deadlock (gvl.wait_yield is not (re-)initialized) https://bugs.ruby-lang.org/issues/7693#change-35414 Author: apoikos (Apollon Oikonomopoulos) Status: Assigned Priority: Normal Assignee: kosaki (Motohiro KOSAKI) Category: core Target version: ruby -v: ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux] We are running ruby 1.9.3p194 on Debian and are experiencing a deadlock in the following scenario: We have a bunch of resque[1] workers equiped with the New Relic agent[2]. The New Relic agent runs on a separate thread and simply collects process metrics from /proc/self and submits them over the network. The workers fork() a child for every job they process. Sporadically, one or two of the forked children will hang, all of them in the following state: (gdb) info threads 3 Thread 32111 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 2 Thread 32594 0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82 * 1 Thread 32126 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211 Thread 1 is the New Relic agent thread, which runs as expected (i.e. wakes up every minute and does work). The weird thing is the state of thread 3 (main thread) in this case: (gdb) thread apply 3 bt Thread 3 (Thread 32111): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 #1 0x00007feac8af11a9 in native_cond_wait (cond=0x108cfd4, mutex=0x80) at thread_pthread.c:305 #2 0x00007feac8af47f9 in gvl_yield (limits_us=<value optimized out>) at thread_pthread.c:109 #3 rb_thread_schedule_limits (limits_us=<value optimized out>) at thread.c:1025 #4 0x00007feac8af48d0 in rb_threadptr_execute_interrupts_common (th=0x108d520) at thread.c:1327 #5 0x00007feac8ae63ed in vm_exec_core (th=0x108d520, initial=128) at insns.def:1184 #6 0x00007feac8ae7029 in vm_exec (th=<value optimized out>) at vm.c:1220 #7 0x00007feac8ae8f03 in invoke_block_from_c (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:624 #8 rb_vm_invoke_proc (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:670 #9 0x00007feac89fdd1a in proc_call (argc=1, argv=0x27e3d70, procval=<value optimized out>) at proc.c:562 #10 0x00007feac8ae9433 in vm_call0 (th=0x108d520, recv=<value optimized out>, id=<value optimized out>, argc=1, argv=0x27e3d70, me=0x1191f60) at vm_eval.c:79 #11 0x00007feac8aea2d3 in rb_eval_cmd (cmd=<value optimized out>, arg=<value optimized out>, level=<value optimized out>) at vm_eval.c:1196 #12 0x00007feac8af48e2 in rb_threadptr_execute_interrupts_common (th=0x108d520) at thread.c:1290 #13 0x00007feac8af4c1c in rb_threadptr_execute_interrupts (th=0x108d520, func=0x7feac8af5980 <ubf_select>, arg=0x108d520, old=0x7fffb5541f28) at thread.c:1335 #14 set_unblock_function (th=0x108d520, func=0x7feac8af5980 <ubf_select>, arg=0x108d520, old=0x7fffb5541f28) at thread.c:269 #15 0x00007feac8af7e84 in rb_thread_blocking_region (func=0x7feac8a5d070 <rb_waitpid_blocking>, data1=0x7fffb5541f80, ubf=0x80, data2=0x1) at thread.c:1129 #16 0x00007feac8a5d174 in rb_waitpid (pid=32593, st=0x7fffb5541fdc, flags=0) at process.c:655 #17 0x00007feac8a5d3c4 in rb_syswait (pid=32593) at process.c:3018 #18 0x00007feac8a1b365 in rb_io_close (io=<value optimized out>) at io.c:3838 #19 0x00007feac8a1f0b1 in rb_f_backquote (obj=<value optimized out>, str=41828360) at io.c:7519 ... Frame #2 reveals that it is stuck waiting in gvl_yield, in the following native_cond_wait (thread_pthread.c:109): /* An another thread is processing GVL yield. */ if (UNLIKELY(vm->gvl.wait_yield)) { while (vm->gvl.wait_yield) native_cond_wait(&vm->gvl.switch_wait_cond, &vm->gvl.lock); goto acquire; } Now, by reading the code of gvl_yield(), this would normally only be possible if we had another thread in native_cond_wait(), which is *not* the case. For completeness, here are the backtraces of the other threads: Thread 1 (New Relic agent) #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211 #1 0x00007feac8af2f8e in native_cond_timedwait (cond=0xca1a580, mutex=0xca1a5e8, ts=<value optimized out>) at thread_pthread.c:323 #2 0x00007feac8af45e0 in native_sleep (th=0xca1a4f0, timeout_tv=0x7feabdfe4900) at thread_pthread.c:908 #3 0x00007feac8af4a82 in sleep_timeval (th=0xca1a4f0, tv=...) at thread.c:895 #4 0x00007feac8a5a842 in rb_f_sleep (argc=1, argv=0x7feabde64108) at process.c:3472 #5 0x00007feac8adfc3b in vm_call_cfunc (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=<value optimized out>, flag=<value optimized out>, id=140646186633392, me=0x1185560, recv=72328920) at vm_insnhelper.c:404 #6 vm_call_method (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=<value optimized out>, flag=<value optimized out>, id=140646186633392, me=0x1185560, recv=72328920) at vm_insnhelper.c:534 #7 0x00007feac8ae2462 in vm_exec_core (th=0xca1a4f0, initial=<value optimized out>) at insns.def:1015 #8 0x00007feac8ae7029 in vm_exec (th=<value optimized out>) at vm.c:1220 #9 0x00007feac8ae8f03 in invoke_block_from_c (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:624 #10 rb_vm_invoke_proc (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:670 #11 0x00007feac8af6898 in thread_start_func_2 (th=<value optimized out>, stack_start=0x7feabdfe6000) at thread.c:453 #12 0x00007feac8af68f0 in thread_start_func_1 (th_ptr=0xca1a4f0) at thread_pthread.c:651 #13 0x00007feac87878ca in start_thread (arg=<value optimized out>) at pthread_create.c:300 #14 0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #15 0x0000000000000000 in ?? () Thread 2 (Timer): #0 0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82 #1 0x00007feac8af3c92 in thread_timer (p=<value optimized out>) at thread_pthread.c:1156 #2 0x00007feac87878ca in start_thread (arg=<value optimized out>) at pthread_create.c:300 #3 0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #4 0x0000000000000000 in ?? () We also see the following GVL state: (gdb) frame #3 rb_thread_schedule_limits (limits_us=<value optimized out>) at thread.c:1025 (gdb) p th->vm->gvl.wait_yield $1 = 1 (gdb) p th->vm->gvl.need_yield $2 = 0 (gdb) p th->vm->gvl.waiting $3 = 0 Searching for gvl.wait_yield, we find out that it is *never* initialized in gvl_init(), along with the other GVL flags. Since it is only fork()ed children from multi-threaded parents that sporadically deadlock, and since gvl_init() is also called at fork time, we suspect that the child ends up with an inconsistent GVL state when a fork() happens while a thread is waiting in gvl_yield() (I'm not sure if this is possible though). The child then inherits a gvl.wait_yield = 1 which is not reset and leads to a deadlock the first time a gvl_yield() is processed in the new process. Setting wait_yield to 0 in gvl_init() (see attached patch) seems to fix the issue for us, I'm not sure it's the proper fix though. Regards, Apollon P.S.: The same issue should affect trunk as well, judging from the code. [1] https://github.com/defunkt/resque [2] https://github.com/newrelic/rpm
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.