Deadlock in large vm_xmalloc while doing backtrace

aris · June 5, 2012, 8:52pm

I work on a fairly large Ruby application that runs as a daemon,
accepting requests for test jobs and distributing them. It runs on Ruby
1.9.1-p429 currently on Linux Redhat 5. The job scheduler (“foreman”)
occasionally locks up in one site or another with a stack trace for the
main thread that looks like the following (many levels of stack elided):

Thread 4 (process 5488):
#0 0x00d4d410 in __kernel_vsyscall ()
#1 0x00d1a0b9 in __lll_lock_wait () from /lib/libpthread.so.0
#2 0x00d15834 in _L_lock_92 () from /lib/libpthread.so.0
#3 0x00d1534a in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x0811d82e in native_mutex_lock (lock=0xfffffffc) at
thread_pthread.c:36
#5 0x081226a6 in rb_thread_call_with_gvl (func=0x8062b40 <gc_with_gvl>,
data1=0x81954e0) at thread.c:984
#6 0x0806282e in garbage_collect_with_gvl (objspace=0x81954e0) at
gc.c:598
#7 0x080628a6 in vm_xmalloc (objspace=0x81954e0, size=160796684) at
gc.c:624
#8 0x080de09e in rb_str_buf_new (capa=128) at string.c:722
#9 0x080d29bb in rb_enc_vsprintf (enc=0x0, fmt=0x81552ac “%s:%d:in
%s'", ap=0xbf8f95e4 "X!?\t\"") at sprintf.c:1125 #10 0x080d2a50 in rb_vsprintf (fmt=0x81552ac "%s:%d:in%s’”,
ap=0xbf8f95e4 “X!?\t”") at sprintf.c:1154
#11 0x080d2a7b in rb_sprintf (format=0x81552ac “%s:%d:in `%s’”) at
sprintf.c:1164
#12 0x0810ef01 in vm_backtrace_each (th=,
limit_cfp=0xb7fbea98, cfp=0xb7fbed98, file=0x9a42158
“/local/pete/current/com.rb”, line_no=34, ary=174937756) at vm.c:714
#13 0x0810fd32 in rb_f_caller (argc=0, argv=0xb7f3f1f8) at vm.c:746
#14 0x0810dea6 in call_cfunc (func=0x810fca0 <rb_f_caller>,
recv=165378108, len=128, argc=-4, argv=0xb7f3f1f8) at
vm_insnhelper.c:286

The foreman’s apparently trying to lock the Global Vm Lock to do a
malloc().
Several questions:

How to tell which thread owns the global_vm_lock? In gdb, printing
th->blocking_region_buffer for the main thread returns a pointer to
something, while another thread returns a zero value for
blocking_region_buffer. This suggests that the other thread owns the
global_vm_lock. On the other hand, if I dump the actual global_vm_lock,
the __data.__owner is 5488, the thread id for the main thread. I’m
thinking that this thread is trying to lock the GVL when it already has
it, but I’m not sure.
in frame 8, the rb_str_buf_new should be requesting 129 bytes, yet
the size parameter in frame 7 is more than 160 Megabytes! This huge
allocation appears to be what triggers garbage collection with gvl. How
this happens is a complete mystery.
within gdb, is it possible to figure out what the content of the
backtrace would be if it didn’t hang? I know that ruby’s backtrace stack
is somewhat complex - any pointers to simple-ish ways to find out where
the ruby interpreter is?

Thanks,
Bill P.

bplummer · June 6, 2012, 12:12am

Bill P. писал 05.06.2012 22:52:

I work on a fairly large Ruby application that runs as a daemon,
accepting requests for test jobs and distributing them. It runs on
Ruby
1.9.1-p429 currently on Linux Redhat 5. The job scheduler (“foreman”)
occasionally locks up in one site or another with a stack trace for
the
main thread that looks like the following (many levels of stack
elided):

1.9.1 is something of a danger sign for me: it never worked good enough
for
production, at least for me.

data1=0x81954e0) at thread.c:984
sprintf.c:1164
Several questions:
2) in frame 8, the rb_str_buf_new should be requesting 129 bytes, yet
the size parameter in frame 7 is more than 160 Megabytes! This huge
allocation appears to be what triggers garbage collection with gvl.
How
this happens is a complete mystery.
3) within gdb, is it possible to figure out what the content of the
backtrace would be if it didn’t hang? I know that ruby’s backtrace
stack
is somewhat complex - any pointers to simple-ish ways to find out
where
the ruby interpreter is?

Take a look at ruby’s gdbinit:
http://rxr.whitequark.org/mri/source/.gdbinit

For example, try rb_backtrace or rb_ps at gdb’s prompt.

bplummer · June 6, 2012, 3:43pm

Bill P. писал 05.06.2012 22:52:

I work on a fairly large Ruby application that runs as a daemon,
accepting requests for test jobs and distributing them. It runs on Ruby
1.9.1-p429 currently on Linux Redhat 5. The job scheduler (“foreman”)
occasionally locks up in one site or another with a stack trace for the
main thread that looks like the following (many levels of stack elided):

1.9.1 is something of a danger sign for me: it never worked good enough for
production, at least for me.
I’ll look into changing to a later release.

data1=0x81954e0) at thread.c:984
sprintf.c:1164
Several questions:
allocation appears to be what triggers garbage collection with gvl. How

The ruby we’re running doesn’t seem to have debug.c compiled into it.
The .gdbinit file fails with a message
No symbol “ruby_dummy_gdb_enums” in current context.
Do you know what compile flags cause debug.c to be included?
Thanks
Bill P.

bplummer · June 7, 2012, 8:42pm

On Jun 5, 2012, at 5:11 PM, Peter Z. [email protected]
wrote:

Bill P. писал 05.06.2012 22:52:

I work on a fairly large Ruby application that runs as a daemon,
accepting requests for test jobs and distributing them. It runs on Ruby
1.9.1-p429 currently on Linux Redhat 5. The job scheduler (“foreman”)
occasionally locks up in one site or another with a stack trace for the
main thread that looks like the following (many levels of stack

While not completely certain, this bug appears to be due to using the
‘caller’ method inside an object finalizer. I’ve written a bug report on
a simpler program (Bug #6558: Crash in garbage collection - using caller inside finalizer method - Ruby master - Ruby Issue Tracking System) which I think
is related. Removing ‘caller()’ from the finalizer in our real code
seems to have stopped the problems (knocks on wood and crosses fingers.)

Bill P.

bplummer · June 6, 2012, 8:35pm

William Paulson писал 06.06.2012 17:42:

Bill P. писал 05.06.2012 22:52:

Bill P.
I compile my Rubies with rvm1, and it inserts debug flags
automatically.
I’d say you should look into options of ./configure script: something
like
–with-debug=yes should make the trick.

bplummer · June 8, 2012, 2:04am

2012/6/5 Bill P. [email protected]:

How to tell which thread owns the global_vm_lock?

In some 1.9.X versions of Ruby there was a public funtion in thread.c:

int rb_has_gvl_p(void)

or maybe:

int has_gvl_p(void)

Unfortunately it’s not present anymore in 1.9.3p0 (but it was in some
1.9.2pXXX version, sure, I used it).