I work on a fairly large Ruby application that runs as a daemon,
accepting requests for test jobs and distributing them. It runs on Ruby
1.9.1-p429 currently on Linux Redhat 5. The job scheduler ("foreman")
occasionally locks up in one site or another with a stack trace for the
main thread that looks like the following (many levels of stack elided):

Thread 4 (process 5488):
#0  0x00d4d410 in __kernel_vsyscall ()
#1  0x00d1a0b9 in __lll_lock_wait () from /lib/libpthread.so.0
#2  0x00d15834 in _L_lock_92 () from /lib/libpthread.so.0
#3  0x00d1534a in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x0811d82e in native_mutex_lock (lock=0xfffffffc) at
thread_pthread.c:36
#5  0x081226a6 in rb_thread_call_with_gvl (func=0x8062b40 <gc_with_gvl>,
data1=0x81954e0) at thread.c:984
#6  0x0806282e in garbage_collect_with_gvl (objspace=0x81954e0) at
gc.c:598
#7  0x080628a6 in vm_xmalloc (objspace=0x81954e0, size=160796684) at
gc.c:624
#8  0x080de09e in rb_str_buf_new (capa=128) at string.c:722
#9  0x080d29bb in rb_enc_vsprintf (enc=0x0, fmt=0x81552ac "%s:%d:in
`%s'", ap=0xbf8f95e4 "X!?\t\"") at sprintf.c:1125
#10 0x080d2a50 in rb_vsprintf (fmt=0x81552ac "%s:%d:in `%s'",
ap=0xbf8f95e4 "X!?\t\"") at sprintf.c:1154
#11 0x080d2a7b in rb_sprintf (format=0x81552ac "%s:%d:in `%s'") at
sprintf.c:1164
#12 0x0810ef01 in vm_backtrace_each (th=<value optimized out>,
limit_cfp=0xb7fbea98, cfp=0xb7fbed98, file=0x9a42158
"/local/pete/current/com.rb", line_no=34, ary=174937756) at vm.c:714
#13 0x0810fd32 in rb_f_caller (argc=0, argv=0xb7f3f1f8) at vm.c:746
#14 0x0810dea6 in call_cfunc (func=0x810fca0 <rb_f_caller>,
recv=165378108, len=128, argc=-4, argv=0xb7f3f1f8) at
vm_insnhelper.c:286

The foreman's apparently trying to lock the Global Vm Lock to do a
malloc().
Several questions:
1) How to tell which thread owns the global_vm_lock? In gdb, printing
th->blocking_region_buffer for the main thread returns a pointer to
something, while another thread returns a zero value for
blocking_region_buffer. This suggests that the other thread owns the
global_vm_lock. On the other hand, if I dump the actual global_vm_lock,
the __data.__owner is 5488, the thread id for the main thread. I'm
thinking that this thread is trying to lock the GVL when it already has
it, but I'm not sure.
2) in frame 8, the rb_str_buf_new should be requesting 129 bytes, yet
the size parameter in frame 7 is more than 160 Megabytes! This huge
allocation appears to be what triggers garbage collection with gvl. How
this happens is a complete mystery.
3) within gdb, is it possible to figure out what the content of the
backtrace would be if it didn't hang? I know that ruby's backtrace stack
is somewhat complex - any pointers to simple-ish ways to find out where
the ruby interpreter is?

Thanks,
Bill Paulson

-- 
Posted via http://www.ruby-forum.com/.