Issue #5811 has been updated by samg (Sam Goldstein).


This does appear to be fixed in Ruby 2.0.0-p0.  I'm no longer able to get my reproduction case to hang.  (It doesn't hang every time in 1.9.3, but it only takes a try or two to trigger the deadlock.)

I'm curious whether there's a suggested workaround for this in Ruby 1.9.3.  I develop a library (newrelic), so I can't control the Ruby version it runs under.  Generally it's okay if the backticked command fails (since we can recover from that), but deadlocking the whole process is obviously problematic.

Thanks!
----------------------------------------
Bug #5811: Ruby Process Deadlocks With Fork on Mac OS X Lion
https://bugs.ruby-lang.org/issues/5811#change-38038

Author: netshade (Chris Zelenak)
Status: Closed
Priority: Normal
Assignee: akr (Akira Tanaka)
Category: 
Target version: 
ruby -v: ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-darwin11.2.0]


=begin
Given a Ruby process that acts like the following:

* Spawn new thread that initializes a TCPSocket
* Execute script using backticks in main thread

there is a chance that it will deadlock on Lion.  The GDB traces for the threads show:

* The TCP connecting thread stuck on native_cond_wait/thread_pthread.c:321 by way of rsock_getaddrinfo/raddrinfo.c:359
* The main thread stuck on read() by way of rb_f_backquote/io.c:7266

Meanwhile, in the forked process from rb_f_backquote: 

* The main thread is stuck at (longer trace):
 #0  0x00007fff9160c6b6 in semaphore_wait_trap ()
 #1  0x00007fff8fc03bc2 in _dispatch_thread_semaphore_wait ()
 #2  0x00007fff8fc04286 in dispatch_once_f ()
 #3  0x00007fff95e12f20 in si_module_static_search ()
 #4  0x00007fff95e16a3d in si_module_with_name ()
 #5  0x00007fff95e0eac8 in getpwuid ()
 #6  0x00007fff90daa842 in getgroups$DARWIN_EXTSN ()
 #7  0x000000010b82b020 in rb_group_member (gid=0) at file.c:1002
 #8  0x000000010b82b10f in eaccess (path=0x7fff6b3d3570 "/bin/hostname", mode=1) at file.c:1052
 ...

The documentation for getpwuid in Mac OS X Lion states that getpwuid is now thread-safe, much like getpwuid_r; however, the values returned by getpwuid are thread-local and disposed of automatically, whereas getpwuid_r places its results in storage supplied by the caller.  Disassembling semaphore_wait_trap and __psynch_cvwait shows that both make syscalls (I don't know how to go much further here), but GDB also shows no arguments for either function, so I can't see what they are waiting on.  I believe that the POSIX condition wait and the semaphore_wait are in fact syscalls waiting on a condition variable with the same value, and that the value is the same because of the memory state shared across the fork.
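
To illustrate the fork side of this, here is a minimal sketch (not taken from the attached script) that relies on the documented behaviour of Process.fork: only the calling thread exists in the child. The helper name live_threads_in_forked_child is hypothetical. That this is the mechanism behind the hang above is an assumption; the sketch only shows that a second thread running in the parent is absent from the child, so any native lock such a thread held at fork time would have no owner left in the child to release it.

```ruby
# Hypothetical helper: report how many Ruby threads the forked child sees
# while the parent still has a second thread running.
def live_threads_in_forked_child
  worker = Thread.new { sleep }          # stands in for the TCP connect thread
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    writer.write(Thread.list.size.to_s)  # threads visible inside the child
    writer.close
    exit!(0)                             # skip at_exit handlers in the child
  end

  writer.close
  count = reader.read.to_i               # read until the child closes its end
  reader.close
  Process.wait(pid)
  count
ensure
  worker.kill if worker
end

puts live_threads_in_forked_child
```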

When an artificial delay ("sleep 1") is introduced after the creation of the TCP connect thread, this deadlock no longer occurs.
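
The shape of the failing program, including the optional delay, can be sketched roughly as follows. This is not the attached test script: the helper name connect_then_shell_out and its host, port, command and delay parameters are made up for illustration, and a plain TCPSocket stands in for the Instrumental Agent gem.

```ruby
require "socket"

# Hypothetical helper: start a TCP connect on a background thread, then run a
# command with backticks from the main thread.
def connect_then_shell_out(host, port, command, delay = 0)
  connector = Thread.new do
    begin
      TCPSocket.new(host, port).close  # host names are resolved via getaddrinfo
    rescue SystemCallError, SocketError
      # A failed connect is fine; only the overlap with the fork matters.
    end
  end

  sleep(delay) if delay > 0  # delay = 1 corresponds to the "sleep 1" above
  output = `#{command}`      # forks a child; the step reported to hang on 1.9.3
  connector.join
  output
end
```

Calling connect_then_shell_out("localhost", 80, "hostname") corresponds to the case that can deadlock, and passing 1 as the fourth argument corresponds to the variant with the artificial delay.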

Attached is a test script that uses the Instrumental Agent gem for the TCP connect and can reliably cause the deadlock under 1.9.3.
=end



-- 
http://bugs.ruby-lang.org/