Issue #14181 has been updated by nobu (Nobuyoshi Nakada).


It seems the signal trap causes a thread switch, and then `Process.waitpid` exits.
That means the target thread's status can change during `RUBY_VM_CHECK_INTS_BLOCKING`, but `sleep_forever` does not re-check at that point whether it still needs to wait.

```diff
diff --git a/thread.c b/thread.c
index baa50ea388..cc62ea3905 100644
--- a/thread.c
+++ b/thread.c
@@ -883,7 +883,13 @@ thread_join_sleep(VALUE arg)
 
     while (target_th->status != THREAD_KILLED) {
 	if (forever) {
-	    sleep_forever(th, TRUE, FALSE);
+	    th->status = THREAD_STOPPED_FOREVER;
+	    th->vm->sleeper++;
+	    rb_check_deadlock(th->vm);
+	    native_sleep(th, 0);
+	    th->vm->sleeper--;
+	    RUBY_VM_CHECK_INTS_BLOCKING(th->ec);
+	    th->status = THREAD_RUNNABLE;
 	}
 	else {
 	    double now = timeofday();
```
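
To illustrate the general idea in plain Ruby (this is only a toy model, not what `thread.c` does): a waiter should re-check the condition it is waiting for after every wake-up, rather than going back to sleep unconditionally. `ToyTarget`, `finish!`, and the 0.05 second wait are hypothetical names and values chosen for the sketch.

```ruby
# Toy model of a "join" that re-checks its wait condition after each wake-up.
# Illustration only; none of these names exist in Ruby's implementation.
class ToyTarget
  def initialize
    @mutex = Mutex.new
    @cond = ConditionVariable.new
    @finished = false
  end

  # Mark the target as finished and wake every waiter.
  def finish!
    @mutex.synchronize do
      @finished = true
      @cond.broadcast
    end
  end

  # Block until the target is finished. The predicate is evaluated before
  # every sleep, so a state change that happened while the waiter was busy
  # elsewhere is noticed on the next pass instead of being slept through.
  # Returns the number of wake-ups it took.
  def join(max_wait = 0.05)
    wakeups = 0
    @mutex.synchronize do
      until @finished
        @cond.wait(@mutex, max_wait)
        wakeups += 1
      end
    end
    wakeups
  end
end

target = ToyTarget.new
worker = Thread.new { sleep 0.1; target.finish! }
wakeups = target.join
worker.join
puts "joined after #{wakeups} wake-up(s)"
```

In the patch above, the corresponding check is the outer `while (target_th->status != THREAD_KILLED)` loop, which is now reached after every single `native_sleep`.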

----------------------------------------
Bug #14181: hangs or deadlocks from waitpid, threads, and trapping SIGCHLD
https://bugs.ruby-lang.org/issues/14181#change-68431

* Author: ccutrer (Cody Cutrer)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: ruby 2.4.2p198 (2017-09-14 revision 59899) [x86_64-linux-gnu]
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN
----------------------------------------
I'm not exactly sure what's going on here, but the end result is that a thread gets killed unexpectedly during a waitpid call while SIGCHLD is being handled. In a more complex scenario, we end up hanging because Thread#join ends up waiting on a thread that's already dead (presumably because it died in a non-standard way). In a simpler scenario, the output is:

```
loop 250
loop 251
/usr/lib/ruby/2.4.0/timeout.rb:97:in `join': No live threads left. Deadlock? (fatal)
1 threads, 1 sleeps current:0x00000000019205e0 main thread:0x00000000019205e0
* #<Thread:0x0000000001955e38 sleep_forever>
   rb_thread_t:0x00000000019205e0 native:0x00007f900a082700 int:0
   /usr/lib/ruby/2.4.0/timeout.rb:97:in `join'
   /usr/lib/ruby/2.4.0/timeout.rb:97:in `ensure in block in timeout'
   /usr/lib/ruby/2.4.0/timeout.rb:97:in `block in timeout'
   /usr/lib/ruby/2.4.0/timeout.rb:33:in `block in catch'
   /usr/lib/ruby/2.4.0/timeout.rb:33:in `catch'
   /usr/lib/ruby/2.4.0/timeout.rb:33:in `catch'
   /usr/lib/ruby/2.4.0/timeout.rb:108:in `timeout'
   ./test.rb:11:in `<main>'
	from /usr/lib/ruby/2.4.0/timeout.rb:97:in `ensure in block in timeout'
	from /usr/lib/ruby/2.4.0/timeout.rb:97:in `block in timeout'
	from /usr/lib/ruby/2.4.0/timeout.rb:33:in `block in catch'
	from /usr/lib/ruby/2.4.0/timeout.rb:33:in `catch'
	from /usr/lib/ruby/2.4.0/timeout.rb:33:in `catch'
	from /usr/lib/ruby/2.4.0/timeout.rb:108:in `timeout'
	from ./test.rb:11:in `<main>'
```
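
For context on where that `join` comes from: as far as I can tell from reading timeout.rb in 2.4, `Timeout.timeout` starts a watcher thread that sleeps and then raises in the caller, and in its `ensure` it kills the watcher and joins it. A simplified model of that pattern, with hypothetical names (`toy_timeout`, `ToyTimeoutError`) and without the real library's error handling:

```ruby
# Simplified model of a watcher-thread timeout. NOT the real timeout.rb;
# toy_timeout and ToyTimeoutError are made-up names for this sketch.
class ToyTimeoutError < RuntimeError; end

def toy_timeout(sec)
  caller_thread = Thread.current
  # The watcher sleeps for the timeout and then interrupts the caller.
  watcher = Thread.new do
    sleep sec
    caller_thread.raise ToyTimeoutError, "execution expired"
  end
  yield
ensure
  if watcher
    watcher.kill
    watcher.join # the kind of join the backtrace above points at
  end
end

# The block finishes in time, so its value is returned.
result = toy_timeout(5) { :done }

# The block takes too long, so the watcher raises in the caller.
timed_out = begin
  toy_timeout(0.05) { sleep 1 }
  false
rescue ToyTimeoutError
  true
end

puts "result=#{result.inspect} timed_out=#{timed_out}"
```

So the thread being joined when the deadlock is reported would be the already-killed watcher, not the thread calling `Process.waitpid`.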

The simpler repro, where I'm obviously not doing anything I shouldn't be doing in the signal handler:

```
#!/usr/bin/env ruby

require 'timeout'

trap(:CHLD) { }

x = 0
while true
  puts "loop #{x += 1}"
  pid = Process.spawn('sleep 1')
  Timeout.timeout(30) do
    Process.waitpid(pid)
  end
end
```

A slightly more complex repro, where I'm still pretty sure that what I'm doing in the signal handler is okay, but which ends up hanging:

```
#!/usr/bin/env ruby

require 'timeout'

self_pipe = IO.pipe
signal_queue = []

def wake_up(self_pipe)
  self_pipe[1].write_nonblock('.', exception: false)
end

trap(:CHLD) { signal_queue << :CHLD; wake_up(self_pipe)  }

signal_processor = Thread.new do
  loop do
    self_pipe[0].read(1)
    signal_queue.pop
  end
end

x = 0
while true
  puts "loop #{x += 1}"
  pid = Process.spawn('sleep 1')
  Timeout.timeout(30) do
    Process.waitpid(pid)
  end
end
```

In either case, it can take many loops before it fails, up to a few hundred. I've reproduced this on both Ubuntu Xenial and macOS 10.12.6 (the former with ruby 2.4.2, the latter with ruby 2.4.1).


