Issue #14867 has been updated by k0kubun (Takashi Kokubun).

Status changed from Closed to Assigned

Recently we see 2 types of deadlocks on CI with --jit or --jit-wait.

## 1. waitpid on #system, #`, or Process.waitpid

http://ci.rvm.jp/results/trunk-mjit-wait@silicon-docker/1431775
http://ci.rvm.jp/results/trunk-mjit-wait@silicon-docker/1431394
http://ci.rvm.jp/results/trunk-mjit-wait@silicon-docker/1430875

The stack trace on deadlock is like this:

~~~
0x00007fbd6585ecf6 in __GI_ppoll (fds=fds@entry=0x7ffd30485118, nfds=nfds@entry=1, timeout=<optimized out>, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
>>> Machine level backtrace

Thread 3 (Thread 0x7fbd66c90700 (LWP 30964)):
#0  0x00007fbd667af384 in __libc_read (fd=6, buf=0x7fbd58047c70, nbytes=1024) at ../sysdeps/unix/sysv/linux/read.c:27
#1  0x000055ac93a5c433 in rb_thread_io_blocking_region (func=func@entry=0x55ac93925110 <internal_read_func>, data1=data1@entry=0x7fbd66c8f4c0, fd=<optimized out>) at /home/ko1/ruby/src/trunk-mjit-wait/thread.c:1529
#2  0x000055ac93924ebe in rb_read_internal (count=<optimized out>, buf=<optimized out>, fd=<optimized out>) at /home/ko1/ruby/src/trunk-mjit-wait/io.c:987
#3  read_internal_call (arg=arg@entry=140451449992880) at /home/ko1/ruby/src/trunk-mjit-wait/io.c:2732
#4  0x000055ac938fa0d3 in rb_ensure (b_proc=0x55ac93924e80 <read_internal_call>, data1=140451449992880, e_proc=e_proc@entry=0x55ac93a286a0 <rb_str_unlocktmp>, data2=<optimized out>) at /home/ko1/ruby/src/trunk-mjit-wait/eval.c:1052
#5  0x000055ac93a3c490 in rb_str_locktmp_ensure (str=<optimized out>, func=func@entry=0x55ac93924e80 <read_internal_call>, arg=arg@entry=140451449992880) at /home/ko1/ruby/src/trunk-mjit-wait/string.c:2654
#6  0x000055ac9393b99b in io_getpartial (argc=<optimized out>, argv=<optimized out>, io=<optimized out>, opts=opts@entry=8, nonblock=nonblock@entry=0) at /home/ko1/ruby/src/trunk-mjit-wait/io.c:2782
#7  0x000055ac9393bf51 in io_readpartial (argc=<optimized out>, argv=<optimized out>, io=<optimized out>) at /home/ko1/ruby/src/trunk-mjit-wait/io.c:2872
#8  0x000055ac93a90455 in vm_call_cfunc_with_frame (ci=0x55ac95418c40, cc=<optimized out>, calling=<optimized out>, reg_cfp=0x7fbd5f34bfa0, ec=0x55ac97772258) at /home/ko1/ruby/src/trunk-mjit-wait/vm_insnhelper.c:1914
... (snip) ...
#16 rb_vm_invoke_proc (ec=<optimized out>, proc=proc@entry=0x55ac99d5f2e0, argc=<optimized out>, argv=<optimized out>, passed_block_handler=passed_block_handler@entry=0) at /home/ko1/ruby/src/trunk-mjit-wait/vm.c:1191
#17 0x000055ac93a59615 in thread_do_start (args=94199785452720, th=0x55ac97a8d040) at /home/ko1/ruby/src/trunk-mjit-wait/thread.c:667
#18 thread_start_func_2 (th=th@entry=0x55ac97a8d040, stack_start=<optimized out>) at /home/ko1/ruby/src/trunk-mjit-wait/thread.c:706
#19 0x000055ac93a59acb in thread_start_func_1 (th_ptr=<optimized out>) at /home/ko1/ruby/src/trunk-mjit-wait/thread_pthread.c:1017
#20 0x00007fbd667a56db in start_thread (arg=0x7fbd66c90700) at pthread_create.c:463
#21 0x00007fbd6586b88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fbd65749700 (LWP 30944)):
#0  0x00007fbd667ab9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55ac93e39428 <mjit_worker_wakeup+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55ac93e394c0 <mjit_engine_mutex>, cond=0x55ac93e39400 <mjit_worker_wakeup>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x55ac93e39400 <mjit_worker_wakeup>, mutex=mutex@entry=0x55ac93e394c0 <mjit_engine_mutex>) at pthread_cond_wait.c:655
#3  0x000055ac93a566a9 in rb_native_cond_wait (cond=cond@entry=0x55ac93e39400 <mjit_worker_wakeup>, mutex=mutex@entry=0x55ac93e394c0 <mjit_engine_mutex>) at /home/ko1/ruby/src/trunk-mjit-wait/thread_pthread.c:496
#4  0x000055ac9395cc19 in copy_cache_from_main_thread (job=0x7fbd65748e60) at /home/ko1/ruby/src/trunk-mjit-wait/mjit_worker.c:1189
#5  mjit_worker () at /home/ko1/ruby/src/trunk-mjit-wait/mjit_worker.c:1247
#6  0x000055ac93a51d0a in mjit_worker (arg=0x55ac9395c770 <mjit_worker>) at /home/ko1/ruby/src/trunk-mjit-wait/thread_pthread.c:1866
#7  0x00007fbd667a56db in start_thread (arg=0x7fbd65749700) at pthread_create.c:463
#8  0x00007fbd6586b88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fbd66df3740 (LWP 30930)):
#0  0x00007fbd6585ecf6 in __GI_ppoll (fds=fds@entry=0x7ffd30485118, nfds=nfds@entry=1, timeout=<optimized out>, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x000055ac93a58f5d in native_ppoll_sleep (th=th@entry=0x55ac952bde60, rel=rel@entry=0x0) at /home/ko1/ruby/src/trunk-mjit-wait/thread_pthread.c:2015
#2  0x000055ac93a5b22a in native_sleep (th=th@entry=0x55ac952bde60, rel=0x0) at /home/ko1/ruby/src/trunk-mjit-wait/thread_pthread.c:2051
#3  0x000055ac93a5baeb in rb_thread_sleep_interruptible () at /home/ko1/ruby/src/trunk-mjit-wait/thread.c:1270
#4  0x000055ac939bb4b5 in waitpid_sleep (x=x@entry=140725413499744) at /home/ko1/ruby/src/trunk-mjit-wait/process.c:1099
#5  0x000055ac938fa0d3 in rb_ensure (b_proc=b_proc@entry=0x55ac939bb4a0 <waitpid_sleep>, data1=data1@entry=140725413499744, e_proc=e_proc@entry=0x55ac939bb450 <waitpid_cleanup>, data2=data2@entry=140725413499744) at /home/ko1/ruby/src/trunk-mjit-wait/eval.c:1052
#6  0x000055ac939c2cf9 in waitpid_wait (w=0x7ffd30485360) at /home/ko1/ruby/src/trunk-mjit-wait/process.c:1158
#7  rb_waitpid (pid=<optimized out>, st=0x7ffd304853c4, flags=<optimized out>) at /home/ko1/ruby/src/trunk-mjit-wait/process.c:1197
#8  0x000055ac939c4e6e in rb_f_system (argc=3, argv=0x7fbd66cf2338) at /home/ko1/ruby/src/trunk-mjit-wait/process.c:4371
#9  0x000055ac93a90455 in vm_call_cfunc_with_frame (ci=0x55ac98d41800, cc=<optimized out>, calling=<optimized out>, reg_cfp=0x7fbd66df1cc8, ec=0x55ac952be2e8) at /home/ko1/ruby/src/trunk-mjit-wait/vm_insnhelper.c:1914
... (snip) ...
#35 0x000055ac938f90fb in ruby_exec_node (n=<optimized out>) at /home/ko1/ruby/src/trunk-mjit-wait/eval.c:325
#36 ruby_run_node (n=<optimized out>) at /home/ko1/ruby/src/trunk-mjit-wait/eval.c:317
#37 0x000055ac938f41df in main (argc=<optimized out>, argv=<optimized out>) at /home/ko1/ruby/src/trunk-mjit-wait/main.c:42
>>> Dump Ruby level backtrace
th: 0x55ac952bde60, native_id: 0x7fbd66df3740
-- Control frame information -----------------------------------------------
c:0016 p:---- s:0107 e:000106 CFUNC  :system
c:0015 p:0024 s:0100 e:000099 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/fileutils/test_fileutils.rb:148
c:0014 p:0098 s:0095 e:000094 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/fileutils/test_fileutils.rb:167
c:0013 p:0044 s:0090 e:000089 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/minitest/unit.rb:1289
c:0012 p:0015 s:0081 e:000080 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/testcase.rb:18
c:0011 p:0082 s:0076 e:000075 BLOCK  /home/ko1/ruby/src/trunk-mjit-wait/test/lib/minitest/unit.rb:952 [FINISH]
c:0010 p:---- s:0069 e:000068 CFUNC  :map
c:0009 p:0133 s:0065 e:000064 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/minitest/unit.rb:945
c:0008 p:0045 s:0053 e:000052 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit.rb:1044
c:0007 p:0095 s:0046 E:0020b0 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:50
c:0006 p:0009 s:0030 e:000029 BLOCK  /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:25 [FINISH]
c:0005 p:---- s:0026 e:000025 CFUNC  :map
c:0004 p:0006 s:0022 e:000021 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:24
c:0003 p:0263 s:0016 E:001aa0 METHOD /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:122
c:0002 p:0128 s:0006 E:0021f0 EVAL   /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:207 [FINISH]
c:0001 p:0000 s:0003 E:0021f0 (none) [FINISH]

th: 0x55ac97a8d040, native_id: 0x7fbd66c90700
-- Control frame information -----------------------------------------------
c:0003 p:---- s:0012 e:000011 CFUNC  :readpartial
c:0002 p:0036 s:0007 e:000006 BLOCK  /home/ko1/ruby/src/trunk-mjit-wait/test/lib/test/unit/parallel.rb:39 [FINISH]
c:0001 p:---- s:0003 e:000002 (none) [FINISH]
~~~

In this case, 3 threads are blocking in:

1. `rb_thread_io_blocking_region` called from `rb_read_internal` called from `io_readpartial`
2. `native_ppoll_sleep` called inside `rb_waitpid`
3. (MJIT worker) `rb_native_cond_wait` called from `copy_cache_from_main_thread`

I think 3's lock is completely independent from blocking in 1 and 2, and I have no idea why 1 and 2 are blocking in that place forever.

## 2. in ruby_cleanup

http://ci.rvm.jp/results/trunk-mjit@silicon-docker/1428895
http://ci.rvm.jp/results/trunk-mjit-wait@silicon-docker/1429135

The stack trace on deadlock is like this:

~~~
0x00005587e954031c in verbose (level=3, format=format@entry=0x5587e973f692 "Unlocked %s", level=3) at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:289
289	verbose(int level, const char *format, ...)
>>> Machine level backtrace

Thread 3 (Thread 0x7fd7f439d700 (LWP 11123)):
#0  0x00007fd7f3e57ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7fd7f439cd70, expected=0, futex_word=0x7fd7f439ce88) at ../sysdeps/unix/sysv/linux/futex-internal.h:142
#1  __pthread_cond_wait_common (abstime=0x7fd7f439ce50, mutex=0x5587e9a1f660 <thread_cache_lock>, cond=0x7fd7f439ce60) at pthread_cond_wait.c:533
#2  __pthread_cond_timedwait (cond=cond@entry=0x7fd7f439ce60, mutex=mutex@entry=0x5587e9a1f660 <thread_cache_lock>, abstime=abstime@entry=0x7fd7f439ce50) at pthread_cond_wait.c:667
#3  0x00005587e963eca8 in native_cond_timedwait (abs=<synthetic pointer>, mutex=0x5587e9a1f660 <thread_cache_lock>, cond=0x7fd7f439ce60) at /home/ko1/ruby/src/trunk-mjit/thread_pthread.c:516
#4  register_cached_thread_and_wait (altstack=0x7fd7e0000b20) at /home/ko1/ruby/src/trunk-mjit/thread_pthread.c:1079
#5  thread_start_func_1 (th_ptr=<optimized out>) at /home/ko1/ruby/src/trunk-mjit/thread_pthread.c:1024
#6  0x00007fd7f3e516db in start_thread (arg=0x7fd7f439d700) at pthread_create.c:463
#7  0x00007fd7f2f1788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fd7f2df5700 (LWP 17017)):
#0  0x00007fd7f2f0acf6 in __GI_ppoll (fds=fds@entry=0x7fd7f2def8e8, nfds=nfds@entry=1, timeout=<optimized out>, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x00005587e963c757 in rb_sigwait_sleep (th=th@entry=0x0, sigwait_fd=sigwait_fd@entry=3, rel=rel@entry=0x0) at /home/ko1/ruby/src/trunk-mjit/thread_pthread.c:1959
#2  0x00005587e95a7a7c in ruby_waitpid_locked (vm=vm@entry=0x5587ea98a950, pid=pid@entry=11124, status=status@entry=0x7fd7f2def9dc, options=options@entry=0, cond=cond@entry=0x7fd7f2def9e0) at /home/ko1/ruby/src/trunk-mjit/process.c:1070
#3  0x00005587e953fd02 in exec_process (path=<optimized out>, argv=argv@entry=0x7fd7ec01ced0) at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:657
#4  0x00005587e9541276 in compile_c_to_o (o_file=0x7fd7f2defa80 "/tmp/_ruby_mjit_p17016u76.o", c_file=0x7fd7f2defb70 "/tmp/_ruby_mjit_p17016u76.c") at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:842
#5  convert_unit_to_func (unit=0x5587eaf601f0, cc_entries=<optimized out>, is_entries=<optimized out>) at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:1122
#6  0x00005587e9541af6 in mjit_worker () at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:1253
#7  0x00005587e9636dfa in mjit_worker (arg=0x5587e9541830 <mjit_worker>) at /home/ko1/ruby/src/trunk-mjit/thread_pthread.c:1866
#8  0x00007fd7f3e516db in start_thread (arg=0x7fd7f2df5700) at pthread_create.c:463
#9  0x00007fd7f2f1788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fd7f449f740 (LWP 17016)):
#0  0x00005587e954031c in verbose (level=3, format=format@entry=0x5587e973f692 "Unlocked %s", level=3) at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:289
#1  0x00005587e95437a3 in CRITICAL_SECTION_FINISH (msg=0x5587e973f8e8 "in stop_worker", level=3) at /home/ko1/ruby/src/trunk-mjit/mjit_worker.c:448
#2  stop_worker () at /home/ko1/ruby/src/trunk-mjit/mjit.c:659
#3  mjit_finish () at /home/ko1/ruby/src/trunk-mjit/mjit.c:734
#4  0x00005587e94dde81 in ruby_cleanup (ex=<optimized out>) at /home/ko1/ruby/src/trunk-mjit/eval.c:236
#5  0x00005587e94de102 in ruby_run_node (n=<optimized out>) at /home/ko1/ruby/src/trunk-mjit/eval.c:317
#6  0x00005587e94d91df in main (argc=<optimized out>, argv=<optimized out>) at /home/ko1/ruby/src/trunk-mjit/main.c:42
>>> Dump Ruby level backtrace
th: 0x5587ea98ae60, native_id: 0x7fd7f449f740
-- Control frame information -----------------------------------------------
c:0001 p:0000 s:0003 E:002360 (none) [FINISH]
~~~

In this case, 3 threads are blocking in:

1. `native_cond_timedwait` called from `register_cached_thread_and_wait`
2. (MJIT worker) `rb_sigwait_sleep` called from `ruby_waitpid_locked` called from `compile_c_to_o`
3. (main thread) looping inside `stop_worker` called from `ruby_cleanup`

1 looks innocent and ignoreable. In 2, somehow it seems to have lost the process to wait, or locked with VM's lock. If the situation is the former, sometimes this CI machine is overloaded and thus it may happen on such an environment. And if the situation is the latter, I have no idea why it's locked.

@normalperson While I'm taking a look at this by myself, it would be helpful if you have any insight about this, since `waitpid` thing is involved in both issues. For now I have no approach to fix this because it's not reproductive on my local machine...

----------------------------------------
Bug #14867: Process.wait can wait for MJIT compiler process
https://bugs.ruby-lang.org/issues/14867#change-74615

* Author: k0kubun (Takashi Kokubun)
* Status: Assigned
* Priority: Normal
* Assignee: normalperson (Eric Wong)
* Target version: 
* ruby -v: 
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
If Ruby tries to wait for any child process, MJIT's gcc/clang process could be caught by the method call. It's not convenient for both Ruby's user and MJIT worker thread, so Process.wait and its families should somehow avoid waiting for it.

---Files--------------------------------
0001-hijack-SIGCHLD-handler-for-internal-use.patch (13.8 KB)
JIT-test-all.log (39.9 KB)
mjit_test-all_63796.log (40.4 KB)
config_ruby-loco_mingw.log (27 KB)
test_jit_results.txt (41.2 KB)


-- 
https://bugs.ruby-lang.org/

Unsubscribe: <mailto:ruby-core-request / ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>