takashikkbn / gmail.com wrote:
> In this case, 3 threads are blocking in:
> 
> 1. `rb_thread_io_blocking_region` called from `rb_read_internal` called from `io_readpartial`
> 2. `native_ppoll_sleep` called inside `rb_waitpid`
> 3. (MJIT worker) `rb_native_cond_wait` called from `copy_cache_from_main_thread`

rb_postponed_job_register only sets a flag; it doesn't wake up
the sleeping thread in 1. or 2. by calling ubf.func (via
rb_threadptr_interrupt).

This is a tricky situation...

Calling ubf.func is NOT async-signal-safe, so rb_postponed_job_register
may not use it by default, either.

Also, the ruby_current_execution_context_ptr variable is unstable
between setting ec->interrupt_flag (via
RUBY_VM_SET_POSTPONED_JOB_INTERRUPT) and the ubf.func call, since
we make both without the GVL.

This is a similar situation to [Bug #14939] r64062.
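
To illustrate the difference between "sets a flag" and "wakes the
sleeper", here is a minimal standalone sketch of the classic
self-pipe pattern. It is not Ruby's code: job_pending, wake_pipe,
wakeup_init and sleep_until_job are hypothetical names, and
job_pending only stands in for ec->interrupt_flag. The handler
restricts itself to async-signal-safe operations (a sig_atomic_t
store and write(2)), and the write is what gets a thread out of
poll(2).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for ec->interrupt_flag. */
static volatile sig_atomic_t job_pending;
/* Hypothetical wakeup pipe: [0] is polled by the sleeper, [1] is written. */
static int wake_pipe[2] = { -1, -1 };

/* Signal handler: only async-signal-safe operations are used here. */
static void on_signal(int sig)
{
    int saved_errno = errno;
    ssize_t r;

    (void)sig;
    job_pending = 1;                 /* the "flag only" part */
    r = write(wake_pipe[1], "!", 1); /* the wakeup part; write(2) is safe */
    (void)r;                         /* a full pipe already implies a wakeup */
    errno = saved_errno;
}

/* Create the pipe and install the handler; returns 0 on success. */
static int wakeup_init(int signo)
{
    struct sigaction sa;

    if (pipe(wake_pipe) == -1) return -1;
    /* Non-blocking write end, so the handler can never block. */
    if (fcntl(wake_pipe[1], F_SETFL, O_NONBLOCK) == -1) return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Sleep until woken or until timeout_ms passes.
 * Returns 1 if a job was pending (and clears the flag), else 0. */
static int sleep_until_job(int timeout_ms)
{
    struct pollfd pfd;
    char buf[16];
    int n;

    pfd.fd = wake_pipe[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n == -1 && errno == EINTR);

    if (n > 0) {
        ssize_t r = read(wake_pipe[0], buf, sizeof(buf)); /* drain wakeups */
        (void)r;
    }
    if (job_pending) {
        job_pending = 0;
        return 1;
    }
    return 0;
}
```

Without the write(), a thread already blocked in poll() on another
thread would keep sleeping until its timeout even though the flag
is set, which is roughly the situation in 1. and 2. above.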

> I think 3's lock is completely independent from blocking in 1
> and 2, and I have no idea why 1 and 2 are blocking in that
> place forever.

I'm not sure if rb_postponed_job_register is the right tool
in a multi-threaded situation.  It seems like the "postponed"
part is a bad fit for MJIT anyways.

Anyways, the above issue is pretty straightforward, I think.

> ## 2. in ruby_cleanup
> 
> In this case, 3 threads are blocking in:

Not sure about this one, yet:

> 1. `native_cond_timedwait` called from `register_cached_thread_and_wait`
> 2. (MJIT worker) `rb_sigwait_sleep` called from `ruby_waitpid_locked` called from `compile_c_to_o`
> 3. (main thread) looping inside `stop_worker` called from `ruby_cleanup`
> 
> 1 looks innocent and ignoreable.

Not sure; is this a timeout situation?

THREAD_CACHE_TIME is only 3 seconds, so I think the cache entry
would've timed out if the whole test hit its timeout.

> In 2, somehow it seems to have lost the process to wait, or
> locked with VM's lock. If the situation is the former,
> sometimes this CI machine is overloaded and thus it may happen
> on such an environment. And if the situation is the latter, I
> have no idea why it's locked.

Can you tell if the process that 2. is waiting on is still a zombie?
To debug, maybe always return &busy_wait from sigwait_sleep_time
and check the contents of vm->waiting_pids periodically.

You may also periodically kill(pid, 0) to see if the process is
killable.
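
A rough sketch of such a probe, in case it helps. probe_child and
the child_state enum are hypothetical names, not anything in the
tree. One caveat: as far as I know, kill(pid, 0) also succeeds
for an un-reaped zombie, so the sketch additionally uses waitid(2)
with WNOWAIT to peek at a child's state without reaping it. That
part only works when called from the parent of pid.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result type for the probe below. */
enum child_state { CHILD_GONE, CHILD_RUNNING, CHILD_ZOMBIE, CHILD_UNKNOWN };

/* Probe a child process without signalling or reaping it. */
static enum child_state probe_child(pid_t pid)
{
    siginfo_t info;

    assert(pid > 0); /* 0 and negative values address process groups */

    /* Signal 0 performs only the existence and permission checks. */
    if (kill(pid, 0) == -1)
        return errno == ESRCH ? CHILD_GONE : CHILD_UNKNOWN;

    /* WNOHANG: do not block. WNOWAIT: leave the child waitable, so a
     * later waitpid() elsewhere can still collect its status. */
    memset(&info, 0, sizeof(info));
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOHANG | WNOWAIT) == -1)
        return CHILD_UNKNOWN; /* e.g. ECHILD: pid is not our child */

    /* si_pid stays 0 when the child has not exited yet. */
    return info.si_pid == pid ? CHILD_ZOMBIE : CHILD_RUNNING;
}
```

Calling probe_child(pid) periodically on each pid found in
vm->waiting_pids would show whether the process exited and is just
waiting to be reaped (CHILD_ZOMBIE) or was already reaped by
something else (CHILD_GONE).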