takashikkbn / gmail.com wrote:
> Issue #14867 has been updated by k0kubun (Takashi Kokubun).
> 
> > I think we'll have to support non-blocking/event-based waitpid in Windows for auto-fiber/Thread::Light, anyways. So making MJIT worker thread be a Ruby Thread (or Thread::Light) would allow it to hook into process.c APIs. I don't think the existing code is too far off from being able to support Windows...

> I see. Dropping MJIT worker by non-blocking/event-based
> waitpid would be an option once it's supported for
> auto-fiber/Thread::Light. At this moment, I'm not sure which
> is larger, synchronization cost between threads or translating
> ISeq to C code synchronously. So in a very short term, I wanna
> fix the 1 (waitpid) case with postponed_job unless it has a
> critical blocker to fix the issue. 

I suppose MJIT has priority over Thread::Light at the moment
since MJIT is already in trunk and people know about it.
I can take some time to investigate moving MJIT over to
event-based waitpid in a platform-independent way.

The other thing is MJIT multi-threading/synchronization seems
tricky-to-debug right now, and making it event-based should
be beneficial for reliability.

> Note that the current synchronization design is already
> achieving the better benchmark result than one by copying
> inline caches on enqueueing an ISeq. To simplify the situation
> by stop using postponed_job, I experimented to have another
> global variable for the synchronization job which is to be
> checked and invoked by `RUBY_VM_CHECK_INTS`, but the benchmark
> result was not good. So for now using postponed_job is the
> only possible approach that achieves the current performance.

OK, thank you for that information.  So checking one extra global
variables with RUBY_VM_CHECK_INTS introduces a measurable perf
hit?

> > rb_postponed_job_register only sets a flag, but doesn't wake up sleeping the thread in 1. or 2. by calling ubf.func (via rb_threadptr_interrupt).
> >
> > This is a tricky situation...
> >
> > Calling ubf.func is NOT async-signal-safe, so rb_postponed_job_register may not use it by default, either.

> I'm not still understanding why this could be related to the
> deadlock 1 (waitpid on #system, #`, or Process.waitpid). I
> assume that the sleeping threads "1." and "2." should be waken
> up by something prepared in threads "1." and "2.", not by MJIT
> worker thread "3.". So the fact that rb_postponed_job_register
> (called by MJIT worker thread "3.") may not wakeup the thread
> "1." or "2." should not be the cause of this deadlock, in my
> understanding.

Ah, so the waitpid from #system is on /bin/rm (I missed that earlier)
Is `rm' stuck (NFS, drive error, etc)?   But yes, I also wonder
if there's a bug in the current system+waitpid implementation.

Perhaps #system needs to use similar mechanism as mjit_worker to
prevent work-stealing.

Regardless, ubf.func need to be called after registering postponed
job, so there was still a problem, there.

> I actually don't understand how threads "1." and "2." are
> usually waken up. For waitpid thread "2.", I'm guessing that a
> signal handler (of SIGCHLD?) registered by the thread "2."
> itself somehow *replaces* native_ppoll_sleep and finishes it,
> but not sure.

System-level signal handler is only registered at startup, and
Ruby Signal.trap handler is process-wide and only runs in main
thread.  native_ppoll_sleep can be called by any thread
(which exclusively acquires sigwait_fd).

> > Also ruby_current_execution_context_ptr variable is unstable between setting ec->interrupt_flag (via RUBY_VM_SET_POSTPONED_JOB_INTERRUPT) and ubf.func calls since we make them without GVL
> >
> > This is a similar situation to [Bug #14939] r64062

> I recognize this issue. For this issue, @ko1 suggested to
> repeatedly calling `rb_postponed_job_register_one` (that
> registers the same job at most only once) to survive the ec
> switches. I can fix this issue, but I'm NOT thinking that this
> could be the solution for this deadlock if unblocking thread
> "3." waiting for postponed job may not wake up threads "1."
> and "2." as I understand.

Right.

> I *assume* that threads "1." and "2." are not waken up because
> something that should wake up "1." and "2." is not working
> somehow with MJIT enabled. At first @ko1 guessed that
> `rb_postponed_job_register` swaps `ec->interrupt_flag` in a
> bad way and a flag for the signal handler for waitpid is
> skipped to be set, but actually we're always using `ATOMIC_OR`
> for modifying `ec->interrupt_flag`, and so it may not be the
> cause either.

Agreed, and the self-pipe/eventfd would've woken
native_ppoll_sleep, anyways.

(But I need to fix a car problem, first :<)

Unsubscribe: <mailto:ruby-core-request / ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>