Vladimir Makarov <vmakarov / redhat.com> wrote:
> On 02/06/2018 02:38 PM, Eric Wong wrote:
> >Vladimir Makarov <vmakarov / redhat.com> wrote:
> >>On 02/06/2018 05:00 AM, Eric Wong wrote:
> >>>during rebuild.  Disabling the free(tab->entries) at line
> >>>st.c:792 (patch below) seems to indicate success with the
> >>>thread_safe test suite (letting it loop overnight).
> >It still crashed after four runs :<   It might run longer with
> >the simplecov/coveralls stuff commented out in spec_helper.rb
> >since coverage creates a giant hash and might increase the chance
> >of failure.
> >
> >>Eric, thank you for working on the problem and analyzing it. I'll look at
> >>this and try to fix it as soon as possible.
> >
> I reproduced this crash although the reproducing is not stable with or
> without valgrind.
> 
> It is a typical data race. The same problem existed in the **old hash
> tables**. It also rebuilt tables and freed old data structure.

Ah, thank you for your investigation.  I missed this earlier;
but I see a problem with the uncommon calls to .eql? and .hash
via rb_any_cmp and obj_any_hash functions which may invoke
scheduling in the middle of a lookup.

> **File st.c was never thread-safe**. The data races are/were possible in
> many places.
>
> We could make st.c thread-safe. But I don't think it is a right way. It is
> not a trivial task and it also will hurt performance considerably. We still
> needs thread-unaware level to work with hash tables (st.c) for cases when
> tables are used internally in one thread.

100% agreed.  st.c performance cannot be compromised in the
common case for thread-safety.

> So I think the crash should be fixed in other places where calls of st.c
> happen.

> I don't know how it should be fixed because I don't know Ruby thread
> semantics. Does Ruby guarantee that there are no data races or should a
> ruby programmer still provides thread synchronization despite GIL? If it is
> later, thread_safe gem is probably buggy because one thread reading a table
> and another thread inserting elements while process table in a Ruby block.
> If there is no sync it is a typical data race and the result is
> unpredictable. In this case it is a segfault crash. We could just give a
> better message about the data races if segfault happens in st.c.

Yes, thread_safe does not live up to its name, apparently.

Regardless of its bugginess, I don't believe segfaults should
be triggerable from pure Ruby code.

(I do not speak for the rest of the team)

> Also I don't know how GIL works. Where the thread switching can happen. Is
> the switch possible in find_table_ind or we just read unsync cashed value
> in the thread because st.c never used atomics.
> 
> Unfortunately I am not well familiar with Ruby threads so it is hard for me
> to say how to fix it. I only think that we should keep st.c thread-unaware
> as it always was.

I agree st.c should be thread-unaware.

One option is to disable thread scheduling during .eql? and
.hash calls; not sure how complex that would be (basically
reintroducing Thread.exclusive, which was recursive).

Perhaps some form of deferred reclamation (either GC or
Userspace-RCU/QSBR/EBR) can be employed during uncommon
rebuilds...

Will think about it; but I won't have much time for Ruby in
the next month or so.

Unsubscribe: <mailto:ruby-core-request / ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>