Issue #12095 has been updated by Evan Phoenix.


I'm hitting this as well, and looking over the code in question on 2.3.0, I'm wondering if the problem is that the at_exit pseudo-object is actually allocated within the body of rb_vm_t. Its address is taken and passed to `rb_ary_push`, which performs OBJ_WRITE. That's where wb_incremental is invoked from.

Because the mark bits are not located with the object header anymore, the mark bitmap is consulted but the position in the mark bitmap is calculated against the address of at_exit, which isn't located on the main ruby heap at all!

The path to the bad pointer, given X as the address of at_exit within rb_vm_t is: RVALUE_BLACK_P(X) => RVALUE_MARKED(X) => RVALUE_MARK_BITMAP(X) => GET_HEAP_MARK_BITS(X) => GET_HEAP_PAGE(X) => GET_PAGE_HEADER(X) => GET_PAGE_BODY(X) => ((struct heap_page_body *)((bits_t)(x) & ~(HEAP_ALIGN_MASK))).

The value returned by the above sequence is supposed to be a page header that can itself be dereferenced to find the mark bits. But because at_exit is in a random place, the page header is basically random bytes, and thus the dereference crashes.
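To make the last step of that chain concrete, here is a minimal sketch in C of the masking that GET_PAGE_BODY performs. The `sketch_` names are hypothetical, and the 16 KiB page alignment (an align log of 14) is an assumption about what gc.c uses on 2.3.0, so treat the numbers as illustrative only.

```c
#include <stdint.h>

/* Assumption: heap pages are aligned to 1 << 14 bytes (16 KiB). */
#define SKETCH_HEAP_ALIGN_LOG  14
#define SKETCH_HEAP_ALIGN      ((uintptr_t)1 << SKETCH_HEAP_ALIGN_LOG)
#define SKETCH_HEAP_ALIGN_MASK (SKETCH_HEAP_ALIGN - 1)

/* Round an address down to the start of the aligned block containing it.
 * For a real heap object this is the start of its heap page body. For any
 * other address (such as a field inside rb_vm_t) it is simply whatever
 * memory happens to sit at that aligned boundary. */
static uintptr_t sketch_page_body(uintptr_t addr)
{
    return addr & ~SKETCH_HEAP_ALIGN_MASK;
}

/* Byte offset of the address inside that aligned block. */
static uintptr_t sketch_offset_in_page(uintptr_t addr)
{
    return addr & SKETCH_HEAP_ALIGN_MASK;
}
```

The mask itself never fails: it always yields some aligned address. Nothing in the computation checks that the input came from the Ruby heap, so the failure only shows up later, when the bytes at that address are read as if they were a page header.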


----------------------------------------
Bug #12095: ruby_vm_at_exit can sometime cause a crash.
https://bugs.ruby-lang.org/issues/12095#change-57432

* Author: Nicolas Noble
* Status: Open
* Priority: Normal
* Assignee: 
* ruby -v: 
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN, 2.3: UNKNOWN
----------------------------------------
This behavior has been seen erratically, but one of our users got it to reproduce almost systematically. We didn't manage to understand what made his system special enough that the crash would reproduce so well.

Here's one of the reports:

https://gist.github.com/blowmage/7ebe774039013bc8c990


The current workaround to that one (alongside a few other comments) is done here: https://github.com/grpc/grpc/pull/5337/files

Note that removing the call to ruby_vm_at_exit makes everything load fine. Also note that the removed comment from that pull request is wrong: this has been happening on versions of Ruby other than 2.0.

It's interesting to note from the backtrace information that this is happening during a garbage collection. The fact that a garbage collection happens at that exact moment is probably the reason this bug is so difficult to reproduce. Perhaps a modified version of Ruby, or very specific garbage collector settings, might help reproduce it.

The fault address (0x88) seems to indicate that a NULL pointer into a struct was being dereferenced.
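As a rough illustration of why a small fault address such as 0x88 suggests a NULL struct pointer, the sketch below uses a purely hypothetical struct (not any real Ruby structure) whose member sits at offset 0x88. Accessing that member through a base pointer computes base plus offset, so a NULL base gives exactly 0x88.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, chosen only so that `field` lands at offset 0x88. */
struct sketch_header {
    char padding[0x88];
    int  field;
};

/* Address that would be touched when reading `field` through `base`.
 * Computed arithmetically, so nothing is actually dereferenced here. */
static uintptr_t sketch_field_address(uintptr_t base)
{
    return base + offsetof(struct sketch_header, field);
}
```

With a base of 0 the function returns 0x88, which matches the shape of the reported fault address. It does not identify which struct or member was actually involved.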

Disassembling the corresponding execution address seems to point at a crash inside obj_info, from the first line of gc_writebarrier_incremental, but this is after a very quick inspection of the code, so don't take my word for it.

This problem has been reported to us on at least Ruby 2.0.0, Ruby 2.2.0, Ruby 2.2.3, and Ruby 2.3.0.



-- 
https://bugs.ruby-lang.org/