Issue #15626 has been updated by PikachuEXE (Pikachu Leung).


Made some changes for "auto compaction"
I am a newbie in C programming
This is just for discussion, not tested nor compiled yet

https://github.com/ruby/ruby/compare/trunk...PikachuEXE:feature/gc-compact/pika


----------------------------------------
Feature #15626: Manual Compaction for MRI's GC (`GC.compact`)
https://bugs.ruby-lang.org/issues/15626#change-77217

* Author: tenderlovemaking (Aaron Patterson)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
Hi,

I've attached a patch that implements a compactor for MRI.  I'll try to describe the patch in detail below, but the biggest user facing changes are:

1. A new method `GC.compact` that can be called from Ruby
2. C extensions have a new "after compaction" callback they can register
3. Introduction of new marking functions that allow objects to be marked, but not "pinned" to the same location
4. A function to get the new location of an object (to be used from the "after compaction" callback)

# Compactor for RGenGC

This document describes the compactor implementation for RGenGC.  This patch
adds a new method to Ruby, `GC.compact` which will compact the objects in the
heap.  What follows is a description of the algorithm, along with achievements
and technical challenges.

## Steps for Compaction

These are the high level steps taken in `GC.compact` are as follows:

1. Full GC
2. Move objects
3. Update references
4. Full GC

### Step One, Full GC

Not all objects in Ruby's heap can move.  In particular, C extensions may hold
references to VALUE pointers.  If the VALUE pointers move, the C extension may
not be able to find them and will get a SEGV or possibly wrong data.

To prevent this, a new "pinning" table was introduced.  Any object marked via
`rb_gc_mark` is "pinned" and is not allowed to move.  C extensions must mark
their references in order to keep them alive.  Since C extensions use
`rb_gc_mark` to mark a reference, we know not to move that reference.  In order
to ensure all references get marked, we perform a full GC.

New functions, `rb_gc_mark_no_pin`, etc were introduced and used internally.
These functions keeps the object alive but without pinning the reference.

Summary: Perform a full GC so that objects marked with `rb_gc_mark` do not move.

### Step Two, Move Objects

This compactor uses a "two finger" algorithm which was introduced in "The
Programming Language LISP: Its Operation and Applications" (page 228)[^1].  Two
pointers point at each side of the heap, if one slot is empty, and the other is
moveable, it swaps places and leaves a `T_MOVED` object that contains a
forwarding address.

In Ruby, the algorithm looks something like this:

```ruby
def compact
  heap = [ ... ] # many slots
  left  = 0
  right = heap.length - 1

  while left < right
    left_slot = heap[left]
    right_slot = heap[right]

    if is_empty?(left_slot) && !is_empty?(right_slot) && can_move?(right_slot)
      swap(left, right)
      heap[right] = T_MOVED.new(left) # leave forwarding address
    end

    while !is_empty?(heap[left])
      left += 1
    end

    while is_empty?(heap[right]) || !can_move?(heap[right])
      right -= 1
    end
  end
end
```

If we have a heap with 6 slots, before compaction it might look something like
this:

```
                     иг иб иб иб иб иб иб иб иб иб иб иб        
                        Before Compaction   ив       
                     иж иб иб иб иб иб иб иб иб иб иб иб        
               игибибибибибииибибибибибииибибибибибииибибибибибииибибибибибииибибибибибид
 Slot Index    ив  1  ив  2  ив  3  ив  4  ив  5  ив  6  ив
               изибибибибибилибибибибибилибибибибибилибибибибибилибибибибибилибибибибибий
Slot Contents  ив  A  ив  B  ивEmptyивEmptyив  C  ив  D  ив
               ив     ив     ив     ив     ив     ив     ив
               ижибибибибибикибибибибибикибибибибибикибибибибибикибибибибибикибибибибибие
```

After compaction, but before reference update, it will look like this:

```
                     иг иб иб иб иб иб иб иб иб иб иб иб        
                         After Compaction   ив       
                     иж иб иб иб иб иб иб иб иб иб иб иб        
               игибибибибибииибибибибибииибибибибибииибибибибибииибибибибибииибибибибибид
 Slot Index    ив  1  ив  2  ив  3  ив  4  ив  5  ив  6  ив
               изибибибибибилибибибибибилибибибибибилибибибибибилибибибибибилибибибибибий
Slot Contents  ив  A  ив  B  ив  D  ив  C  ивMOVEDивMOVEDив
               ив     ив     ив     ив     ивTO 4 ивTO 3 ив
               ижибибибибибикибибибибибикибибибибибикибибибибибикибибибибибикибибибибибие
```

Slots 5 and 6 contain `T_MOVED` objects that point to the new locations of D and
C objects.

### Step Three, Update References

After objects have moved, references are updated.  This is a matter of updating
any references to use the forwarding address rather than the original location.

For example, if object A has a reference to C, and B a reference to D:

```
                     иг иб иб иб иб иб иб иб иб иб иб иб        
                        Before Compaction   ив       
                     иж иб иб иб иб иб иб иб иб иб иб иб        
               игибибибибибииибибибибибииибибибибибииибибибибибииибибибибибииибибибибибид
 Slot Index    ив  1  ив  2  ив  3  ив  4  ив  5  ив  6  ив
               изибибибибибилибибибибибилибибибибибилибибибибибилибибибибибилибибибибибий
Slot Contents  ив  A  ив  B  ивEmptyивEmptyив  C  ив  D  ив
               ив     ив     ив     ив     ив     ив     ив
               ижибибибибибикибибибибибикибибибибибикибибибибибикибибибибибикибибибибибие
                  ив     ив                 ве     ве   
                  ив     ив                 ив     ив   
                  ижибибибибибилибибибибибибибибибибибибибибибибибие     ив   
                        ижибибибибибибибибибибибибибибибибибибибибибибибие   
```

After compaction, but before updating references, the heap will look like this:

```
                     иг иб иб иб иб иб иб иб иб иб иб иб        
                         After Compaction   ив       
                     иж иб иб иб иб иб иб иб иб иб иб иб        
               игибибибибибииибибибибибииибибибибибииибибибибибииибибибибибииибибибибибид
 Slot Index    ив  1  ив  2  ив  3  ив  4  ив  5  ив  6  ив
               изибибибибибилибибибибибилибибибибибилибибибибибилибибибибибилибибибибибий
Slot Contents  ив  A  ив  B  ив  D  ив  C  ивMOVEDивMOVEDив
               ив     ив     ив     ив     ивTO 4 ивTO 3 ив
               ижибибибибибикибибибибибикибибибибибикибибибибибикибибибибибикибибибибибие
                  ив     ив                 ве     ве   
                  ив     ив                 ив     ив   
                  ижибибибибибилибибибибибибибибибибибибибибибибибие     ив   
                        ижибибибибибибибибибибибибибибибибибибибибибибибие   
```

The update references step looks at each object in the heap, checks to see if it
references any moved slots, then updates the reference to use the new location.
In Ruby, it looks like this:

```ruby
def update_references
  heap.each do |slot|
    next if is_empty?(slot) || is_moved?(slot)

    slot.references.each_with_index do |child, i|
      if is_moved?(child)
        slot.set_reference(i, child.new_location)
      end
    end
  end
end
```

After reference updates, the heap will look like this:

```
                     иг иб иб иб иб иб иб иб иб иб иб иб        
                        After Ref Updates   ив       
                     иж иб иб иб иб иб иб иб иб иб иб иб        
               игибибибибибииибибибибибииибибибибибииибибибибибииибибибибибииибибибибибид
 Slot Index    ив  1  ив  2  ив  3  ив  4  ив  5  ив  6  ив
               изибибибибибилибибибибибилибибибибибилибибибибибилибибибибибилибибибибибий
Slot Contents  ив  A  ив  B  ив  D  ив  C  ивMOVEDивMOVEDив
               ив     ив     ив     ив     ивTO 4 ивTO 3 ив
               ижибибибибибикибибибибибикибибибибибикибибибибибикибибибибибикибибибибибие
                  ив     ив     ве     ве               
                  ив     ив     ив     ив               
                  ижибибибибибилибибибибибилибибибибибие               
                        ижибибибибибие                     
```

### Step Four, Full GC

After references have been updated, a full GC is performed.  This converts the
`T_MOVED` slots back in to `T_EMPTY` slots.  The GC automatically takes care of
this because no object should reference a `T_MOVED` slot.

## Non Moving Objects

Non moving objects are:

1. Objects on the stack
2. Objects marked with `rb_gc_mark`
3. Hash key objects

### Objects on the stack

The stack should not be mutated, so these are pinned.

### Objects marked with `rb_gc_mark`

As mentioned earlier, any object marked with `rb_gc_mark` may not move because a
C extension may hold a reference.  Here is an excerpt from Yajl, a JSON parser.
This is the mark function it uses:

```c
static void yajl_encoder_wrapper_mark(void * wrapper) {
    yajl_encoder_wrapper * w = wrapper;
    if (w) {
        rb_gc_mark(w->on_progress_callback);
        rb_gc_mark(w->terminator);
    }
}


obj = Data_Make_Struct(klass, yajl_encoder_wrapper, yajl_encoder_wrapper_mark, yajl_encoder_wrapper_free, wrapper);
```

The values in this mark function `w->on_progress_callback` and `w->terminator`
will not move since they are pinned by `rb_gc_mark`.

### Hash key objects

Certain hash keys will not move.  Most objects use their address as the hash
value.  If the object moves, the hash value could change, so hash keys are not
allowed to move.

There is an exception for string objects.  String objects calculate their hash
based on the bytes in the string.  Since the bytes in the string do not change
after move, they are allowed to move.  Most hash keys in large programs are
strings, so only allowing string keys to move seemed enough.

## Special Considerations

### Object ID

`Object#object_id` bases its value from the address of the object.  For example:

```ruby
x = Object.new
id = x.object_id
GC.compact        # `x` moves
id == x.object_id # `id` should be same as `x.object_id`
```

We expect the object id to remain the same regardless of compaction.  To deal
with `object_id`, two hashes were introduced.  One hash contains a map of the
object to the *seen* object id.  The second map contains all seen object ids.
The maps are updated any time `object_id` is called, an object moves, or an
object is freed.

object_id implementation in Ruby:

```ruby
$obj_to_id = {}
$id_to_obj = {}

class Object
  def object_id
    if $obj_to_id.key?(address)
      # Second call to object_id on this object
      return $obj_to_id[address]
    else
      # First call to object_id on this object

      addr = self.address

      loop do
        if $seen_ids.key?(addr)
          # Resolve conflict
          addr += sizeof(RVALUE)
        else
          $obj_to_id[self.address] = addr
          $id_to_obj[addr]         = self
          return addr
        end
      end
    end
  end

  private

  def address
    # returns address in memory of object
  end
end
```

During compaction:

```ruby
def compact
  heap = [ ... ] # many slots

  while left < right
    dest_slot = heap[left]
    source_slot = heap[right]

    if moving?(source_slot)
      if $obj_to_id.key?(source_slot.address)
        id = $obj_to_id.delete(source_slot.address)
        $obj_to_id[dest_slot.address] = id
        $id_to_obj[id]                = dest_slot.address
      end

      # etc
    end
  end
end
```

During Free:

```ruby
def free(obj)
  if $obj_to_id.key?(obj.address)
    $seen_ids.delete($obj_to_id.delete(obj.address))
  end
end
```

### ObjectSpace._id2ref

When generating an object id, in the case of a collision, the address is
incremented by 40.  This enables `_id2ref` to determine there is a VALUE pointer
and round trip the object correctly.

## Compaction Support for C extensions

Any object marked with `rb_gc_mark` will be "pinned" and cannot move.  C
extensions mark objects via `rb_gc_mark`, so this ensures that pointers in C
stay valid.

However, some C extensions may want to support reference movement.

### Reference Movement in C extensions

In order to maintain backwards compatibility, if a C extension holds a reference
to a VALUE, the VALUE should not move.  Going forward, C extensions can support
moving VALUEs.  To support moving VALUEs, a C extension should change
`rb_gc_mark` to `rb_gc_mark_no_pin`, then implement a new callback that is
called during compaction that gives the extension an opportunity to update its
own references.  A new function `rb_gc_new_location` will return the new
location for a particular VALUE.

Here is an example for autoload from `variable.c`.  A new compaction callback is
registered, `autoload_i_compact`:

```c
static const rb_data_type_t autoload_data_i_type = {
    "autoload_i",
    {autoload_i_mark, autoload_i_free, autoload_i_memsize, autoload_i_compact},
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

The mark function changes to *mark* references, but *not* pin them:

```c
static void
autoload_i_mark(void *ptr)
{
    struct autoload_data_i *p = ptr;

    rb_gc_mark_no_pin(p->feature);

    /* allow GC to free us if no modules refer to this via autoload_const.ad */
    if (list_empty(&p->constants)) {
        rb_hash_delete(autoload_featuremap, p->feature);
    }
}
```

After the heap is compacted, any C extensions that registered a "compaction"
callback will be called and have a chance to update internal references.  The
autoload compaction function is like this:

```c
static void
autoload_i_compact(void *ptr)
{
    struct autoload_data_i *p = ptr;
    p->feature = rb_gc_new_location(p->feature);
```

The compaction callback asks for the new location for the `p->feature`
reference.  `rb_gc_new_location` will either return the current value of
`p->feature` (in the case the VALUE did not move), or the new location.

## Reference Verification and Testing

After compaction, objects that have moved are changed to `T_MOVED` objects with
forwarding addresses.  Once all references are updated, no object should point
to a `T_MOVED` slot.  We can say that before and after compaction, no object
should ever point at a `T_MOVED` slot.  So if a `T_MOVED` object is ever pushed
on to the mark queue, there is a bug.  `push_mark_stack` has a check for
`T_MOVED` objects, and will crash if a `T_MOVED` object is ever pushed on to the
mark stack.

A method `GC.verify_compaction_references` was added that doubles the available
heap size, then compacts the heap.  Since the heap has doubled in size, any
object that can move will move.  Any references to `T_MOVED` objects will be
caught during the GC phase after compaction.

## Known Issue

Safety for C extensions depends entirely on C extensions marking their
references.  If a C extension does not mark a reference, the compactor assumes
that the object is safe to move.  This can cause an error in the following
situation, when there is an object graph like this:


```
 игибибибибибибибибибибибибибибибид       игибибибибибибибибибибибибибибибид 
 ив  Ruby Object  ив       ив  Ruby Object  ив 
 ив  Implemented  ив       ив  Implemented  ив 
 ив    in Ruby    ив       ив     in C      ив 
 ив               ив       ив               ив 
 ижибибибибибибибибибибибибибибибие       ижибибибибибибибибибибибибибибибие 
         ив                       ив         
         ив                                 
         ив                       ив   NOT   
 Marked  ив                          Marked 
         ив                       ив         
         ив   игибибибибибибибибибибибибибибибид             
         ив   ив               ив   ив         
         ив   ив  Ruby Object  ив             
         ижибибив               ивиб ие         
             ив               ив             
             ижибибибибибибибибибибибибибибибие             
```

If the C extension contains a reference to an object, but expects the object
not to move because a Ruby object contains a reference, then the target Ruby
object *may* move and the reference in the C extension will be wrong.  I like to
call these "ghost references" because the GC cannot see them but they will come
back to haunt you.

The solution to this is either:

1. Change the C extension to `rb_gc_mark` the object
2. Remove the reference from the C extension

One example of this situation was fixed here:

  https://github.com/msgpack/msgpack-ruby/issues/133

## Summary

Backwards compatibility with C extensions is maintained by "pinning" any
references marked with `rb_gc_mark`.  A full mark *must* be done before
compacting to discover all pinned references.

A new callback for C extensions is introduced so that C extensions can support
compaction.  C extensions that wish to support compaction must use the new
callback, `rb_gc_mark_no_pin`, and `rb_gc_new_location`.

C extensions that maintain pointers but do not mark those pointers may SEGV.  We
can use `GC.verify_compaction_references` to discover these issues.

[^1]: See also "The Garbage Collection Handbook" page 32


---Files--------------------------------
before_compact-0.png (111 KB)
after_compact-0.png (80.6 KB)
before_compact-1.png (81.2 KB)
after_compact-1.png (79.4 KB)
0001-GC-Compaction-for-MRI.patch (77.4 KB)


-- 
https://bugs.ruby-lang.org/

Unsubscribe: <mailto:ruby-core-request / ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>