Issue #12628 has been updated by sam.saffron (Sam Saffron).


looks like this regressed https://bugs.ruby-lang.org/issues/13772 

----------------------------------------
Feature #12628: change block/env structs
https://bugs.ruby-lang.org/issues/12628#change-65968

* Author: ko1 (Koichi Sasada)
* Status: Closed
* Priority: Normal
* Assignee: ko1 (Koichi Sasada)
* Target version: 
----------------------------------------
I will change block/env structures for performance.

----

I'm not sure who is interested in this area, but it will be a big change.

# Issues

Now, MRI has several problems.

(1) We need to clear `rb_control_frame_t::block_iseq` on every frame setup. This costs space (a `VALUE` per frame) and initialization time.
(2) There are several ways to pass a block: ISeq (`iter{...}`), Proc (`iter(&pr)`), and Symbol (`iter(&:sym)`). However, they are not optimized (for Symbol blocks, there is only ad-hoc check code).
(3) Env (and Proc, Binding) objects are not WB-protected ([Bug #10212]).

# Proposal

To solve them, I wrote a big patch.
https://github.com/ruby/ruby/compare/trunk...ko1:block_code

## Introduce Block Handler (BH)

For issues (1) and (2), I introduced the concept of a "Block Handler" (BH).

### Current implementation

Now, `rb_block_t` pointers are passed to represent given blocks.

`rb_block_t` has the following types.
(1) A part of current control frame (with `block_iseq = iseq`) (`iter{...}`)
(2) proc body (`iter(&pr)`)
(3) A part of current control frame (with `block_iseq = :sym`) (`iter(&:sym)`)
(Internally, there is also (4) `ifunc`, for blocks implemented in C.)

They are placed on the frame of the method that receives the block (as a local variable, `ep[0]`).

To mark the Proc during GC for (2), we keep `rb_block_t::proc` (== `rb_control_frame_t::block_iseq`).

### Using BH

To remove `rb_block_t::proc` (== `rb_control_frame_t::block_iseq`),
we introduce BH so that a Proc or Symbol can be stored directly as the given block (in a special local variable).

A Proc or a Symbol is a normal object, so we can store it as is.
The `iseq` and `ifunc` types ((1) and (4)) need more thought.

To make this clear, I introduced `struct rb_captured_block` to represent a set of `self`, local variables (`ep`) and `iseq` (or `ifunc`). (Currently `rb_block_t` represents the same set.)

Passed blocks with `iseq` (`iter{...}`) are represented by a pointer to a `rb_captured_block`.
Such pointers are not GC-managed `VALUE`s, so we add a tag to them.

* `ptr | 0x01` -> pointer to captured_block contains iseq
* `ptr | 0x03` -> pointer to captured_block contains ifunc (for internal)

Tagged pointers are recognized as Fixnum by GC, so GC does not try to mark them.

(Note that the current implementation uses this kind of tagged pointer to represent the "local frame" (no previous Env) flag.
Instead of tag information, we introduce `VM_ENV_FLAG_LOCAL` as a frame flag for this purpose.
See the next section about "ENV_FLAG"s.)

We can recognize a type of passed BH with the following rule:

(0) BH == VM_BLOCK_HANDLER_NONE (== 0) -> no block given
(1) (BH & 0x03) == 0x01 -> pointer to captured_block contains iseq
(2) (BH & 0x03) == 0x03 -> pointer to captured_block contains ifunc
(3) SYMBOL_P(BH)        -> Symbol
(4) Otherwise           -> Proc

This is what `vm_block_handler_type(VALUE block_handler)` does.
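
The rule above can be sketched as a small standalone C model. This is only an illustration, not the code in the patch: every `toy_*` name is hypothetical, the `0x01`/`0x03` tags follow the tagging list above, and a Symbol or Proc is modeled as a pointer to a tiny header struct, whereas real MRI uses `SYMBOL_P()` and object flags.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t VALUE;

enum toy_bh_type {
    TOY_BH_NONE, TOY_BH_ISEQ, TOY_BH_IFUNC, TOY_BH_SYMBOL, TOY_BH_PROC
};

/* Hypothetical stand-in for a heap object (Symbol or Proc). */
enum toy_kind { TOY_KIND_SYMBOL, TOY_KIND_PROC };
struct toy_object { enum toy_kind kind; };

/* Hypothetical stand-in for struct rb_captured_block. */
struct toy_captured_block {
    VALUE self;
    const VALUE *ep;
    const void *code; /* iseq or ifunc */
};

/* Tag an aligned pointer so that it looks like a Fixnum (low bit set). */
static VALUE
toy_bh_from_captured(const struct toy_captured_block *captured, int is_ifunc)
{
    VALUE ptr = (VALUE)captured;
    assert((ptr & 0x03) == 0); /* the two low bits must be free for the tag */
    return ptr | (VALUE)(is_ifunc ? 0x03 : 0x01);
}

/* Strip the tag to get the original pointer back. */
static const struct toy_captured_block *
toy_bh_to_captured(VALUE bh)
{
    return (const struct toy_captured_block *)(bh & ~(VALUE)0x03);
}

/* Classify a block handler following rules (0) to (4). */
static enum toy_bh_type
toy_bh_type(VALUE bh)
{
    if (bh == 0) return TOY_BH_NONE;
    if ((bh & 0x03) == 0x01) return TOY_BH_ISEQ;
    if ((bh & 0x03) == 0x03) return TOY_BH_IFUNC;
    if (((const struct toy_object *)bh)->kind == TOY_KIND_SYMBOL) {
        return TOY_BH_SYMBOL;
    }
    return TOY_BH_PROC;
}
```

The order of the checks matters in this model: the tag tests must come before anything that dereferences the value, because a tagged pointer is not a valid object reference.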

To invoke a passed block represented by a BH, we need to check its type with `vm_block_handler_type(VALUE block_handler)`. This adds some overhead, because the current implementation only needs to check `rb_block_t::iseq` (which can contain an iseq, an ifunc, or a Symbol). However, I believe the new code is simpler and more readable.
In fact, the "invoke block" benchmark (vm1_yield) is faster.

I renamed `rb_block_t` to `struct rb_block`, which represents an escaped block stored by a Proc or Binding.
We introduce `rb_block::type` to represent the type, corresponding to the BH's type.
`rb_block::as` is a union that represents the block body specified by `type`.
`rb_block` and BH can be converted to each other.

```C
struct rb_block {
    union {
	struct rb_captured_block captured;
	VALUE symbol;
	VALUE proc;
    } as;
    enum rb_block_type type;
};
```

To check the type of a block, use `vm_block_type()` instead of reading `rb_block::type` directly, because it performs several assertions (when VM_CHECK_MODE > 0).

### Short summary

(1) Introduce `struct rb_captured_block` to represent a set of `self`, variables (`ep`), and `code` (`iseq` or `ifunc`).
Usually the storage for this struct is part of the caller's control frame.
(2) A method called with a block receives a "Block Handler" (BH) that represents the passed block. It is a tagged pointer to `struct rb_captured_block` (which looks like a Fixnum), a Proc object, or a Symbol object.
(3) A method that was called with a block (== iterator) invokes the block according to the type of the given BH. We can check the BH type with `vm_block_handler_type()`.
(4) To make a Proc, convert the BH to a `struct rb_block`.

## Introduce WB for Env objects

Write barriers (WB) are important for generational and incremental GC (issue (3)). MRI can run without WBs on every object thanks to RGenGC's "wb-unprotected" technique. In fact, we have not introduced WBs for `RubyVM::Env` (Env) objects so far, because doing so has a performance impact: every assignment to a local variable would have to check whether a WB is needed.

However, leaving Env objects unprotected causes performance regressions. For example, if an application creates many Proc objects, the corresponding Env objects are created too, and they must be marked on each minor GC (because they are wb-unprotected). This is what ticket [Bug #10212] shows.

So we need to achieve "low latency WB (for Env objects)".

Current MRI's local variable assignment:

```C
    /* actual assignment in insns.def, setlocal instruction */
    *(ep - idx) = val;
```

A naive implementation with WB would be:

```C
#define VM_EP_IN_HEAP_P(th, ep)   (!((th)->stack <= (ep) && (ep) < ((th)->stack + (th)->stack_size)))

   if (VM_EP_IN_HEAP_P(th, ep)) {
     RB_OBJ_WRITE(VM_ENV_EP_ENVVAL(ep), ep-idx, val);
   }
   else {
     *(ep - idx) = val;
   }
```

This is correct but not fast; in fact, it is too slow when the Env is in the heap (== escaped).

### Approach

First, we need to check whether the local variables are located on (1) the VM stack or (2) an Env. We don't need WB protection for (1) because VM stacks are roots for every GC.

To keep this simple, we move `rb_control_frame_t::flags` to `ep[0]` (as a special local variable) and introduce `VM_ENV_FLAG_ESCAPED`. We can easily check "on stack" (`(flags & VM_ENV_FLAG_ESCAPED) == 0`) or "escaped" (== on Env) (`(flags & VM_ENV_FLAG_ESCAPED) != 0`). We don't need to compare against the VM stack range.

To place the flags in `ep` (local variables), I cleaned up the managed data area in the local variables.

```C
#define VM_ENV_DATA_SIZE             ( 3)

#define VM_ENV_DATA_INDEX_ME_CREF    (-2) /* ep[-2] */
#define VM_ENV_DATA_INDEX_SPECVAL    (-1) /* ep[-1] */
#define VM_ENV_DATA_INDEX_FLAGS      ( 0) /* ep[ 0] */
#define VM_ENV_DATA_INDEX_ENV        ( 1) /* ep[ 1] */
#define VM_ENV_DATA_INDEX_ENV_PROC   ( 2) /* ep[ 2] */
```

This means that 3 (== VM_ENV_DATA_SIZE) special local variables are allocated for each frame (index -2 to 0).
(Note that indexes 1 and 2 are only used by an escaped Env.)
Current MRI already has 2 special local variables (me_cref and specval).
I introduced macro names to avoid magic numbers.

To respect this local variable layout, compile.c requires several fixes, and `rb_iseq_t::local_size` is no longer needed (we can calculate the number of local variables from `local_table_size` and `VM_ENV_DATA_SIZE`).
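
As a rough illustration of this layout, the sketch below places ordinary locals below the three special slots and reads the flags from `ep[0]`. It is an assumption-laden model, not the patch itself: the `toy_*` names, the bit value chosen for the escaped flag, and the exact placement of ordinary locals are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t VALUE;

/* Mirrors the macros above; the TOY_ prefix marks them as sketch-only. */
#define TOY_ENV_DATA_SIZE            3
#define TOY_ENV_DATA_INDEX_ME_CREF (-2)
#define TOY_ENV_DATA_INDEX_SPECVAL (-1)
#define TOY_ENV_DATA_INDEX_FLAGS     0

#define TOY_ENV_FLAG_ESCAPED ((VALUE)0x04) /* hypothetical bit position */

/* Slots needed per frame: ordinary locals plus the special ones. */
static int
toy_frame_slots(int local_table_size)
{
    return local_table_size + TOY_ENV_DATA_SIZE;
}

/* In this model ep points at the flags slot, the last slot of the area. */
static VALUE *
toy_ep(VALUE *base, int local_table_size)
{
    return base + toy_frame_slots(local_table_size) - 1;
}

/* Address of the i-th ordinary local (0-based), located below ep[-2]. */
static VALUE *
toy_local(VALUE *ep, int local_table_size, int i)
{
    assert(i >= 0 && i < local_table_size);
    return ep - (toy_frame_slots(local_table_size) - 1) + i;
}

/* "Escaped" check: one flag test instead of a stack range comparison. */
static int
toy_env_escaped_p(const VALUE *ep)
{
    return (ep[TOY_ENV_DATA_INDEX_FLAGS] & TOY_ENV_FLAG_ESCAPED) != 0;
}
```

With two ordinary locals, this model uses five slots: two locals, then me_cref, specval and flags.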

Another optimization is the `VM_ENV_FLAG_WB_REQUIRED` flag.
It is a very tricky and dangerous technique, so we should not use this hack in other places.
This flag is tightly connected to the current GC implementation.

We need WB protection for "non-remembered old objects (or gray objects in incremental GC)". Once an old object is remembered, it needs no more WB protection until the next marking. `VM_ENV_FLAG_WB_REQUIRED` represents this status.

(1) When an Env object is initialized, `VM_ENV_FLAG_WB_REQUIRED` is true.
(2) At the first local variable assignment, `VM_ENV_FLAG_WB_REQUIRED` is true, so we apply WB protection for this Env object and turn off the flag.
(3) At the next local variable assignment, `VM_ENV_FLAG_WB_REQUIRED` is false, so we can skip WB protection.
(4) When GC marks this Env object, we turn on `VM_ENV_FLAG_WB_REQUIRED` again and go to (2).

The interval between (2) and (4) is usually long enough that only a few WB operations are needed.

Finally, the local variable assignment code looks like this:

```C
NOINLINE(static void vm_env_write_slowpath(const VALUE *ep, int index, VALUE v));

static void
vm_env_write_slowpath(const VALUE *ep, int index, VALUE v)
{
    /* forcibly remember the env value */
    rb_gc_writebarrier_remember(VM_ENV_ENVVAL(ep));
    VM_FORCE_WRITE(&ep[index], v);
    VM_ENV_FLAGS_UNSET(ep, VM_ENV_FLAG_WB_REQUIRED);
}

static inline void
vm_env_write(const VALUE *ep, int index, VALUE v)
{
    VALUE flags = ep[VM_ENV_DATA_INDEX_FLAGS];
    if (LIKELY((flags & VM_ENV_FLAG_WB_REQUIRED) == 0)) {
	VM_STACK_ENV_WRITE(ep, index, v); /* write lvar directly */
    }
    else {
	vm_env_write_slowpath(ep, index, v);
    }
}
```
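
The flag life cycle in steps (1) to (4) can be simulated with a toy model. This is a minimal sketch under assumptions, not MRI code: `struct toy_env`, its counters and all `toy_*` functions are hypothetical, and marking is assumed to re-arm the flag so that the next write takes the slow path again.

```c
#include <assert.h>

#define TOY_ENV_FLAG_WB_REQUIRED 0x01u
#define TOY_ENV_SLOTS 4

/* Hypothetical escaped Env with bookkeeping for the simulation. */
struct toy_env {
    unsigned flags;
    long slots[TOY_ENV_SLOTS];
    int remembered;    /* 1 while the Env is in the remembered set */
    int barrier_calls; /* how many times the slow path ran */
};

/* Step (1): a fresh Env starts with WB_REQUIRED set. */
static void
toy_env_init(struct toy_env *env)
{
    int i;
    env->flags = TOY_ENV_FLAG_WB_REQUIRED;
    for (i = 0; i < TOY_ENV_SLOTS; i++) env->slots[i] = 0;
    env->remembered = 0;
    env->barrier_calls = 0;
}

/* Steps (2) and (3): slow path once, then plain stores. */
static void
toy_env_write(struct toy_env *env, int index, long v)
{
    assert(index >= 0 && index < TOY_ENV_SLOTS);
    if ((env->flags & TOY_ENV_FLAG_WB_REQUIRED) == 0) {
        env->slots[index] = v; /* fast path: already remembered */
    }
    else {
        env->remembered = 1; /* models rb_gc_writebarrier_remember() */
        env->barrier_calls++;
        env->slots[index] = v;
        env->flags &= ~TOY_ENV_FLAG_WB_REQUIRED;
    }
}

/* Step (4): marking clears the remembered state and re-arms the flag. */
static void
toy_gc_mark_env(struct toy_env *env)
{
    env->remembered = 0;
    env->flags |= TOY_ENV_FLAG_WB_REQUIRED;
}
```

In this model, any number of writes between two markings costs exactly one slow-path call, which is the point of the flag.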

With these techniques, RubyVM::Env objects are now WB-protected without a big performance impact.
Proc and Binding objects are now WB-protected as well.

### Short summary

To make Env objects wb-protected, I implemented a low-overhead WB technique.

(1) Move frame flags from `rb_control_frame_t::flags` to `ep[0]` (as a special local variable) and introduce VM_ENV_FLAG_ESCAPED to represent an escaped Env.
(2) Introduce VM_ENV_FLAG_WB_REQUIRED to check whether WB protection is necessary; it is tightly coupled with the GC implementation.
(3) With this technique and other hacks, RubyVM::Env, Proc and Binding objects are now WB-protected.

# Evaluation

By introducing WBs for Env/Proc objects, we improve the throughput of the app_lc_fizzbuzz benchmark.
Method and block invocations are also faster.

Some results:

```
                    trunk  modified
 app_lc_fizzbuzz   58.277    41.729 (sec) (x 1.397 faster)
 vm1_simplereturn*  0.660     0.638 (sec) (x 1.035 faster)
 vm1_yield*         0.738     0.650 (sec) (x 1.135 faster)
```

Some programs became slower.

```
                    trunk  modified
 app_pentomino     14.096    15.241 (sec) (x 0.925 faster == slow)
 vm1_lvar_set*      1.893     1.916 (sec) (x 0.988 faster == slow)
```
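
The "x N faster" column in both tables appears to be the trunk time divided by the modified time, so a value below 1 means a slowdown. The helper below is a hypothetical sketch for checking this; because the displayed times are already rounded, the recomputed ratios only agree with the tables to about three decimal places.

```c
#include <assert.h>

/* Speedup factor: greater than 1 means the modified build is faster. */
static double
toy_speedup(double trunk_sec, double modified_sec)
{
    assert(modified_sec > 0.0);
    return trunk_sec / modified_sec;
}
```

For example, `toy_speedup(14.096, 15.241)` is below 1, matching the app_pentomino row marked "slow".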

lvar_set sets local variables many times, but the impact is small.
I'm not sure why the pentomino puzzle is slower.

All of benchmarks are here:
https://gist.github.com/ko1/c741cd4b2a5a5012364c0686703052b3

# Summary

I made a patch to solve issues (1) to (3).

https://github.com/ruby/ruby/compare/trunk...ko1:block_code

The patch is rather big, but it is difficult for me to split it into smaller pieces,
so I'll commit it all at once soon, sorry.



