Issue #14370 has been updated by tenderlovemaking (Aaron Patterson).

File iseq_mark.diff added
File benchmark_methods.diff added
File bench.rb added

ko1 (Koichi Sasada) wrote:
> Cool.
> 
> Did you verify the references between your patch and current implementation?
>
> You can have two sets of referring objects from iseq->mark_ary and iseq (w/ your patch) and you can compare them to verify the reference.

Yes.  I had to add marking for catch table iseqs as well as marking the instructions.  To verify the array contents, I used the iterator code from this patch to print the addresses from the instructions, then compared those addresses to the addresses inside the mark array.  I also added code to the GC to trace C stacks for allocations (I think I could add this to the allocation_tracer gem):

  https://gist.github.com/tenderlove/4e74aebeb308cfa00306a186a26c0b8d

Using stack information plus mark_ary address and disasm addresses, I was able to make them match.

Unfortunately I cannot eliminate the mark array entirely, because we store coverage information, flip-flop info, and the result of the `once` instruction (one-time compiled regexps) in the mark array:

  https://github.com/ruby/ruby/blob/4b88d9c1c49b2ea33aa91995251b955aa8b90953/vm_insnhelper.c#L3292-L3293

I think `once` isn't very popular, so it's fine to let the mark array grow for that.  Also, the current mark_ary is small enough to be embedded, so I don't think there is a reason to remove it completely right now.
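As a side note, the `once` instruction is what backs one-time interpolated regexps (`/#{...}/o`): the regexp is built the first time the instruction runs, and the cached result is reused afterwards, which is why that cached object must stay reachable via the mark array. A small demonstration (the method name is just for illustration):

```ruby
# A regexp literal with the /o flag is interpolated only once.
# The compiled Regexp is cached (via the `once` instruction) and
# reused on every subsequent call, ignoring the new argument.
def match_once(str, pattern)
  str =~ /#{pattern}/o
end

match_once("foo", "f")  # compiles and caches /f/
match_once("bar", "b")  # still matches against the cached /f/
```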

Before my patch, the mark array contained:

1. Coverage info
2. Flip-flop counts
3. Original iseq
4. Result of `once` instruction
5. Literals
6. Child iseqs
7. Jump table entries

This patch removes 5, 6, and 7 from the array.  Literals and child iseqs are marked by walking the instructions, and I added a loop in the mark function to iterate and mark jump tables.
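The C-level walk itself isn't visible from Ruby, but the kinds of references it has to find (literals and child iseqs embedded as instruction operands) can be seen in the serialized form from `RubyVM::InstructionSequence#to_a`. This is only a Ruby-level analogy to what the mark function now discovers by decoding instructions, not code from the patch:

```ruby
# to_a returns a "SimpleDataFormat" array whose last element is
# the instruction list; instructions are arrays of
# [opname, operand, operand, ...], and child iseqs show up there
# as nested serialized arrays.
iseq = RubyVM::InstructionSequence.compile(<<~RUBY)
  x = "a literal string"
  [1, 2].each { |i| i }
RUBY

insns    = iseq.to_a.last                  # instruction list
operands = insns.grep(Array).flat_map { |insn| insn.drop(1) }

has_literal    = operands.include?("a literal string")
has_child_iseq = operands.any? do |op|
  op.is_a?(Array) && op.first == "YARVInstructionSequence/SimpleDataFormat"
end
```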

> > I figured this might be the case. This patch should make ISeq marking slower, but I wasn't sure if it would be mitigated by the fact that the ISeq objects get old. It seems like RGenGC solves the problem.
> 
> As you two said, it is not problem I also think (hopefully). But I want to know "how slow".

I've attached 2 patches.  One is an update so that `make test-all` passes; the second is a patch that tests iterating over the instructions and times the iteration.  I've also attached the benchmark I used to time instruction iteration.  Here is the output:

~~~
Warming up --------------------------------------
            107 fast    51.019k i/100ms
            987 fast     6.558k i/100ms
           9987 fast   681.000  i/100ms
          99987 fast    68.000  i/100ms
             98 slow    35.413k i/100ms
            998 slow     3.895k i/100ms
          10004 slow   397.000  i/100ms
          99998 slow    39.000  i/100ms
Calculating -------------------------------------
            107 fast    592.822k ( 4.3%) i/s -      2.959M in   5.002552s
            987 fast     67.417k ( 3.5%) i/s -    341.016k in   5.064936s
           9987 fast      6.671k ( 5.4%) i/s -     33.369k in   5.020232s
          99987 fast    653.789  ( 8.3%) i/s -      3.264k in   5.032978s
             98 slow    378.918k (10.4%) i/s -      1.912M in   5.113554s
            998 slow     39.021k ( 5.3%) i/s -    194.750k in   5.007117s
          10004 slow      3.894k ( 5.3%) i/s -     19.850k in   5.112539s
          99998 slow    387.217  ( 6.7%) i/s -      1.950k in   5.065411s
~~~

I made a fast test and a slow test.  The first number on each line is the size of iseq_encoded; for example, "107 fast" means iseq_encoded is 107 `VALUE`s long.  I expected the time to be linear with respect to the size of iseq_encoded.  The iterator has to translate each encoded instruction back to the original instruction value so that it can get the operand count, and that translation is also a linear search:

  https://github.com/ruby/ruby/blob/36d91068ed9297cb792735f93f31d0bf186afeec/iseq.c#L117-L129

So I made a "best case" test, which is the fast test, and a "worst case" test, which is the slow test.  The "best case" test contains many `getlocal` instructions, because `getlocal` is close to the start of insns.inc.  The "worst case" test contains lots of `putobject_INT2FIX_1_` instructions, an instruction near the end of the list.
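This is roughly what the attached bench.rb exercises; as a quick stand-alone illustration (not the attached benchmark itself), the two instruction kinds are visible in the disassembly:

```ruby
# `getlocal` variants sit near the start of insns.inc, so the
# linear search that translates an encoded instruction back to
# its original value finds them quickly; the specialized
# putobject_INT2FIX_1_ instruction sits near the end.
fast = RubyVM::InstructionSequence.compile("a = nil; [a, a, a]")
slow = RubyVM::InstructionSequence.compile("a = 1")

fast_has_getlocal  = fast.disasm.include?("getlocal")
slow_has_putobject = slow.disasm.include?("putobject_INT2FIX_1_")
```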

For the fast test, the mean is 65,491,636 `VALUE` per second:

~~~
> mark_perf_fast$pointers * mark_perf_fast$ips
[1] 63431968 66540097 66624108 65370370
> mean(mark_perf_fast$pointers * mark_perf_fast$ips)
[1] 65491636
~~~

For the slow test, the mean is 38,439,253 `VALUE` per second:

~~~
> mark_perf_slow$pointers * mark_perf_slow$ips
[1] 37133984 38943026 38959108 38720894
> mean(mark_perf_slow$pointers * mark_perf_slow$ips)
[1] 38439253
~~~
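The R sessions above just multiply each run's `iseq_encoded` length by its iterations per second and take the mean; the same arithmetic in Ruby, using the products printed above:

```ruby
# pointers * ips for each fast/slow run, copied from the R output.
fast = [63_431_968, 66_540_097, 66_624_108, 65_370_370]
slow = [37_133_984, 38_943_026, 38_959_108, 38_720_894]

fast_mean = (fast.sum / fast.size.to_f).round  # VALUEs per second
slow_mean = (slow.sum / slow.size.to_f).round
```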

Of course, this benchmark doesn't test instructions that have more operands, and iseq objects that have nothing to mark still have to scan `iseq_encoded` to find references.

As I said, I didn't see any performance changes with `make gcbench-rdoc`.  If this time is too slow, I think we can add some optimizations (like a bitmap, or a fast path for iseqs that have zero references in `iseq_encoded`).  If there is some other benchmark you have in mind, please let me know!
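To sketch the bitmap idea (this is only an illustration of the approach, with hypothetical names, not code from the patch): during compilation, set bit i whenever slot i of `iseq_encoded` holds a markable VALUE; the mark function then scans bits instead of decoding instructions:

```ruby
# Hypothetical sketch: one bit per iseq_encoded slot. Marking
# visits only slots whose bit is set, skipping the linear
# instruction-table search entirely.
def build_mark_bitmap(markable_indexes)
  markable_indexes.inject(0) { |bm, i| bm | (1 << i) }
end

def each_markable_slot(encoded, bitmap)
  encoded.each_index do |i|
    yield encoded[i] if bitmap[i] == 1  # Integer#[] reads bit i
  end
end

# Toy "encoded" stream: only index 1 holds a markable object.
encoded = [:putobject, "literal", :getlocal, 3, :leave]
bitmap  = build_mark_bitmap([1])

marked = []
each_markable_slot(encoded, bitmap) { |v| marked << v }
```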

----------------------------------------
Feature #14370: Directly mark instruction operands and avoid mark_ary usage on rb_iseq_constant_body
https://bugs.ruby-lang.org/issues/14370#change-69644

* Author: tenderlovemaking (Aaron Patterson)
* Status: Open
* Priority: Normal
* Assignee: ko1 (Koichi Sasada)
* Target version: 
----------------------------------------
Hi,

I've attached a patch that changes rb_iseq_mark to directly mark instruction operands rather than adding them to a mark array.  I observed a ~3% memory reduction by directly marking operands, and I didn't observe any difference in GC time.  To test memory usage, I used a basic Rails application, logged all malloc / free calls to a file, then wrote a script that would sum the live memory at each sample (each sample being a call to malloc).  I graphed these totals so that I could see the memory usage as malloc calls were made:

![memory usage graph](https://user-images.githubusercontent.com/3124/35020270-1b0ded20-fae0-11e7-9cbd-1d028a6c9484.png)

The red line is trunk, the blue line is trunk + the patch I've attached.  Since the X axis is sample number (not time), the blue line is not as long as the red line because the blue line calls `malloc` fewer times.  The Y axis in the graph is the total number of "live" bytes that have been allocated (all allocations minus their corresponding frees).  You can see from the graph that memory savings start adding up as more code gets loaded.
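The live-byte computation described above can be sketched like this (the log format here is made up for illustration; the real script read my malloc/free trace file):

```ruby
# Keep a running total of live bytes from a malloc/free event
# log, emitting one sample per malloc call (hypothetical format:
# [:malloc, addr, size] and [:free, addr]).
def live_byte_samples(events)
  sizes   = {}   # address => allocated size
  live    = 0
  samples = []
  events.each do |op, addr, size|
    case op
    when :malloc
      sizes[addr] = size
      live += size
      samples << live
    when :free
      live -= sizes.delete(addr) || 0
    end
  end
  samples
end

log = [[:malloc, 0x1, 100], [:malloc, 0x2, 50],
       [:free, 0x1], [:malloc, 0x3, 25]]
samples = live_byte_samples(log)
```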

I was concerned that this patch might impact GC time, but `make gcbench-rdoc` didn't seem to show any significant difference in GC time between trunk and this patch.  If it turns out there is a performance impact, I think I could improve the time while still keeping memory usage low by generating a bitmap during iseq compilation.

There is a bit more information on the pull request where I've been working, but I think I've summarized everything here:

  https://github.com/github/ruby/pull/39


---Files--------------------------------
iseq_mark.diff (6.28 KB)
iseq_mark.diff (7.26 KB)
benchmark_methods.diff (1.23 KB)
bench.rb (3.01 KB)


-- 
https://bugs.ruby-lang.org/
