Issue #14370 has been updated by tenderlovemaking (Aaron Patterson).

File iseq_mark.diff added
File benchmark_methods.diff added
File bench.rb added

ko1 (Koichi Sasada) wrote:
> Cool.
>
> Did you verify the references between your patch and current implementation?
>
> You can have two sets of referring objects from iseq->mark_ary and iseq (w/ your patch) and you can compare them to verify the reference.

Yes. I had to add marking for the catch table iseqs as well as marking the instructions. To verify the array contents, I used the iterator code from this patch to print the addresses referenced by the instructions, then I compared those addresses to the addresses inside the mark array.

I added code to the GC to trace C stacks for allocations (I think I could add this to the allocation_tracer gem):

https://gist.github.com/tenderlove/4e74aebeb308cfa00306a186a26c0b8d

Using the stack information plus the mark_ary addresses and the disasm addresses, I was able to make them match.

Unfortunately I cannot eliminate the mark array entirely, because we store coverage information, flip-flop info, and the result of the `once` instruction (one-time compiled regexp) in the mark array:

https://github.com/ruby/ruby/blob/4b88d9c1c49b2ea33aa91995251b955aa8b90953/vm_insnhelper.c#L3292-L3293

I think `once` isn't used very often, so it's fine to let the mark array grow for that. Also, the current mark_ary is small enough to be embedded, so I don't think there is a reason to remove it completely now.

Before my patch, the mark array contained:

1. Coverage info
2. Flip-flop counts
3. Original iseq
4. Result of `once` instruction
5. Literals
6. Child iseqs
7. Jump table entries

This patch removes 5, 6, and 7 from the array. Literals and child iseqs are marked by walking the instructions, and I added a loop in the mark function to iterate over and mark the jump tables.

> > I figured this might be the case. This patch should make ISeq marking slower, but I wasn't sure if it would be mitigated by the fact that the ISeq objects get old. It seems like RGenGC solves the problem.
>
> As you two said, it is not problem I also think (hopefully). But I want to know "how slow".

I've attached two patches. One is an update so that `make test-all` passes; the second is a patch that tests iterating over the instructions and times the iteration. I've also attached the benchmark I used to time instruction iteration. Here is the output:

~~~
Warming up --------------------------------------
            107 fast    51.019k i/100ms
            987 fast     6.558k i/100ms
           9987 fast   681.000  i/100ms
          99987 fast    68.000  i/100ms
             98 slow    35.413k i/100ms
            998 slow     3.895k i/100ms
          10004 slow   397.000  i/100ms
          99998 slow    39.000  i/100ms
Calculating -------------------------------------
            107 fast    592.822k (± 4.3%) i/s -      2.959M in   5.002552s
            987 fast     67.417k (± 3.5%) i/s -    341.016k in   5.064936s
           9987 fast      6.671k (± 5.4%) i/s -     33.369k in   5.020232s
          99987 fast    653.789  (± 8.3%) i/s -      3.264k in   5.032978s
             98 slow    378.918k (±10.4%) i/s -      1.912M in   5.113554s
            998 slow     39.021k (± 5.3%) i/s -    194.750k in   5.007117s
          10004 slow      3.894k (± 5.3%) i/s -     19.850k in   5.112539s
          99998 slow    387.217  (± 6.7%) i/s -      1.950k in   5.065411s
~~~

I made a fast test and a slow test. The first number on each line is the size of iseq_encoded; for example, "107 fast" means iseq_encoded is 107 `VALUE`s long. I expected iteration time to be linear with respect to the length of iseq_encoded, so the larger iseqs should be proportionally slower.

The iterator has to translate the encoded iseq back to the original instruction value so that it can get the operand count.
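For reference, the per-instruction walk looks roughly like the sketch below. This is not the attached patch verbatim; `addr_to_insn`, `insn_length`, and `insn_operand_types` are illustrative stand-ins for the internal translation and metadata helpers, and only the `TS_VALUE` / `TS_ISEQ` operand kinds are shown:

~~~c
/* Simplified sketch of direct operand marking (not the exact patch code).
 * Walk iseq_encoded, translate each encoded instruction back to its
 * opcode, and mark any object operands instead of keeping them alive
 * through mark_ary. Helper names here are hypothetical stand-ins. */
static void
iseq_mark_insn_operands(const struct rb_iseq_constant_body *body)
{
    const VALUE *code = body->iseq_encoded;
    unsigned int pos = 0;

    while (pos < body->iseq_size) {
        /* With direct threaded code, code[pos] is a jump address and must
         * be translated back to the opcode number (a linear search). */
        int insn = addr_to_insn((const void *)code[pos]);
        const char *types = insn_operand_types(insn); /* operand type chars */
        int len = insn_length(insn);                  /* 1 + operand count */

        for (int i = 1; i < len; i++) {
            switch (types[i - 1]) {
              case TS_VALUE: /* literal objects: strings, ranges, ... */
                rb_gc_mark(code[pos + i]);
                break;
              case TS_ISEQ:  /* child iseqs: blocks, etc. */
                rb_gc_mark((VALUE)code[pos + i]);
                break;
              default:
                break;       /* non-object operands: indexes, offsets, caches */
            }
        }
        pos += len;
    }
}
~~~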
Translating the iseq is also a linear search:

https://github.com/ruby/ruby/blob/36d91068ed9297cb792735f93f31d0bf186afeec/iseq.c#L117-L129

So I made a "best case" test (the fast test) and a "worst case" test (the slow test). The "best case" test contains many `getlocal` instructions, because `getlocal` is close to the start of the list in insns.inc. The "worst case" test contains many `putobject_INT2FIX_1_` instructions, an instruction near the end of the list.

For the fast test, the mean is 65,491,636 `VALUE`s per second:

~~~
> mark_perf_fast$pointers * mark_perf_fast$ips
[1] 63431968 66540097 66624108 65370370
> mean(mark_perf_fast$pointers * mark_perf_fast$ips)
[1] 65491636
~~~

For the slow test, the mean is 38,439,253 `VALUE`s per second:

~~~
> mark_perf_slow$pointers * mark_perf_slow$ips
[1] 37133984 38943026 38959108 38720894
> mean(mark_perf_slow$pointers * mark_perf_slow$ips)
[1] 38439253
~~~

Of course, this benchmark doesn't test instructions that have more operands, and iseq objects that don't have anything to mark will still have to scan `iseq_encoded` to find references. As I said, I didn't see any performance changes with `make gcbench-rdoc`. If this time is too slow, I think we can add some optimizations (like a bitmap, or an optimization for iseqs that have zero references in `iseq_encoded`). If there is some other benchmark you have in mind, please let me know!

----------------------------------------
Feature #14370: Directly mark instruction operands and avoid mark_ary usage on rb_iseq_constant_body
https://bugs.ruby-lang.org/issues/14370#change-69644

* Author: tenderlovemaking (Aaron Patterson)
* Status: Open
* Priority: Normal
* Assignee: ko1 (Koichi Sasada)
* Target version:
----------------------------------------
Hi,

I've attached a patch that changes rb_iseq_mark to directly mark instruction operands rather than adding them to a mark array. I observed a ~3% memory reduction by directly marking operands, and I didn't observe any difference in GC time.

To test memory usage, I used a basic Rails application, logged all malloc / free calls to a file, then wrote a script that would sum the live memory at each sample (each sample being a call to malloc). I graphed these totals so that I could see the memory usage as malloc calls were made:

The red line is trunk, the blue line is trunk + the patch I've attached. Since the X axis is the sample number (not time), the blue line is not as long as the red line because the blue line calls `malloc` fewer times. The Y axis in the graph is the total number of "live" bytes that have been allocated (all allocations minus their corresponding frees). You can see from the graph that memory savings start adding up as more code gets loaded.

I was concerned that this patch might impact GC time, but `make gcbench-rdoc` didn't seem to show any significant difference in GC time between trunk and this patch. If it turns out there is a performance impact, I think I could improve the time while still keeping memory usage low by generating a bitmap during iseq compilation.

There is a bit more information where I've been working, but I think I've summarized everything here.

https://github.com/github/ruby/pull/39

---Files--------------------------------
iseq_mark.diff (6.28 KB)
iseq_mark.diff (6.28 KB)
iseq_mark.diff (7.26 KB)
benchmark_methods.diff (1.23 KB)
bench.rb (3.01 KB)