Issue #18045 has been updated by duerst (Martin Dürst).
I think this should really be discussed seriously at least once at a committer's meeting. I have added it to #18039.
Using powers of 2 slot sizes is the "tried and true" way of handling this, but it may lead to some memory overhead. What about e.g. using powers of two, but keeping a separate slot size for 40 bytes, so that RVALUEs don't incur overhead?
Also, it may be worth doing a pass through the various object types (TArray, TString, ...) at this time to figure out cases where what is currently an RVALUE can be dealt with in 32 bytes, would benefit from being extended to 64 bytes, or can otherwise be improved, or risks some additional overhead.
----------------------------------------
Feature #18045: Variable Width Allocation Phase II
https://bugs.ruby-lang.org/issues/18045#change-93114
* Author: peterzhu2118 (Peter Zhu)
* Status: Open
* Priority: Normal
* Assignee: peterzhu2118 (Peter Zhu)
----------------------------------------
# GitHub PR: https://github.com/ruby/ruby/pull/4680
# Feature description
Since merging the initial implementation in #17570, we've been working on improving the performance of the allocator (which was one of the major drawbacks of the initial implementation). We've chosen to return to using a freelist-based allocator and added support for variable slot size pages in what we call "size pools". We're still keeping the `USE_RVARGC` compile-time flag, which maintains the current behaviour when it is not explicitly enabled.
## Summary
- We now use pages with different slot sizes in pools to speed up allocation. Objects will be allocated in the smallest slot size that fits the requested size. This brings us back to the freelist-based allocation algorithm and significantly increases allocation performance.
- The heap growth algorithm has been changed to adapt to the growth pattern of the size pool. Heaps that rapidly allocate and deallocate are considered "growth heaps" and are treated differently than heaps that are stable in size.
## Size pools
In this patch, we introduce a new structure into the GC called "size pools". Each size pool contains an eden and a tomb heap to hold the pages. The size pool determines the slot size of the pages in it. We've chosen powers of 2 multiples of the RVALUE size as our slot sizes. In other words, since the RVALUE size is 40 bytes, size pool 0 will have pages with slot size 40B, 80B for size pool 1, 160B for size pool 2, 320B for size pool 3, etc. The reason we chose powers of 2 multiples of RVALUE is that powers of 2 are easy to work with (e.g. we can use bit shifts) and using multiples of RVALUE means that we do not waste space for single-RVALUE (40 byte) allocations. When VWA is not enabled, there is only one size pool.
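As a rough illustration of this mapping, the slot size of pool *i* is simply the RVALUE size shifted left by *i*, and an allocation picks the smallest pool that fits. This is a minimal sketch under the 40-byte RVALUE assumption, not the actual gc.c code; `slot_size_for_pool` and `size_pool_idx_for` are hypothetical names:
```c
#include <stddef.h>

#define RVALUE_SIZE 40   /* base slot size: one RVALUE */

/* Slot size of pool i is a power-of-2 multiple of RVALUE_SIZE:
 * pool 0 -> 40B, pool 1 -> 80B, pool 2 -> 160B, pool 3 -> 320B, ... */
static size_t
slot_size_for_pool(size_t pool_idx)
{
    return (size_t)RVALUE_SIZE << pool_idx;  /* bit shift, no multiply */
}

/* Index of the smallest size pool whose slots fit the requested size. */
static size_t
size_pool_idx_for(size_t requested_size)
{
    size_t idx = 0;
    while (slot_size_for_pool(idx) < requested_size) idx++;
    return idx;
}
```
For example, a 100-byte allocation would land in pool 2 (160B slots), while a plain 40-byte RVALUE stays in pool 0 with no overhead.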
## Heap growth algorithm
This patch changes heap growth when `USE_RVARGC` is turned on. Heap growth is now calculated after sweeping (rather than after marking). This is because we don't know the characteristics of a size pool after marking (we only know the number of marked slots vs. the total number of pages). By keeping track of the number of free slots and swept slots of each size pool during sweeping, we know the exact number of live vs. free slots in each size pool. This allows us to grow the size pool according to its growth characteristics. For example, classes are allocated in the third size pool (160B slot size). For most workloads, classes are allocated at boot and very rarely allocated afterwards. Through the data collected during sweeping, we can determine that this size pool is no longer growing and thus allow it to be very full.
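To make the mechanism concrete, the post-sweep decision could look roughly like this. This is a hypothetical sketch of the heuristic described above, not the patch's code; the struct fields and both thresholds are invented for illustration:
```c
#include <stdbool.h>
#include <stddef.h>

/* Per-size-pool counters accumulated while sweeping (illustrative). */
struct size_pool_stats {
    size_t freed_slots;   /* slots freed during this sweep */
    size_t swept_slots;   /* total slots swept */
    size_t free_slots;    /* free slots remaining after sweep */
    size_t total_slots;   /* total capacity of the pool */
};

/* A pool that keeps churning through its slots is a "growth heap" and
 * is grown eagerly; a stable pool (e.g. classes after boot) is allowed
 * to run nearly full before new pages are added. */
static bool
should_grow_pool(const struct size_pool_stats *s)
{
    double freed_ratio = s->swept_slots
        ? (double)s->freed_slots / (double)s->swept_slots : 0.0;
    double free_ratio = s->total_slots
        ? (double)s->free_slots / (double)s->total_slots : 0.0;

    if (freed_ratio > 0.20)          /* rapid allocate/deallocate churn */
        return free_ratio < 0.40;    /* keep generous headroom */
    return free_ratio < 0.05;        /* stable pool: let it fill up */
}
```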
## Lazy sweeping
At every sweeping step, we attempt to sweep a little bit of every size pool. If the size pool we're allocating into didn't yield a page during sweeping and that size pool is not allowed to create new pages, then we must finish sweeping by sweeping all remaining pages in all other size pools. This may cause some lazy sweeping steps to take longer than others; we can see the result of this in the worse p99 response time in railsbench.
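In sketch form, a lazy sweeping step then behaves as follows. This is our hedged reading of the paragraph above with invented helper names, not the actual implementation:
```c
#include <stdbool.h>
#include <stddef.h>

struct size_pool { int placeholder; };  /* stands in for the GC's pool type */

/* Hypothetical helpers standing in for the real sweeping machinery. */
bool sweep_some_pages(struct size_pool *pool);          /* sweep a few pages */
bool pool_has_free_page(const struct size_pool *pool);  /* did a page free up? */
bool pool_allowed_to_grow(const struct size_pool *pool);
void sweep_all_remaining_pages(struct size_pool *pool);

/* One lazy sweeping step, following the description above. */
void
lazy_sweep_step(struct size_pool *pools, size_t pool_count,
                size_t alloc_pool_idx)
{
    /* Sweep a little bit of every size pool. */
    for (size_t i = 0; i < pool_count; i++)
        sweep_some_pages(&pools[i]);

    /* If the pool we're allocating into yielded no page and may not
     * create new ones, sweeping must finish now. This is the source of
     * the occasional long step behind the worse p99 in railsbench. */
    struct size_pool *target = &pools[alloc_pool_idx];
    if (!pool_has_free_page(target) && !pool_allowed_to_grow(target)) {
        for (size_t i = 0; i < pool_count; i++)
            sweep_all_remaining_pages(&pools[i]);
    }
}
```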
# Benchmark setup
Benchmarking was done on a bare-metal Ubuntu machine on AWS. All benchmark results use glibc by default, except when jemalloc is explicitly specified.
```
$ uname -a
Linux 5.8.0-1038-aws #40~20.04.1-Ubuntu SMP Thu Jun 17 13:25:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
```
glibc version:
```
$ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.2) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
```
jemalloc version:
```
$ apt list --installed | grep jemalloc
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libjemalloc-dev/focal,now 5.2.1-1ubuntu1 amd64 [installed]
libjemalloc2/focal,now 5.2.1-1ubuntu1 amd64 [installed,automatic]
```
To measure memory usage over time, the [mstat tool](https://github.com/bpowers/mstat) was used.
Master was benchmarked on commit 84fea8ee39249ff9e7a03c407e5d16ad93074f3e. The branch was rebased on top of the same commit.
Performance benchmarks of this branch without VWA turned on are included to make sure this patch does not introduce a performance regression compared to master.
# Benchmark results
## Summary
- We don't expect to see a significant performance improvement with this patch. In fact, the main goal of this patch is for Variable Width Allocation to have performance and memory usage comparable to master.
- railsbench:
  - VWA uses about 1.02x more memory than master.
  - When using jemalloc, VWA is about 1.027x faster than master.
  - We see no speedup (within the margin of error) when using glibc.
  - In both glibc and jemalloc we see the worse p99 response times described above in the "Lazy sweeping" section. However, the p100 response times of VWA are comparable to master.
- rdoc generation:
- VWA uses 1.13x less memory than master.
- liquid benchmarks:
- We see no significant performance changes here.
- optcarrot benchmark:
- We see no significant performance changes here.
## railsbench
For railsbench, we ran the [railsbench benchmark](https://github.com/k0kubun/railsbench/blob/master/bin/bench). For both the performance and memory benchmarks, 50 runs were conducted for each combination (branch + glibc, master + glibc, branch + jemalloc, master + jemalloc).
### glibc
For glibc, the RPS of all three configurations is within the margin of error. However, the p99 of the branch with VWA is significantly worse than master (3.2ms vs 1.8ms). This is due to the longer lazy sweeping steps discussed above; it does not, however, affect the p100 response times.
```
+-----------+-----------------+------------------+--------+
| | Branch (VWA on) | Branch (VWA off) | Master |
+-----------+-----------------+------------------+--------+
| RPS | 731.69 | 724.19 | 727.02 |
| p50 (ms) | 1.33 | 1.38 | 1.37 |
| p66 (ms) | 1.37 | 1.41 | 1.40 |
| p75 (ms) | 1.38 | 1.42 | 1.41 |
| p80 (ms) | 1.39 | 1.43 | 1.42 |
| p90 (ms) | 1.41 | 1.45 | 1.45 |
| p95 (ms) | 1.44 | 1.48 | 1.47 |
| p98 (ms) | 1.50 | 1.78 | 1.76 |
| p99 (ms) | 3.23 | 1.84 | 1.83 |
| p100 (ms) | 15.74 | 16.48 | 16.38 |
+-----------+-----------------+------------------+--------+
```
For memory usage, we see a slight increase when using Variable Width Allocation compared to master.

[plot: memory usage over time, VWA vs. master]

Average max memory usage for VWA: 147.08 MB
Average max memory usage for master: 143.49 MB
VWA uses 1.025x more memory.
### jemalloc
For jemalloc, we see that the branch is 1.027x faster than master in RPS (the standard deviation is about 3 RPS). We once again see the worse p99 response times with VWA enabled.
```
+-----------+-----------------+------------------+--------+
| | Branch (VWA on) | Branch (VWA off) | Master |
+-----------+-----------------+------------------+--------+
| RPS | 756.06 | 741.26 | 736.31 |
| p50 (ms) | 1.30 | 1.34 | 1.34 |
| p66 (ms) | 1.32 | 1.36 | 1.37 |
| p75 (ms) | 1.33 | 1.37 | 1.39 |
| p80 (ms) | 1.34 | 1.38 | 1.39 |
| p90 (ms) | 1.36 | 1.40 | 1.41 |
| p95 (ms) | 1.38 | 1.42 | 1.44 |
| p98 (ms) | 1.42 | 1.71 | 1.70 |
| p99 (ms) | 2.74 | 1.78 | 1.78 |
| p100 (ms) | 13.42 | 16.80 | 16.45 |
+-----------+-----------------+------------------+--------+
```
Once again, we see a slight increase in memory usage with VWA compared to master.

[plot: memory usage over time, VWA vs. master]

Average max memory usage for VWA: 145.47 MB
Average max memory usage for master: 142.65 MB
VWA uses 1.020x more memory.
## rdoc generation
For rdoc generation, we ran the following script for the branch and master to collect memory usage.
```
for x in $(seq 50)
do
  sudo rm -rf .ext/rdoc; mstat -o mstat/branch_$x.tsv -freq 59 -- ruby --disable-gems "./libexec/rdoc" --root "." --encoding=UTF-8 --all --ri --op ".ext/rdoc" --page-dir "./doc" --no-force-update "."
done
```
For rdoc generation, we see a decrease in memory usage when using Variable Width Allocation.
### glibc
[plot: memory usage over time, VWA vs. master]
Average max memory usage for VWA: 328.83 MB
Average max memory usage for master: 373.17 MB
VWA uses 1.13x less memory.
### jemalloc
[plot: memory usage over time, VWA vs. master]
Average max memory usage for VWA: 308.08 MB
Average max memory usage for master: 347.07 MB
VWA uses 1.13x less memory.
## Liquid benchmarks
For the liquid benchmarks, we ran the [liquid benchmark](https://github.com/Shopify/liquid/blob/master/performance/benchmark.rb), averaged over 5 runs. We don't see any significant performance improvements or regressions here.
```
+----------------------+-----------------+------------------+--------+
| | Branch (VWA on) | Branch (VWA off) | Master |
+----------------------+-----------------+------------------+--------+
| Parse (i/s) | 39.60 | 39.56 | 39.45 |
| Render (i/s) | 127.69 | 126.91 | 127.06 |
| Parse & Render (i/s) | 28.42 | 28.40 | 28.30 |
+----------------------+-----------------+------------------+--------+
```
## optcarrot
For optcarrot, we ran the [optcarrot benchmark](https://github.com/mame/optcarrot/blob/master/bin/optcarrot-bench), averaged over 5 runs. Once again, we don't see any significant performance improvements or regressions here.
```
+-----+-----------------+------------------+--------+
| | Branch (VWA on) | Branch (VWA off) | Master |
+-----+-----------------+------------------+--------+
| fps | 44.54 | 45.02 | 44.15 |
+-----+-----------------+------------------+--------+
```
# Future plans
- Improve compaction support within size pools. Compaction currently only compacts the first size pool (i.e. the size pool with 40B slots).
- Improve Ractor support. The Ractor cache currently only caches the first size pool (the size pool with 40B slots). Any >40B allocations require locking the VM.
- Increase the coverage of Variable Width Allocation. Our next candidates are arrays and strings.
- Investigate ways to support resizing of objects within Variable Width Allocation.
  - One idea is to support cross size pool compaction, so that when an object requests to be resized it is moved to the most optimal size pool.
- Use powers of 2 slot sizes in heap pages. We currently use powers of 2 multiples of the RVALUE size, but it would be easier and more efficient to use powers of 2 slot sizes (i.e. 32B, 64B, 128B, etc.). However, this would mean that 24B would be wasted for objects that are 40B in size (i.e. occupy a single RVALUE), since they would be allocated into 64B slots (see the sketch after this list). We believe that once more objects can take advantage of Variable Width Allocation, we will be able to shrink the size of RVALUE to 32B. As such, we only plan on investigating this once most, if not all, objects can be allocated through Variable Width Allocation.
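As a quick, hypothetical illustration of the waste trade-off in that last item (the rounding helpers below are ours, not Ruby's):
```c
#include <stddef.h>
#include <stdio.h>

#define RVALUE_SIZE 40

/* Current scheme: round up to a power-of-2 multiple of RVALUE_SIZE. */
static size_t round_rvalue_multiple(size_t n)
{
    size_t slot = RVALUE_SIZE;
    while (slot < n) slot <<= 1;   /* 40, 80, 160, ... */
    return slot;
}

/* Proposed scheme: round up to a plain power of 2 (>= 32B). */
static size_t round_pow2(size_t n)
{
    size_t slot = 32;
    while (slot < n) slot <<= 1;   /* 32, 64, 128, ... */
    return slot;
}

int main(void)
{
    size_t n = 40;  /* a single-RVALUE object */
    printf("RVALUE multiples: %zu-byte slot, %zu bytes wasted\n",
           round_rvalue_multiple(n), round_rvalue_multiple(n) - n);
    printf("powers of 2:      %zu-byte slot, %zu bytes wasted\n",
           round_pow2(n), round_pow2(n) - n);
    /* Prints a 40B slot with 0B wasted vs. a 64B slot with 24B wasted,
     * which is the overhead the RVALUE-multiple scheme avoids today. */
    return 0;
}
```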