Issue #16837 has been updated by shyouhei (Shyouhei Urabe).


Some analysis of the slowdown.

Looking at the generated binary and the `perf` output, the slowdown comes from some functions not being inlined.  It may depend on the compiler, but for me `rb_array_len()` is one such victim:

```
zsh % gdb -batch -ex 'file miniruby' -ex 'disassemble rb_array_len'
Dump of assembler code for function rb_array_len:
   0x0000000000295540 <+0>:     push   %rbx
   0x0000000000295541 <+1>:     mov    %rdi,%rbx
   0x0000000000295544 <+4>:     test   $0x7,%bl
   0x0000000000295547 <+7>:     jne    0x2955be <rb_array_len+126>
   0x0000000000295549 <+9>:     mov    %rbx,%rax
   0x000000000029554c <+12>:    and    $0xfffffffffffffff7,%rax
   0x0000000000295550 <+16>:    je     0x2955be <rb_array_len+126>
   0x0000000000295552 <+18>:    mov    (%rbx),%rax
   0x0000000000295555 <+21>:    mov    %eax,%edx
   0x0000000000295557 <+23>:    and    $0x1f,%edx
   0x000000000029555a <+26>:    mov    $0x7,%ecx
   0x000000000029555f <+31>:    cmp    $0x7,%edx
   0x0000000000295562 <+34>:    jne    0x295585 <rb_array_len+69>
   0x0000000000295564 <+36>:    test   $0x2000,%eax
   0x0000000000295569 <+41>:    jne    0x295571 <rb_array_len+49>
   0x000000000029556b <+43>:    mov    0x10(%rbx),%rax
   0x000000000029556f <+47>:    pop    %rbx
   0x0000000000295570 <+48>:    retq
   0x0000000000295571 <+49>:    cmp    $0x7,%ecx
   0x0000000000295574 <+52>:    jne    0x2955a2 <rb_array_len+98>
   0x0000000000295576 <+54>:    test   $0x2000,%eax
   0x000000000029557b <+59>:    je     0x2955ea <rb_array_len+170>
   0x000000000029557d <+61>:    shr    $0xf,%eax
   0x0000000000295580 <+64>:    and    $0x3,%eax
   0x0000000000295583 <+67>:    pop    %rbx
   0x0000000000295584 <+68>:    retq
   0x0000000000295585 <+69>:    mov    %rbx,%rdi
   0x0000000000295588 <+72>:    mov    $0x7,%esi
   0x000000000029558d <+77>:    callq  0xcaea2 <rb_check_type>
   0x0000000000295592 <+82>:    mov    (%rbx),%rax
   0x0000000000295595 <+85>:    mov    %eax,%ecx
   0x0000000000295597 <+87>:    and    $0x1f,%ecx
   0x000000000029559a <+90>:    cmp    $0x1b,%rcx
   0x000000000029559e <+94>:    jne    0x295564 <rb_array_len+36>
   0x00000000002955a0 <+96>:    jmp    0x2955cb <rb_array_len+139>
   0x00000000002955a2 <+98>:    mov    %rbx,%rdi
   0x00000000002955a5 <+101>:   mov    $0x7,%esi
   0x00000000002955aa <+106>:   callq  0xcaea2 <rb_check_type>
   0x00000000002955af <+111>:   mov    (%rbx),%rax
   0x00000000002955b2 <+114>:   mov    %eax,%ecx
   0x00000000002955b4 <+116>:   and    $0x1f,%ecx
   0x00000000002955b7 <+119>:   cmp    $0x1b,%ecx
   0x00000000002955ba <+122>:   jne    0x295576 <rb_array_len+54>
   0x00000000002955bc <+124>:   jmp    0x2955cb <rb_array_len+139>
   0x00000000002955be <+126>:   mov    %rbx,%rdi
   0x00000000002955c1 <+129>:   mov    $0x7,%esi
   0x00000000002955c6 <+134>:   callq  0xcaea2 <rb_check_type>
   0x00000000002955cb <+139>:   lea    0x142fe(%rip),%rdi        # 0x2a98d0
   0x00000000002955d2 <+146>:   lea    0x1432f(%rip),%rdx        # 0x2a9908
   0x00000000002955d9 <+153>:   lea    0x14337(%rip),%rcx        # 0x2a9917
   0x00000000002955e0 <+160>:   mov    $0xea,%esi
   0x00000000002955e5 <+165>:   callq  0xcad86 <rb_assert_failure>
   0x00000000002955ea <+170>:   lea    0x14338(%rip),%rdi        # 0x2a9929
   0x00000000002955f1 <+177>:   lea    0x1436d(%rip),%rdx        # 0x2a9965
   0x00000000002955f8 <+184>:   lea    0x14377(%rip),%rcx        # 0x2a9976
   0x00000000002955ff <+191>:   mov    $0x79,%esi
   0x0000000000295604 <+196>:   callq  0xcad86 <rb_assert_failure>
End of assembler dump.
```

Here, the assertions practically never fail.  This means the jumps guarding them are predicted almost 100% of the time, so they are close to a no-op and do not slow things down by themselves.  The problem is the branches that are never taken.  If you read the assembly, you can see that almost 2/3 of the function above (the `rb_check_type` and `rb_assert_failure` paths) is practically never reached.  Those paths bloat the generated code significantly.  As a result, `rb_array_len` is now considered too big to be inlined, at least by my compiler.
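
To make the shape of the problem concrete, here is a rough toy model in C.  It is not Ruby's actual header code: every name in it (`toy_array`, `TOY_ASSERT`, `toy_assert_failure`, and so on) is made up for illustration, and only the constants (type mask `0x1f`, array type `0x7`, embed flag `0x2000`, length in bits 15-16, heap length at offset `0x10`) are read off the disassembly above.  Each assertion adds one cheap test on the hot path plus one out-of-line failure call, and it is those failure calls that make the function grow.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy model only; all names are hypothetical, not Ruby's real API. */
enum {
    TYPE_MASK       = 0x1f,   /* low bits of flags hold the type tag */
    TYPE_ARRAY      = 0x07,
    EMBED_FLAG      = 0x2000, /* set when the length lives in the flags */
    EMBED_LEN_SHIFT = 15,
    EMBED_LEN_MASK  = 0x3
};

struct toy_array {
    unsigned long flags;
    unsigned long klass;
    long heap_len;            /* at offset 0x10 on a typical LP64 target */
};

/* Cold path: prints a message and aborts, so it never returns. */
static void toy_assert_failure(const char *file, int line, const char *expr)
{
    fprintf(stderr, "%s:%d: assertion failed: %s\n", file, line, expr);
    abort();
}

/* One well-predicted test on the hot path, one call on the cold path. */
#define TOY_ASSERT(expr) \
    ((expr) ? (void)0 : toy_assert_failure(__FILE__, __LINE__, #expr))

/* Helper that re-checks its own preconditions, as nested accessors do. */
static inline long toy_embed_len(const struct toy_array *a)
{
    TOY_ASSERT((a->flags & TYPE_MASK) == TYPE_ARRAY);
    TOY_ASSERT(a->flags & EMBED_FLAG);
    return (long)((a->flags >> EMBED_LEN_SHIFT) & EMBED_LEN_MASK);
}

/* The hot function: two loads and a branch, plus the assertion paths. */
static inline long toy_array_len(const struct toy_array *a)
{
    TOY_ASSERT((a->flags & TYPE_MASK) == TYPE_ARRAY);
    if (a->flags & EMBED_FLAG) {
        return toy_embed_len(a);
    }
    return a->heap_len;
}
```

Compiling something like this with and without the `TOY_ASSERT` lines and comparing the disassembly should show the same pattern as above, though the exact result will differ between compilers and optimization levels.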

An obvious ad-hoc remedy is to supply `__attribute__((__always_inline__))` for everything.  But I don't think that's a good idea, because what gets inlined and what does not depends heavily on the compiler, its version, the target architecture, and almost everything else.
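
For reference, the ad-hoc remedy would look roughly like the sketch below.  The macro name `TOY_FORCE_INLINE` and the two functions are hypothetical; only the attribute itself is real, and it is a GCC/Clang extension, which is why the sketch falls back to plain `static inline` elsewhere.

```c
#include <stddef.h>

/* Hypothetical macro name; the attribute is a GCC/Clang extension. */
#if defined(__GNUC__)
# define TOY_FORCE_INLINE static inline __attribute__((__always_inline__))
#else
# define TOY_FORCE_INLINE static inline
#endif

/* Hypothetical accessor standing in for a small hot function.
 * Assumption: the length is stored in the third word of the object. */
TOY_FORCE_INLINE size_t toy_len(const size_t *header)
{
    return header[2];
}

/* A caller: with the attribute, the compiler is told to expand
 * toy_len() here regardless of its own size heuristics. */
static size_t toy_total_len(const size_t *a, const size_t *b)
{
    return toy_len(a) + toy_len(b);
}
```

The attribute overrides the compiler's cost model rather than fixing the size of the function, so any assertion paths would still be copied into every call site.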

----------------------------------------
Feature #16837: Can we make Ruby 3.0 as fast as Ruby 2.7 with the new assertions?
https://bugs.ruby-lang.org/issues/16837#change-85423

* Author: k0kubun (Takashi Kokubun)
* Status: Open
* Priority: Normal
----------------------------------------
## Problem
How can we make Ruby 3.0 as fast as (or faster than) Ruby 2.7?

### Background
* The "Split ruby.h" pull request https://github.com/ruby/ruby/pull/2991 added some new assertions
* While they have been helpful for revealing various bugs, they also made some Ruby programs notably slower, especially Optcarrot https://benchmark-driver.github.io/benchmarks/optcarrot/commits.html

## Possible approaches
I have no strong preference yet. Here are some random ideas:

* Optimize the assertion code somehow
* Enable the new assertions only on CIs, at least ones in hot spots
  * Not sure which places have large impact on Optcarrot yet
* Make some other not-so-important assertions CI-only to offset the impact from new ones
* Provide .so for an assertion-enabled mode? (ko1's idea)
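
As a rough illustration of the "CI-only" idea in the list above, an assertion macro could be switched at compile time.  This is only a sketch under assumptions: `TOY_ENABLE_ASSERTIONS`, `TOY_ASSERT`, `toy_fail` and `toy_embed_len` are hypothetical names, not existing Ruby macros or functions.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical switch: a CI build would pass -DTOY_ENABLE_ASSERTIONS=1. */
#ifndef TOY_ENABLE_ASSERTIONS
# define TOY_ENABLE_ASSERTIONS 0
#endif

#if TOY_ENABLE_ASSERTIONS
/* Failure path exists only in assertion-enabled builds. */
static void toy_fail(const char *file, int line, const char *expr)
{
    fprintf(stderr, "%s:%d: assertion failed: %s\n", file, line, expr);
    abort();
}
# define TOY_ASSERT(expr) \
    ((expr) ? (void)0 : toy_fail(__FILE__, __LINE__, #expr))
#else
/* Expands to nothing: the expression is not evaluated at all. */
# define TOY_ASSERT(expr) ((void)0)
#endif

/* Hypothetical hot-spot accessor: the check costs nothing by default. */
static inline long toy_embed_len(unsigned long flags)
{
    TOY_ASSERT(flags & 0x2000UL);
    return (long)((flags >> 15) & 0x3);
}
```

With the switch off, the accessor reduces to a shift and a mask, which is the kind of body a compiler is far more likely to inline.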

I hope people will post more ideas in this ticket.


