Hi!

Do you have interst to visit Japan and discuss Japanese ruby committers?
If you have interst, I will ask someone to pay your travel fare.

Thanks,
Koichi

On 2016/07/18 12:24, vmakarov / redhat.com wrote:
> Issue #12589 has been reported by Vladimir Makarov.
> 
> ----------------------------------------
> Bug #12589: VM performance improvement proposal
> https://bugs.ruby-lang.org/issues/12589
> 
> * Author: Vladimir Makarov
> * Status: Open
> * Priority: Normal
> * Assignee: 
> * ruby -v: 
> * Backport: 2.1: UNKNOWN, 2.2: UNKNOWN, 2.3: UNKNOWN
> ----------------------------------------
>   Hello.  I'd like to start a big MRI project but I don't want to
> disrupt somebody else plans.  Therefore I'd like to have MRI
> developer's opinion on the proposed project or information if somebody
> is already working on an analogous project.
> 
>   Basically I want to improve overall MRI VM performance:
> 
>   * First of all, I'd like to change VM insns and move from
>     **stack-based** insns to **register transfer** ones.  The idea behind
>     it is to decrease VM dispatch overhead as approximately 2 times
>     less RTL insns are necessary than stack based insns for the same
>     program (for Ruby it is probably even less as a typical Ruby program
>     contains a lot of method calls and the arguments are passed through
>     the stack).
> 
>     But *decreasing memory traffic* is even more important advantage
>     of RTL insns as an RTL insn can address temporaries (stack) and
>     local variables in any combination.  So there is no necessity to
>     put an insn result on the stack and then move it to a local
>     variable or put variable value on the stack and then use it as an
>     insn operand.  Insns doing more also provide a bigger scope for C
>     compiler optimizations.
> 
>     The biggest changes will be in files compile.c and insns.def (they
>     will be basically rewritten).  **So the project is not a new VM
>     machine.  MRI VM is much more than these 2 files.**
> 
>     The disadvantage of RTL insns is a bigger insn memory footprint
>     (which can be upto 30% more) although as I wrote there are fewer
>     number of RTL insns.
> 
>     Another disadvantage of RTL insns *specifically* for Ruby is that
>     insns for call sequences will be basically the same stack based
>     ones but only bigger as they address the stack explicitly.
> 
>   * Secondly, I'd like to **combine some frequent insn sequences** into
>     bigger insns.  Again it decreases insn dispatch overhead and
>     memory traffic even more.  Also it permits to remove some type
>     checking.
> 
>     The first thing on my mind is a sequence of a compare insn and a
>     branch and using immediate operands besides temporary (stack) and
>     local variables.  Also it is not a trivial task for Ruby as the
>     compare can be implemented as a method.
> 
>   I already did some experiments.  RTL insns & combining insns permits
> to speed the following micro-benchmark in more 2 times:
> 
> ```
> i = 0
> while i<30_000_000 # benchmark loop 1
>   i += 1
> end
> ```
> 
> The generated RTL insns for the benchmark are
> 
> ```
> == disasm: #<ISeq:<main>@while.rb>======================================
> == catch table
> | catch type: break  st: 0007 ed: 0020 sp: 0000 cont: 0020
> | catch type: next   st: 0007 ed: 0020 sp: 0000 cont: 0005
> | catch type: redo   st: 0007 ed: 0020 sp: 0000 cont: 0007
> |------------------------------------------------------------------------
> local table (size: 2, temp: 1, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
> [ 2] i
> 0000 set_local_val    2, 0                                            (   1)
> 0003 jump             13                                              (   2)
> 0005 jump             13
> 0007 plusi            <callcache>, 2, 2, 1, -1                        (   3)
> 0013 btlti            7, <callcache>, -1, 2, 30000000, -1             (   2)
> 0020 local_ret        2, 0                                            (   3)
> ```
> 
> In this experiment I ignored trace insns (that is another story) and a
> complication that a integer compare insn can be re-implemented as a
> Ruby method.  Insn bflti is combination of LT immediate compare and
> branch true.
> 
> A modification of fib benchmark is sped up in 1.35 times:
> 
> ```
> def fib_m n
>   if n < 1
>     1
>   else
>     fib_m(n-1) * fib_m(n-2)
>   end
> end
> 
> fib_m(40)
> ```
> 
> The RTL code of fib_m looks like
> 
> ```
> == disasm: #<ISeq:fib_m / fm.rb>==========================================
> local table (size: 2, temp: 3, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
> [ 2] n<Arg>
> 0000 bflti            10, <callcache>, -1, 2, 1, -1                   (   2)
> 0007 val_ret          1, 16
> 0010 minusi           <callcache>, -2, 2, 1, -2                       (   5)
> 0016 simple_call_self <callinfo!mid:fib_m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>, -1
> 0020 minusi           <callcache>, -3, 2, 2, -3
> 0026 simple_call_self <callinfo!mid:fib_m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>, -2
> 0030 mult             <callcache>, -1, -1, -2, -1
> 0036 temp_ret         -1, 16
> ```
> 
> In reality, the improvement of most programs probably will be about
> 10%.  That is because of very dynamic nature of Ruby (a lot of calls,
> checks for redefinition of basic type operations, checking overflows
> to switch to GMP numbers).  For example, integer addition can not be
> less than about x86-64 17 insns out of the current 50 insns on the
> fast path.  So even if you make the rest (33) insns 2 times faster,
> the improvement will be only 30%.
> 
> A very important part of MRI performance improvement is to make calls
> fast because there are a lot of them in Ruby but as I read in some
> Koichi Sasada's presentations he pays a lot of attention to it.  So I
> don't want to touch it.
> 
>   * Thirdly.  I want to implement the insns as small inline functions
>     for future AOT compiler, of course, if the projects described
>     above are successful.  It will permit easy AOT generation of C code
>     which will be basically calls of the functions.
> 
>     I'd like to implement AOT compiler which will generate a Ruby
>     method code, call a C compiler to generate a binary shared code
>     and load it into MRI for subsequent calls.  The key is to minimize
>     the compilation time.  There are many approaches to do it but I
>     don't want to discuss it right now.
> 
>     C generation is easy and most portable implementation of AOT but
>     in future it is possible to use GCC JIT plugin or LLVM IR to
>     decrease overhead of C scanner/parser.
> 
>     C compiler will see a bigger scope (all method insns) to do
>     optimizations.  I think using AOT can give another 10%
>     improvement.  It is not that big again because of dynamic nature
>     of Ruby and any C compiler is not smart enough to figure out
>     aliasing for typical generated C program.
> 
>     The life with the performance point of view would be easy if Ruby
>     did not permit to redefine basic operations for basic types,
>     e.g. plus for integer.  In this case we could evaluate types of
>     operands and results using some data flow analysis and generate
>     faster specialized insns.  Still a gradual typing if it is
>     introduced in future versions of Ruby would help to generate such
>     faster insns.
> 
>   Again I wrote this proposal for discussion as I don't want to be in
> a position to compete with somebody else ongoing big project.  It
> might be counterproductive for MRI development.  Especially I don't
> want it because the project is big and long and probably will have a
> lot of tehcnical obstacles and have a possibilty to be a failure.
> 
> 
> 
> 


-- 
// SASADA Koichi at atdot dot net

Unsubscribe: <mailto:ruby-core-request / ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>