Hi,

I had planned to work a bit more on this before posting, but since the
discussion on RubyInRuby has started I think it's better to post now. It
overlaps somewhat with previous posts. Sorry for that.

Below you'll find a proposal for a project to build a Ruby lib with
components for building Ruby virtual machines. It's a bit long and there's
probably lots of muddy thinking in there so I'd appreciate any comments
you might have.

Regards,

Robert


RubyVM - A library of virtual machine components for executing Ruby
programs written in Ruby
*******************************************************************

Robert Feldt, feldt / ce.chalmers.se, version 0.2, 2001-01-08

This text along with links/references related to it can be found at
www.ce.chalmers.se/~feldt/ruby/ideas/rubyvm.

NOTE!
-----
I make a lot of assumptions and guesses about the current and expected
future state of matz's Ruby interpreter. I haven't checked them with
him so they may be outright wrong. I apologize if that is so.

Also note that this text should *NOT* in any way be seen as a
criticism of matz and his future plans for Ruby. matz's work is
amazing. Keep it up man!

Abstract
--------
We argue for a project to develop a modular and OO library of virtual
machine components for executing Ruby programs in Ruby. The components 
from the library can be assembled to construct custom virtual machines 
suited to particular needs/situations where Ruby programs need to be
executed (i.e. different platforms with different constraints/trade-offs).

Introduction/Background
-----------------------
Ruby is a wonderful programming language that we would like to see
live and prosper. The current implementation of Ruby (matz's Ruby
interpreter, herein called MRI) is mainly written in C and has evolved
for more than 7 years. In ruby-talk:8693, matz mentions his long-term
goal to replace MRI with a bytecode interpreter and highlights some
issues/drawbacks with the current one:

I1. Performance - MRI is currently based on a recursive eval function;
  matz thinks this is the bottleneck of performance.

I2. Maintainability - matz worked on it and added features for more than
  7 years; it's quite complicated and only matz himself (?)
  understands all pieces and how they glue together.

I3. Executable-Size - matz mentions MicroRuby (I'm not sure I
  understand exactly what he refers to so I'm guessing!) and there
  has been some discussion on how to construct small/minimal
  interpreters for memory-limited platforms (PDA's etc).

To this list we would like to add (well they are all related so maybe
I should've collapsed them to fewer ones...):

I4. Learnability - Apart from making MRI hard to maintain, its complexity
  makes it difficult for newcomers to go in and check out what is
  really happening in the interpreter. You have to know C and
  understand the structure and assumptions made in MRI.

I5. Development effort/Fun - We like Ruby for a reason, eh? It helps us
  "write better code, be more productive, and enjoy programming more"
  (pickaxe book, Thomas&Hunt) in less time. So why should we implement
  our Ruby language tools in C? (Well, there is one obvious answer but
  we'll get to that later...)

I6. Flexibility - (Related to the previous one) IMHO, Ruby code is also
  better because you're more flexible. By having a well-thought out OO
  design you can plug-and-play various configurations, develop your
  own replacements etc. This can be beneficial both directly and
  indirectly (knowledge from experiments can be used in enhancing
  MRI).

I7. Codebase-Size - (Related to the previous ones) IMHO, Ruby code
  tends to be smaller and more compact than the equivalent C code, thus
  reducing complexity and lowering the threshold for
  understanding/learning about the interpreter.

From matz's description of the next-generation interpreter (hereafter
called MNG) it seems that his main goal is to address issues I1 and
I2. He intends to design a bytecode format for Ruby, implement an
interpreter based on it and in the process clean the interpreter up so
that it is better structured and less complex.

We propose a different approach that partly addresses the same goals:
develop a full Ruby virtual machine in Ruby itself.

RubyVM - Ruby's Squeak
----------------------
The basic idea in this text is that we should do something
similar to the Smalltalk environment Squeak. "Squeak
is an open, highly-portable Smalltalk-80 implementation whose virtual
machine is written entirely in Smalltalk, making it easy to debug,
analyze, and change. To achieve practical performance, a translator
produces an equivalent C program whose performance is comparable to
commercial Smalltalks." (From www.squeak.org)

Squeak has the following main components:
  * Smalltalk interpreter written in a subset of Smalltalk (hereafter
  we call it mSt as in micro Smalltalk) that can be easily compiled to C
  * Compiler compiling mSt to C written in "full" Smalltalk
  * Smalltalk libs written in "full" Smalltalk

mSt is a subset of Smalltalk that maps directly onto C
constructs. It excludes blocks, message sending and even objects.
Methods of the interpreter classes are mapped to C functions and
instance variables are mapped to global vars. The translation to C
yields a speed-up by a factor of 450 compared to running the VM in a
Smalltalk system. Of this, a factor of 3.4 can be attributed to inlining
(that is, before any inlining the C compiler itself might do...).

Once bootstrapped with another Smalltalk environment/interpreter, it has
compiled itself. More details on Squeak can be found in [Ing96].

A final quote from www.squeak.org on why Squeak is important: "How is
Squeak important? Squeak extends the fundamental Smalltalk philosophy
of complete openness -- where everything is available to see,
understand, modify, and extend for whatever purpose -- to include even
the VM. It is a genuine, complete, compact, efficient Smalltalk-80
environment (*not* a toy). It is not specialized for any particular
hardware/OS platform. Porting is easy -- you are not fighting
entrenched platform/OS dependencies to move to a new system or
configuration."

In this text, we propose a Ruby project along the lines of Squeak
called RubyVM.

Overview of RubyVM
------------------
Based on Squeak, we think the basic structure of RubyVM should be
something like this:
* mRb = MicroRubyLanguage, a subset of Ruby with constructs that can
be compiled to static/fast C.
* mrbc = mRb compiler written in (full) Ruby. Compiles mRb programs to C.
* RubyVM-Core = Ruby language/interpreter core components written in mRb.
* RubyStdlib in (full) Ruby = Implementing the standard library in Ruby.
(This should not be a part of RubyVM since MNG
will also have use for it, mainly to address I4-I7 above.) Some kind of
incremental bootstrapping would probably be a good thing, i.e. describe
dependencies between different classes in RubyStdlib and implement them in
an order honoring the dependencies.
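
The dependency-ordered bootstrapping mentioned in the last item could be
sketched with the TSort module from Ruby's standard library. This is only
an illustration: the BootstrapPlan class and the dependency table are
made up for the example, and the listed dependencies are guesses, not a
survey of the real library.

```ruby
require 'tsort'

# Hypothetical dependency table: each key names a stdlib class/module and
# maps to the classes it needs before it can be implemented in Ruby.
# The entries are illustrative guesses only.
STDLIB_DEPS = {
  'Comparable' => [],
  'Enumerable' => [],
  'String'     => ['Comparable'],
  'Array'      => ['Enumerable'],
  'Hash'       => ['Enumerable', 'Array'],
  'Struct'     => ['Enumerable', 'Array', 'Hash'],
}

# Hypothetical helper: orders classes so dependencies come first.
class BootstrapPlan
  include TSort

  def initialize(deps)
    @deps = deps
  end

  # TSort asks for every node in the graph...
  def tsort_each_node(&block)
    @deps.each_key(&block)
  end

  # ...and for the nodes a given node depends on.
  def tsort_each_child(node, &block)
    @deps.fetch(node, []).each(&block)
  end

  # Dependencies are listed before the classes that need them.
  def implementation_order
    tsort
  end
end

puts BootstrapPlan.new(STDLIB_DEPS).implementation_order.join(' -> ')
```

If the table contains a circular dependency, tsort raises TSort::Cyclic,
which would at least point out where two classes have to be bootstrapped
together.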

As an add-on it would probably be nice to have wrappers for MNG's
functionality into the OO design/classes of RubyVM-Core. Thus you can
plug-and-play with components both from RubyVM and MNG.

Alternatives to RubyVM-Core written in mRb
------------------------------------------
1. RubyVM-Core in C (or even assembler). This is basically MNG,
possibly with some differences in design.
    Pros: 
     * Performance
     * Easier to adhere to extension C API
    Con:
     * Issues I4-I7 above!?
2. RubyVM-Core in (pure) Ruby.
    Pro:
     * Use of full Ruby => even easier to code and understand
    Con:
     * Performance
     * Extension C API?

1 is MNG and is developed by matz anyway. 2 might be a good idea if we
get a compiler from full Ruby to fast, native code. However, developing
such a compiler will likely be more difficult than the other
approaches together, and if we could do it this whole discussion
might be superfluous! Over time, though, we expect RubyVM to evolve
together with an OO compiler construction kit so that they (in the
bright, but oh so distant future ;-)) will support the full spectrum of
Ruby execution modes from "app-plus-RubyVM-compiled-to-native-code" to
compact interpreter-style RubyVM's for memory-limited PDA's.

Alternative to compiling mRb to C
---------------------------------
Instead of compiling to C we could compile to native code directly. We
can probably come up with a nice OO design where code generators for
different machines can be plugged in but it will probably be more
difficult to implement the many different optimizations of modern C
compilers.

Misc notes on RubyVM-Core
-------------------------
Design of RubyVM-Core should probably be independent of the actual
format of the parsed/compiled Ruby code so that we can experiment with
different combinations of "tree code" (AST ?! currently used in MRI),
byte code, threaded code, native code as well as Just-in-time/dynamic
and adaptive compilation etc.
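
As a rough sketch of what "independent of the code format" could mean,
the toy below runs the same program through a tree-walking executor and
through a bytecode executor, with a VM object that only assumes the
executor responds to #run. All class names (TreeExecutor,
BytecodeExecutor, ToyVM) and the nested-array program format are
assumptions made for this example, not a proposed design.

```ruby
# Toy program: 2 + 3 * 4, as nested arrays (a stand-in for parsed Ruby code).
PROGRAM = [:add, [:lit, 2], [:mul, [:lit, 3], [:lit, 4]]]

# Executes the tree directly, recursive-eval style.
class TreeExecutor
  def run(node)
    op, a, b = node
    case op
    when :lit then a
    when :add then run(a) + run(b)
    when :mul then run(a) * run(b)
    else raise ArgumentError, "unknown node #{op.inspect}"
    end
  end
end

# Flattens the tree into stack-machine instructions, then interprets them.
class BytecodeExecutor
  def compile(node, code = [])
    op, a, b = node
    case op
    when :lit
      code << [:push, a]
    when :add, :mul
      compile(a, code)
      compile(b, code)
      code << [op]
    else
      raise ArgumentError, "unknown node #{op.inspect}"
    end
    code
  end

  def run(node)
    stack = []
    compile(node).each do |instr, arg|
      case instr
      when :push
        stack.push(arg)
      when :add
        right = stack.pop
        left = stack.pop
        stack.push(left + right)
      when :mul
        right = stack.pop
        left = stack.pop
        stack.push(left * right)
      end
    end
    stack.pop
  end
end

# The VM core does not care which code format the executor uses.
class ToyVM
  def initialize(executor)
    @executor = executor
  end

  def execute(program)
    @executor.run(program)
  end
end

puts ToyVM.new(TreeExecutor.new).execute(PROGRAM)      # prints 14
puts ToyVM.new(BytecodeExecutor.new).execute(PROGRAM)  # prints 14
```

A threaded-code or JIT-style executor would be a third class with the
same #run method.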

We imagine an important part of RubyVM-Core will be to make it
possible to plug in different GC's since the requirements on them will
vary a lot for different platforms (real-time GC etc). It is important to
get the interface right between the GC classes and the ObjectMemory parts
not directly related to GC...
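
One possible shape for that interface, as a minimal sketch: ObjectMemory
owns the heap and the roots and hands both to whatever collector object
it was given. The names Cell, NullCollector, MarkSweepCollector and the
collect(heap, roots) signature are assumptions for illustration only.

```ruby
# A heap cell: a payload plus references to other cells.
class Cell
  attr_reader :value, :refs

  def initialize(value)
    @value = value
    @refs = []
  end
end

# Never reclaims anything; might be enough for short-lived scripts.
class NullCollector
  def collect(heap, roots)
    heap
  end
end

# Keeps only cells reachable from the roots (mark, then sweep).
class MarkSweepCollector
  def collect(heap, roots)
    marked = {}
    pending = roots.dup
    until pending.empty?
      cell = pending.pop
      next if marked[cell]        # already visited, also breaks cycles
      marked[cell] = true
      pending.concat(cell.refs)
    end
    heap.select { |cell| marked[cell] }
  end
end

# Allocation and bookkeeping live here; reclamation policy is plugged in.
class ObjectMemory
  attr_reader :heap, :roots

  def initialize(collector)
    @collector = collector
    @heap = []
    @roots = []
  end

  def allocate(value)
    cell = Cell.new(value)
    @heap << cell
    cell
  end

  # Returns the number of cells that survive the collection.
  def collect_garbage
    @heap = @collector.collect(@heap, @roots)
    @heap.size
  end
end

# Builds two cells that reference each other plus one unreachable cell.
def build_heap(memory)
  a = memory.allocate(:a)
  b = memory.allocate(:b)
  memory.allocate(:garbage)
  a.refs << b
  b.refs << a
  memory.roots << a
  memory
end

puts build_heap(ObjectMemory.new(NullCollector.new)).collect_garbage       # prints 3
puts build_heap(ObjectMemory.new(MarkSweepCollector.new)).collect_garbage  # prints 2
```

A real-time or generational collector would need a richer interface
(write barriers, incremental steps), so this only shows where the seam
could go.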

Design of RubyVM-Core
---------------------

{Go through the current interpreter, extract the core and design
RubyVM-core based on that, the description of the Smalltalk-80 VM, the
VM+adaptive compiler for Self, and the latest Java VM stuff. Anyone
care to help? :-)}

Compiler - Compiles Ruby source code into CompiledMethod objects.
	   For example, Ruby parsing is in here.
Interpreter - Interprets CompiledMethods.
ObjectMemory - Handles allocation and reclamation of objects in memory.
	       For example, GC is in here.

Notes on mRb and mrbc
---------------------
There will need to be some interface to the OS and it will likely have
to be written in C. A potential way to clean this up would be to allow
interpreter classes to be written with C code inside the Ruby code,
in a way similar to Perl's Inline module [ref?]. Smalltalk/X also has
this. Probably easy to add when doing mrbc since it can simply copy
inlined C code to its output...
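
To show how small the inline-C part of mrbc might be, here is a toy
translator sketch. It follows the Squeak mapping described earlier
(methods become C functions, instance variables become global vars) and
copies inline C bodies straight to its output. The ToyMrbc class, the
hash-based class description and the :inline_c key are all invented for
this example; a real mrbc would work on parsed mRb source.

```ruby
# Hypothetical, already-parsed mRb class description.
COUNTER = {
  name:     'Counter',
  includes: ['time.h'],
  ivars:    { 'count' => 'long' },
  methods:  [
    { name: 'increment', returns: 'long',
      body: [[:assign, 'count', [:add, [:ivar, 'count'], [:lit, 1]]],
             [:return, [:ivar, 'count']]] },
    # Body written directly in C; the translator does not look inside it.
    { name: 'now', returns: 'long',
      inline_c: 'return (long) time(NULL);' },
  ],
}

class ToyMrbc
  # Returns C source text for one class description.
  def translate(klass)
    @prefix = klass[:name]
    lines = klass.fetch(:includes, []).map { |header| "#include <#{header}>" }
    klass[:ivars].each { |name, type| lines << "#{type} #{global(name)};" }
    klass[:methods].each do |meth|
      lines << "#{meth[:returns]} #{@prefix}_#{meth[:name]}(void) {"
      if meth[:inline_c]
        lines << "  #{meth[:inline_c]}"   # copied through verbatim
      else
        meth[:body].each { |stmt| lines << "  #{statement(stmt)}" }
      end
      lines << "}"
    end
    lines.join("\n") + "\n"
  end

  private

  # Instance variables become prefixed global variables.
  def global(name)
    "#{@prefix}_#{name}"
  end

  def statement(stmt)
    kind, a, b = stmt
    case kind
    when :assign then "#{global(a)} = #{expression(b)};"
    when :return then "return #{expression(a)};"
    else raise ArgumentError, "not in the toy mRb subset: #{kind.inspect}"
    end
  end

  def expression(expr)
    kind, a, b = expr
    case kind
    when :lit  then a.to_s
    when :ivar then global(a)
    when :add  then "(#{expression(a)} + #{expression(b)})"
    else raise ArgumentError, "not in the toy mRb subset: #{kind.inspect}"
    end
  end
end

puts ToyMrbc.new.translate(COUNTER)
```

Anything outside the supported constructs is rejected with an error,
which is roughly how a restricted subset like mRb would have to be
enforced.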

Problems with proposed approach
-------------------------------
* Supporting existing C API: If existing extensions should be usable
  with RubyVM we need to support the existing C API (in
  ruby.h). Unclear if it is possible if the designs of MRI/MNG and
  RubyVM are very different.

* How to support threading in VM-Core? Can aspect-oriented programming
  be of any help, ie. specifying the critical VM code without having
  to alter the code itself?

* If we restrict mRb too much we might as well write RubyVM in C!?

* Will anyone really want to write this stuff? Well, I'd like to be a
  part of it but anyone else?

Proposed "principles"/"goals" when developing RubyVM
----------------------------------------------------
* Avoid premature optimization - Do things as simple and as high-level
  as possible at first. Then optimize / go to lower-levels if
  profiling shows performance bottlenecks.

* Modularity driven to its extreme. RubyVM should describe a family of
  Ruby VM's. Everything should be pluggable. We want to experiment and
  tune VM's so that we can even construct an optimal VM for running a
  particular Ruby script!

* Ruby execution semantics same as MRI.

{add more here..., clean up etc}

Final notes
-----------
The main problem with this approach seems to be that the performance
of resulting VM's may not be satisfactory. We think this objection will
be of minor importance because:
1. The diversity of execution platforms calls for a diversity of
execution models/VM design decisions. One, hand-optimized,
C-implemented Ruby interpreter may not be the optimal choice for all
platforms.
2. New VM techniques can be more easily, faster and more portably
implemented in Ruby than in C so in the long-term flexibility might
win over performance.

We should probably learn from the fast Self and Smalltalk
implementations around. For example, [Wol99] mentions the key
VM/compiler implementation techniques for good performance of OO langs:

* Adaptive optimization based on feedback. Shouldn't devote attention
to compiling all parts of a program when all real programs spend most
of their time in a small part of the program. (RF comment: I think
I've read elsewhere that the Self compiler uses profiling to get type
information (feedback) so that dynamic dispatch can be avoided most of
the time)
* Efficient object representation.
* Aggressive inlining and fast sends.
* High-speed object allocation, access and reclamation.

and the stuff lacking in the Self VM that made implementing Smalltalk and
Java on top of it more difficult (might be some lessons for us in here also
/RF):

* Blocks with arbitrary life-time (Smalltalk)
* unboxed floats and long integers (Self only supports 30-bit integers)
* arbitrary control flow within a method (Java) (gotos for multiple
branching) 
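
To give a feel for the "feedback" and "fast sends" items in the first
list above, here is a minimal sketch of a monomorphic inline cache: a
call site remembers the receiver class it saw last and the method found
for it, so repeated sends to the same class skip the full lookup. The
recorded class is also a crude form of type feedback. The CallSite class
is hypothetical, it ignores singleton methods, and it is not meant as a
description of how the Self VM actually does this.

```ruby
# One call site in a hypothetical interpreter.
class CallSite
  attr_reader :hits, :misses, :cached_class

  def initialize(selector)
    @selector = selector
    @cached_class = nil
    @cached_method = nil
    @hits = 0
    @misses = 0
  end

  def send_to(receiver, *args)
    klass = receiver.class
    if klass.equal?(@cached_class)
      @hits += 1                    # fast path: reuse the cached method
    else
      @misses += 1                  # slow path: full lookup, then re-cache
      @cached_class = klass
      @cached_method = klass.instance_method(@selector)
    end
    @cached_method.bind(receiver).call(*args)
  end
end

site = CallSite.new(:to_s)
results = [1, 2, 3, :sym, 4].map { |obj| site.send_to(obj) }
puts results.inspect
puts "hits=#{site.hits} misses=#{site.misses}"   # prints hits=2 misses=3
```

An adaptive compiler could use the hit/miss counts and the cached class
to decide which call sites are worth inlining.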

We should probably learn from IBM's Jalapeno since it is also a XinX
system (but X is Java so not as close to Ruby as Smalltalk). Jalapeno
differs from Squeak by using its own native code generators (instead of
generating C). Yet another VM to learn from (for small architectures)
might be Dis, the VM for Inferno (from Lucent ?).

References
----------
[Ing96] Dan Ingalls, Ted Kaehler, John Maloney, Scott Wallace, and Alan
Kay. "Back to the Future - The Story of Squeak, A Practical Smalltalk
Written in Itself". Authors at Apple Computer while doing this work, now
at Walt Disney Imagineering, 1401 Flower Street, P.O. Box 25020,
Glendale, CA 91221, dani / wdi.disney.com,
ftp://st.cs.uiuc.edu/Smalltalk/Squeak/docs/OOPSLA.Squeak.html.

[Jec99] Jecel M. de Assumpcao Jr. "Incremental Porting of the Self/R
Virtual Machine", Merlin Computers, jecel / lsi.usp.br, position paper at
OOPSLA'99 workshop on VM's.

[Wol99] Mario Wolczko, Ole Agesen, and David Ungar. "Towards a
Universal Implementation Substrate for Object-Oriented Languages", Sun
Microsystems Laboratories, OOPSLA'99 workshop on VM's.