On Mon, Jan 11, 2010 at 8:27 AM, Paul Brannan <pbrannan / atdesk.com> wrote:
> For one application where we deployed both JRuby and Ruby 1.8, JRuby was
> able to significantly outperform Ruby 1.8 in terms of total cpu
> utilization and latency. However, JRuby with the default GC settings
> still periodically became unresponsive for about 1-2 seconds throughout
> the running time of the application. Switching to an incremental GC
> made these gaps disappear, but by that point it was just an experiment
> for my own curiosity; the amount of effort to set up JRuby and tune the
> GC exceeded the time allocated to the project. In the end I went on a
> one-month vacation, and when I came back, the entire application had
> been rewritten in C++.

Some clarification of the pluggable GC stuff in the JVM (at least what
I know from Hotspot/OpenJDK...the other two major JVMs have their own
unique collectors).

* None of the major JVMs ever give heap space back to the system; if
they grow to 200MB, they'll never use less than 200MB. The
justification is that if you've run a process up to a certain size,
you're likely to need that size. It is possible to limit the growth
with the -Xms and -Xmx flags, which set the minimum and maximum heap
size.
* The GCs are not *live* swappable; you pick them at startup and
that's what you use for the lifetime of the process.
* There are many tunable settings, but most of those are also not
live; you set them at startup.
* Many of the settings can be left at defaults since Hotspot will try
to pick an appropriate GC and GC settings for your hardware (and will
adjust some GC behavior at runtime depending on how your application
behaves).
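
As a rough illustration of the startup-only nature of these settings,
here is a minimal Ruby sketch that assembles a JRuby command line with
the heap bounds and collector fixed up front. -Xms and -Xmx are the
standard HotSpot heap flags, and JRuby forwards anything prefixed with
-J to the JVM. The helper name jruby_heap_command and the 64m/200m
sizes are assumptions made up for the example, not something from the
original message.

```ruby
require "shellwords"

# Hypothetical helper: returns the argument list for a JRuby run whose
# heap limits and collector are chosen at startup.
#   min_heap / max_heap -> passed through as -J-Xms / -J-Xmx
#   gc_flag             -> e.g. "UseSerialGC", passed as -J-XX:+<flag>
def jruby_heap_command(script, min_heap: "64m", max_heap: "200m", gc_flag: nil)
  cmd = ["jruby", "-J-Xms#{min_heap}", "-J-Xmx#{max_heap}"]
  cmd << "-J-XX:+#{gc_flag}" if gc_flag
  cmd + ["-e", script]
end

benchmark = "(1..7).each{|n| a = ['a']*(10**n); a.inspect;}"

# Print a shell-escaped version that can be pasted into a terminal.
puts Shellwords.join(jruby_heap_command(benchmark, gc_flag: "UseSerialGC"))
```

Because the result is an array, it can also be handed straight to
system(*cmd) without going through the shell.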

There are obviously way more tunables than most people ever need to
set, and running with the default "GC ergonomics" generally lets
Hotspot pick reasonable settings. It's not always perfect, of course,
so having the tunables is nice.

I thought it would be interesting to get some timings and GC output
for the three main JVM GCs, given the OP's benchmark:

The serial GC is the basic single-threaded collector.

~/projects/jruby  time jruby -J-XX:+UseSerialGC -e "(1..7).each{|n| a
= ['a']*(10**n); a.inspect;}"
real	0m4.538s
user	0m4.359s
sys	0m0.324s

The Parallel GC uses multiple threads to reduce pauses. My system has
2 cores, so I believe it defaults to 2 GC threads...both this and the
Concurrent Mark/Sweep collector would be more interesting on a system
with more cores.

~/projects/jruby  time jruby -J-XX:+UseParallelGC -e "(1..7).each{|n|
a = ['a']*(10**n); a.inspect;}"

real	0m5.489s
user	0m5.383s
sys	0m0.485s

~/projects/jruby  time jruby -J-XX:+UseParallelGC
-J-XX:ParallelGCThreads=1 -e "(1..7).each{|n| a = ['a']*(10**n);
a.inspect;}"

real	0m8.984s
user	0m5.793s
sys	0m0.605s

~/projects/jruby  time jruby -J-XX:+UseParallelGC
-J-XX:ParallelGCThreads=2 -e "(1..7).each{|n| a = ['a']*(10**n);
a.inspect;}"

real	0m5.251s
user	0m5.132s
sys	0m0.510s

~/projects/jruby  time jruby -J-XX:+UseParallelGC
-J-XX:ParallelGCThreads=3 -e "(1..7).each{|n| a = ['a']*(10**n);
a.inspect;}"

real	0m5.685s
user	0m5.146s
sys	0m0.658s
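
The thread-count runs above can be scripted rather than typed by hand.
This is only a sketch: it measures wall-clock time with
Benchmark.realtime (so it corresponds to the "real" figure, not user or
sys), and the names parallel_gc_command and sweep_gc_threads, along
with the injectable runner, are assumptions introduced for the example.

```ruby
require "benchmark"

BENCHMARK = "(1..7).each{|n| a = ['a']*(10**n); a.inspect;}"

# Argument list for one run of the benchmark with a fixed number of
# parallel GC threads.
def parallel_gc_command(threads, script = BENCHMARK)
  ["jruby", "-J-XX:+UseParallelGC",
   "-J-XX:ParallelGCThreads=#{threads}", "-e", script]
end

# Runs the benchmark once per thread count and returns [count, seconds]
# pairs. The runner is a parameter so the sweep can be tried out without
# a JRuby install; the default shells out and discards stdout.
def sweep_gc_threads(counts, runner: ->(cmd) { system(*cmd, out: File::NULL) })
  counts.map do |n|
    seconds = Benchmark.realtime { runner.call(parallel_gc_command(n)) }
    [n, seconds]
  end
end

# Example use (needs jruby on the PATH):
#   sweep_gc_threads([1, 2, 3]).each do |n, s|
#     puts format("%d GC thread(s): %.3fs", n, s)
#   end
```

A single run per setting is noisy, especially since each one includes
JVM startup, so repeating every count a few times and comparing the
fastest runs would give a fairer picture.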

The Concurrent Mark-Sweep collector runs the mark and sweep phases of
GC concurrently with program execution, but still stops the world for
compacting:

~/projects/jruby  time jruby -J-XX:+UseConcMarkSweepGC -e
"(1..7).each{|n| a = ['a']*(10**n); a.inspect;}"

real	0m4.860s
user	0m5.077s
sys	0m0.349s

The G1 "Garbage First" collector splits the heap into many equal-sized
regions and collects the regions with the most garbage first; its
generations are logical sets of regions rather than fixed spaces. It
is intended to eventually replace the CMS GC and provide more
predictable pauses (CMS can occasionally cause really long full GC
pauses under high load):

~/projects/jruby  time jruby -J-XX:+UnlockExperimentalVMOptions
-J-XX:+UseG1GC -e "(1..7).each{|n| a = ['a']*(10**n); a.inspect;}"

real	0m9.098s
user	0m9.709s
sys	0m1.076s
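
To compare the default-settings runs side by side, the "real" figures
from the timings above can be converted to seconds and sorted. A small
sketch follows; the times are copied from this message, while the names
RESULTS, time_to_seconds and ranked are assumptions for the example.

```ruby
# Wall-clock ("real") times from the runs with each collector's
# default settings, keyed by the -XX flag that selected it.
RESULTS = {
  "UseSerialGC"        => "0m4.538s",
  "UseParallelGC"      => "0m5.489s",
  "UseConcMarkSweepGC" => "0m4.860s",
  "UseG1GC"            => "0m9.098s",
}

# Converts the shell's "<minutes>m<seconds>s" format into seconds.
def time_to_seconds(text)
  m = text.match(/\A(\d+)m(\d+(?:\.\d+)?)s\z/)
  raise ArgumentError, "unexpected time format: #{text}" unless m
  m[1].to_i * 60 + m[2].to_f
end

# Returns [flag, seconds] pairs, fastest first.
def ranked(results)
  results.map { |gc, t| [gc, time_to_seconds(t)] }.sort_by { |_, s| s }
end

ranked(RESULTS).each { |gc, s| puts format("%-20s %.3fs", gc, s) }
```

These are single runs of a short, allocation-heavy script on a 2-core
machine, so the ordering says little about how the collectors would
behave in a long-running, latency-sensitive application.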

Here's a nice article on tuning GC for Java 5 (which is EOL, but the
article applies well to Java 6+):

http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html

It might help guide potential solutions for MRI's GC.

- Charlie