Thanks for the info.  It seems my patch changes object allocation counts
enough to throw GC off for this benchmark.  Having more or fewer threads
or other objects changes the effect.
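
One rough way to check this is to compare allocation and GC counts
around the benchmark body, before and after the patch.  This is only a
sketch: gc_delta is a hypothetical helper (not part of the benchmark
suite), the loop only approximates a create/join benchmark, and the
:total_allocated_objects key assumes Ruby 2.2 or later.

```ruby
# Hypothetical helper: reports how many objects were allocated and how
# many GC runs happened while the block executed.
def gc_delta
  before = GC.stat
  yield
  after = GC.stat
  {
    allocated: after[:total_allocated_objects] - before[:total_allocated_objects],
    gc_runs:   after[:count] - before[:count],
  }
end

# Approximate create/join loop: each thread is joined before the next
# one is created.
delta = gc_delta do
  1_000.times { Thread.new { }.join }
end

puts "objects allocated: #{delta[:allocated]}"
puts "GC runs:           #{delta[:gc_runs]}"
```

If the allocation count shifts between the two builds while the GC run
count jumps, the timing difference is probably GC noise rather than a
scheduler change.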

But in general, thread scheduler benchmarks with many concurrent
threads are not very reliable in my experience (the mutex benchmarks
are notoriously unreliable for me).

I think your original bm_thread_create_join is important and relevant
since only one thread is running.  Scheduling hundreds or thousands of
threads, however, becomes highly unpredictable with the GVL (GVL
fairness improved greatly in 1.9.3).

And don't worry about not knowing C well.  I only pretended to know C
in the beginning.  After several years, I realized I wasn't pretending :)