I'm unsure about this.  I _hate_ the extra branches this adds,
and most of our benchmarks don't show an improvement.  But this
seems like an obvious experiment, so maybe somebody else would've
tried it had I not at least published it here.


Atomic operations are expensive, so use thread-local counters and
only perform atomic operations when the local counters hit a
predefined limit (currently 16K).
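
A minimal sketch of the idea, not the code from the patch: all
identifiers below are hypothetical, C11 <stdatomic.h> and _Thread_local
stand in for whatever primitives gc.c really uses, and the 16K limit is
assumed to be a byte count.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names; not the identifiers used in the actual patch. */
#define LOCAL_FLUSH_LIMIT ((size_t)16 * 1024) /* assumed: "16K" means bytes */

/* Shared total, the value a GC trigger would look at. */
static atomic_size_t global_malloc_increase;

/* Per-thread running total; updating it needs no atomic operation. */
static _Thread_local size_t local_malloc_increase;

/* Record an allocation of `size` bytes made by the calling thread. */
static void
count_malloc(size_t size)
{
    local_malloc_increase += size;

    /* This is the extra branch: flush only once the limit is reached. */
    if (local_malloc_increase >= LOCAL_FLUSH_LIMIT) {
        atomic_fetch_add_explicit(&global_malloc_increase,
                                  local_malloc_increase,
                                  memory_order_relaxed);
        local_malloc_increase = 0;
    }
}

/* Read the shared total; unflushed per-thread amounts are not included. */
static size_t
global_count(void)
{
    return atomic_load_explicit(&global_malloc_increase,
                                memory_order_relaxed);
}
```

In this sketch a thread doing many small allocations touches the
shared counter only once per 16K bytes or so instead of once per
malloc.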

This gives a ~12% speedup to the bm_so_count_words.rb benchmark,
which does many small mallocs.  This pattern is common in some Ruby
scripts doing text processing, so maybe it is worth doing.

Unfortunately, this adds more branches, increases code size, and
hurts accuracy of GC accounting in multithreaded programs.  Some
benchmarks are slower as a result.
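
To put a rough number on the accuracy cost: assuming each thread may
hold back just under the limit before it flushes, and ignoring
anything else the real patch does (frees, thread exit), the shared
counter can lag by up to threads * (limit - 1).  The helper below is
purely illustrative and its name is hypothetical.

```c
#include <stddef.h>

/*
 * Worst-case amount missing from the shared counter when every thread
 * is one unit short of its flush limit.  Assumes limit >= 1.
 */
static size_t
max_undercount(size_t threads, size_t limit)
{
    return threads * (limit - 1);
}
```

With 8 threads and a 16K limit this comes to 131064, i.e. just under
128K that the GC accounting cannot see at that moment.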

Full benchmark results are included in the complete patch:

http://bogomips.org/ruby.git/patch?id=8271ec7b977
	git://80x24.org/ruby.git gc-lessatomic