Hmm interesting.
So I was looking at it from a single-threaded perspective, so I
obviously missed some subtle implications.

If I understand correctly, the problems are:
1) If a thread with a large stack leaves it "full of garbage", then this
garbage will be copied into the stack of a thread with a small stack
after a context switch, if that stack grows.
2) If a single thread creates a very "dirty" stack and then goes into a
deeply nested call [ex: going to sleep forever within a very nested
call], it will not free the invalid references until it comes out of
that deep stack later.

I suppose we can operate under the assumption that when the program
starts, the extent of the stack is "clear" of bad references.

A few tricks up our sleeve:
We can do a stack cleaning around the time of a context switch:
We can clear the difference in size between the stacks after each
context switch.
We could clear that difference PLUS re-clear the "cleared once" area
below the stack, after each context switch.

Or perhaps do the "clear at most once" trick only if rb_thread_alone,
though I think the above would already do that.

So anyway we could basically reset the "already cleared" markers once
per context switch, instead of once per GC, and re-clear that stack's
damage.  Would that help?
In reality I'm not sure if these would be necessary.  How can we tell
how much is necessary?

Old notes:

So let's then keep two values, per thread.  One being the top of a
"clean section", the other the bottom of the "clean section" [the
already swept section].

Make this "clean section" grow whenever possible [check it every
CHECK_INT, if you're above it, grow it, if you're below it, reset it to
start below you, etc.].  So we keep track of, per thread, a growing
cleaned area.
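
Here is a rough sketch of that bookkeeping, just to make the idea
concrete.  It is a toy model, not interpreter code: the stack is an
array where a higher index means deeper, and sim_thread,
clean_section_check and CLEAR_CHUNK are all names I made up for
illustration.  It also assumes the stack did not run deeper than sp and
come back between two checks, which the real thing could not assume.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_SLOTS 64
#define CLEAR_CHUNK 8   /* hypothetical cap on slots cleared per check */

/* Toy model of one thread's stack: slots[0..sp) are live frames,
   slots[sp..STACK_SLOTS) are dead space that may hold stale refs. */
typedef struct {
    unsigned long slots[STACK_SLOTS];
    size_t sp;        /* current depth (first unused slot) */
    size_t clean_lo;  /* start of the already-cleared dead region */
    size_t clean_hi;  /* end (exclusive) of that region */
} sim_thread;

static void clear_range(sim_thread *t, size_t lo, size_t hi)
{
    assert(lo <= hi && hi <= STACK_SLOTS);
    memset(&t->slots[lo], 0, (hi - lo) * sizeof t->slots[0]);
}

/* Called at each periodic check (standing in for CHECK_INTS). */
static void clean_section_check(sim_thread *t)
{
    if (t->sp > t->clean_lo) {
        /* Stack grew into the clean section: restart it just past sp. */
        t->clean_lo = t->clean_hi = t->sp;
    } else if (t->sp < t->clean_lo) {
        /* Stack shrank: the newly dead slots are dirty, so clear them. */
        clear_range(t, t->sp, t->clean_lo);
        t->clean_lo = t->sp;
    }
    /* Grow the far end of the section by a bounded amount per check. */
    size_t hi = t->clean_hi + CLEAR_CHUNK;
    if (hi > STACK_SLOTS)
        hi = STACK_SLOTS;
    clear_range(t, t->clean_hi, hi);
    t->clean_hi = hi;
}
```

The chunk limit is only there so that no single check pays for clearing
the whole dead area; whether that matters in practice is exactly the
kind of thing the benchmarks would have to tell us.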

Now when you context switch, if you switch from a large stack to a
shorter stack, clean the difference, plus the "dirty but clean now"
section--clean it again.  Reset the pointers.
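
The context switch step might look something like the sketch below.
Again a toy model under my own assumptions: one shared machine stack
that each thread's live frames get copied in and out of, with a higher
index meaning deeper.  sim_stack, sim_thread and sim_switch are made-up
names, and it only knows the outgoing thread's depth at switch time,
not how deep it ever went.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_SLOTS 64

/* A thread that is switched out keeps a copy of its live frames. */
typedef struct {
    unsigned long saved[STACK_SLOTS];
    size_t depth;                 /* number of live slots */
} sim_thread;

/* The one machine stack shared by all threads in this model. */
typedef struct {
    unsigned long slots[STACK_SLOTS];
    size_t clean_lo, clean_hi;    /* cleared dead region [lo, hi) */
} sim_stack;

static void sim_switch(sim_stack *m, sim_thread *from, sim_thread *to)
{
    assert(from->depth <= STACK_SLOTS && to->depth <= STACK_SLOTS);
    assert(m->clean_hi <= STACK_SLOTS);

    /* Save the outgoing thread's frames, restore the incoming ones. */
    memcpy(from->saved, m->slots, from->depth * sizeof m->slots[0]);
    memcpy(m->slots, to->saved, to->depth * sizeof m->slots[0]);

    /* Wipe from the new top of stack out to whichever is deeper: the
       outgoing thread's frames or the old clean section.  This covers
       the size difference and re-clears the clean section too. */
    size_t end = from->depth > m->clean_hi ? from->depth : m->clean_hi;
    if (end < to->depth)
        end = to->depth;
    memset(&m->slots[to->depth], 0,
           (end - to->depth) * sizeof m->slots[0]);

    /* Reset the pointers to describe what was just cleared. */
    m->clean_lo = to->depth;
    m->clean_hi = end;
}
```

When switching from a shallow thread to a deep one there is no size
difference to clear, so the only extra work is re-clearing whatever is
left of the old clean section beyond the incoming thread's depth.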

I guess just try it out :)  Or I might get around to it eventually.

Comments inline:

> My bogus2 benchmark switches between one thread having a very deep stack and
> another with a shallow stack.  It's the worst conceivable case of stack
> thrashing.  It runs about 15% faster if I disable only the clearing of the
> stack.

I wonder if that's what causes the micro-benchmark slowdowns [what are
they, like 5%?]  What about disabling the depth checker, too?  What's
its impact?

> When whatever transient ghost references remain, change value, GC will
> eventually collect the objects to which they referred.  Correct?

Yeah

> 2)  GC is not triggered by any thread's particular activities.  It may be
> that a given thread, whose stack has become full of ghost references due to
> deferred stack clearing, stops running for long periods of time.  Or, that a
> such a thread just never happens to be running when a GC is triggered.

True, if a thread "doesn't run at all" between GCs then it won't clear
its stack until...it runs again at some point :)
A thread basically gets a window of 1 GC to create as much trash as it
wants, and, if it ceases running, retains that much trash.

-=r