I've repeated all the tests with Ruby 1.8.0 and it sure looks like
String#strip is the problem.
To summarize, I have a bare REXML stream parser that reads its input
file and does nothing. I'm handing it an input file that looks like this
<?xml version="1.0" ?>
<multistatus>
<response>
<dsref>0</dsref>
</response>
<response>
<dsref>1</dsref>
</response>
...
</multistatus>
I'm varying the number of <response> elements in the file from 100 to
50000 and calling that n. In Ruby 1.8.0, the runtime is roughly
proportional to n, as it should be. In 1.8.1, the total runtime is
greater (which is bad but not pathological) and worse than linear in n
(which is pathological).
Here's the profile data for strip over all values of n:
time/call % total runtime
min max min max
Ruby 1.8.0 .01 .02 .82 2.48
Ruby 1.8.1 .10 1.21 5.08 29.04
In both cases, the number of calls is proportional to n, although strip
is called 2.5 times more often in 1.8.0 than 1.8.1. In 1.8.1 (only),
there is a strong increasing trend in the time per call as n increases.
I don't suspect the problem is in strip itself. I suspect the problem is
that REXML creates a string that grows as more input is read and hands
it repeatedly to strip. I looked briefly at the REXML code and didn't
see anything obvious. (Other than the fact that calling strip on a
buffer just to see whether it has any tokens in it is inefficient. That
just makes it slow, however, not nonlinear--unless the buffer is growing.)
I hope this is enough for someone to diagnose and fix the problem. I'll
send the data to anyone who wants it.
Steve