pat eyler wrote:
> On Fri, Jan 9, 2009 at 6:20 PM, Clifford Heath <no / spam.please.net> wrote:
>> Robert Klemme wrote:
>>> I would have guessed that gsub! is fastest
>> It still might be - the benchmark doesn't run long enough to
>> compare the GC overhead of making dozens of little strings
>> that get used once each.
> Is this better?

No. Before I elaborate: I'm not saying I don't believe the result.
I'm saying your benchmark figures won't accurately represent the
cost in a long-running application.

All versions create an initial million strings - perhaps 40 MB.
Your computer has what, 2GB of memory? At what point does the GC run?

The gsub version creates a million extra strings.
The gsub! version creates perhaps no extra strings, perhaps a million.
The split version creates *six* million extra strings (one per word,
plus one from join).
The squeeze version creates two million (one from squeeze, one from strip).
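
For reference, the four variants presumably look something like
this (a sketch; the exact source string and regexps from the
original benchmark are assumptions on my part):

  s = "a  b\t\tc   d"       # a string with runs of white-space
  s.gsub(/\s+/, ' ')        # allocates a new result string every call
  s.dup.gsub!(/\s+/, ' ')   # mutates the dup in place; returns nil if nothing matched
  s.split.join(' ')         # one new string per word, plus one from join
  s.squeeze(' ').strip      # squeeze allocates one string, strip copies it again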

Now suppose the string in your real-life application is an HTML
document with a thousand white-space runs: over a million iterations,
how many extra strings do the respective versions create?
split makes a *billion*.

A benchmark must account for the fact that the code being tested
will run in the context of an application where many other objects
are created by all the *other* code - perhaps a thousand times as
many objects. At some point the garbage collector is likely to run.
That takes time, and that time should be part of the benchmark.

Try squeezing said HTML document a million times, and run the GC
inside each benchmark timer (after the n.times loop). Then I'll
be happy ;-).
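
Concretely, a minimal sketch of what I mean (the document and the
iteration count here are placeholders, not the original code):

  require 'benchmark'

  # Stand-in for an HTML document with ~a thousand white-space runs.
  doc = "<td>  cell\t\tvalue  </td>\n" * 1_000
  n   = 1_000   # raise toward a million once the timings stabilise

  Benchmark.bm(8) do |bm|
    bm.report('gsub')    { n.times { doc.gsub(/\s+/, ' ') };      GC.start }
    bm.report('gsub!')   { n.times { doc.dup.gsub!(/\s+/, ' ') }; GC.start }
    bm.report('split')   { n.times { doc.split.join(' ') };      GC.start }
    bm.report('squeeze') { n.times { doc.squeeze(' ').strip };   GC.start }
  end

With GC.start inside each timed block, the cost of collecting each
version's garbage is charged to the version that created it.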

Clifford Heath.