On Tue, Jan 17, 2012 at 12:08 PM, Peter Vandenabeele
<peter / vandenabeele.com> wrote:
> I assumed (incorrectly, it turns out, for Ruby 1.9.3) that for large
> data sets there would be a significant performance difference, because
> determining the "bins" of uniq values is essentially a form of
> "sorting", which can be O(n^2) if one is not careful.

There is no sorting going on.  Some tests explicitly use a Hash; others
use Enumerable#group_by, which also uses a Hash internally (you can see
that from its return value).
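To illustrate the point (a minimal sketch, not from the original
benchmarks): Enumerable#group_by bins elements by the block's return
value and hands back a plain Hash, so no comparison sort is involved.

```ruby
# group_by walks the collection once and appends each element to the
# array stored under its key -- a single O(n) pass over a Hash.
words = %w[apple avocado banana blueberry cherry]
bins  = words.group_by { |w| w[0] }

p bins.class          # => Hash
p bins["a"]           # => ["apple", "avocado"]
```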

> Questions:
> ========
>
> Why would Ruby 1.9.3 be so much better at this than Ruby 1.8.7?
> Could it be because the Hash is now "ordered" so it can do an efficient
> algorithm when adding an entry to a bin?

No.  Hashes in 1.9.* only maintain insertion order - nothing else.
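A quick demonstration of what "ordered" actually means here: keys come
back in the order they were inserted, not in any sorted order.

```ruby
h = {}
h[:banana] = 1
h[:apple]  = 2
h[:cherry] = 3

# Insertion order is preserved on 1.9+; the keys are NOT sorted.
p h.keys  # => [:banana, :apple, :cherry]
```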

> Or is it object creation?

The difference is more likely due to the dramatically changed
implementation of MRI between 1.8.* and 1.9.*.  Maybe there were also
some Hash implementation tweaks, but the ordering is certainly not
responsible.

> Why are certain methods leading to non-linear behavior?

I assume this might have to do with memory allocation and GC.
Choosing n=1 for the repetition count is too low to yield reliable
results IMHO, since GC triggered by the previous test might kick in,
etc.

Btw., with Benchmark.bmbm you get a rehearsal (warmup) phase before the
real measurement.
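For example (a sketch with made-up test data, not the original
benchmark): bmbm runs every report block once in a "Rehearsal" phase,
then runs them again for the numbers you actually read, which reduces
the influence of allocator warmup and leftover garbage.

```ruby
require 'benchmark'

data = Array.new(100_000) { rand(1_000) }

# bmbm prints a Rehearsal section first, then the real measurements.
Benchmark.bmbm do |x|
  x.report("uniq")     { data.uniq }
  x.report("group_by") { data.group_by { |e| e } }
end
```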

I tried tweaking a bit but did not gain much (~1% at most):
https://gist.github.com/1633008

Cheers

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/