On Wed, Jan 18, 2012 at 2:57 PM, Robert Klemme
<shortcutter / googlemail.com> wrote:

> On Tue, Jan 17, 2012 at 12:08 PM, Peter Vandenabeele
> <peter / vandenabeele.com> wrote:
> > I assumed (incorrectly, it turns out, for Ruby 1.9.3) that for large
> > data sets there would be a significant performance difference.
> > Because, upon determining the "bins" of uniq values, that is
> > essentially a form of "sorting", which can be O(n^2) if not careful.
>
> There is no sorting going on.


I beg to differ ...

I would assume the only task with an order higher than O(n) is the
"binning".

That is
* finding the hash key for which the value needs to be incremented
* or finding the set to which the entry needs to be added

I would expect those searches to go dramatically faster if the "keys"
are indexed, so the algorithm can find a key in O(log(n)). So the
fundamentals of a sorting algorithm are used there (deciding where to
insert the new element, or adding to the bin if it already exists).
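
For clarity, a minimal sketch of the two variants I mean (the data set
here is made up, not the one from the gist):

  require 'set'

  ar = Array.new(1_000_000) { rand(100) }   # hypothetical sample data

  # variant 1: find the hash key and increment its counter
  counts = Hash.new(0)
  ar.each { |x| counts[x] += 1 }

  # variant 2: find the set for the bin and add the entry to it
  bins = Hash.new { |h, k| h[k] = Set.new }
  ar.each_with_index { |x, i| bins[x] << i }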

But maybe, because I chose a _constant_ and low number of only
100 bins, this effect does not come out clearly ...

I changed the number of "bins" from 100 to 1_000_000 and for certain
algorithms it has a larger impact than for others (but none are
dramatic in Ruby 1.9.3).

  https://gist.github.com/1634618

The initial "Ralph Shneiver" algorithm is least affected.


> Some tests explicitly use a Hash, others
> use Enumerable#group_by which also uses a Hash (you can see it from
> the return).
>
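
Indeed, a quick check shows the Hash in the return value:

  [1, 1, 2, 3].group_by { |x| x }
  #=> {1=>[1, 1], 2=>[2], 3=>[3]}
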
> > Questions:
> > ========
> >
> > Why would ruby 1.9.3 be so much better at this than ruby 1.8.7?
> > Could it be because the Hash is now "ordered" so it can do an efficient
> > algorithm when adding an entry to a bin?
>
> No.  Hashes in 1.9.* only maintain insertion order - nothing else.
>

Do they have some sort of B-tree index on the keys internally?


>
> > Or is it object creation?
>
> The difference is more likely to do with the dramatically changed
> implementation of MRI between 1.8.* and 1.9.*.  Maybe there were also
> some Hash implementation tweaks but the ordering is certainly not
> responsible.


OK, understood.

>
> > Why are certain methods leading to non-linear behavior?
>
> I assume that this might have to do with memory allocation and GC.
> Choosing n=1 for repetition is too low to yield proper results IMHO
> since GC for the previous test might kick in etc.
>


"Ralph Shneiver:" => { result = Hash.new(0); ar.each { |x| result[x] += 1 }

<speculation>
Could it be that the reason the "Ralph Shneiver" method remains
stable and linear in all tests is that it only assigns at most 100
entries in the hash, and then does nothing more than _increase a
number_ (an "in place" action that does not allocate new memory and
so triggers no GC)?
</speculation>
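
A sketch of the contrast I have in mind (using group_by as the
allocating counterpart is my assumption, I did not isolate it in the
benchmark):

  ar = Array.new(1_000_000) { rand(100) }

  # in place: once the ~100 keys exist, each step only bumps an Integer
  counts = Hash.new(0)
  ar.each { |x| counts[x] += 1 }

  # allocating: group_by stores every element in a per-key Array, so
  # those arrays keep growing (and reallocating), giving GC more to do
  groups = ar.group_by { |x| x }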



> Btw. with Benchmark.bmbm you can have a warmup phase before the real
> measurement.
>

Thanks.
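
For the record, a minimal bmbm sketch as I understand it (labels and
data made up):

  require 'benchmark'

  ar = Array.new(1_000_000) { rand(100) }

  # bmbm runs a rehearsal pass first and reports a second "real" pass,
  # so warmup effects (allocator, GC) skew the numbers less
  Benchmark.bmbm do |x|
    x.report("hash counter") do
      result = Hash.new(0)
      ar.each { |v| result[v] += 1 }
    end
    x.report("group_by") { ar.group_by { |v| v } }
  end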


>
> I tried to tweak a bit but not much gain (~1% at most):
> https://gist.github.com/1633008
>
>
If I understand correctly, you tested with MRI Ruby 1.9.3 and found
similar results?

Thanks,

Peter