Peter,

Wednesday, January 18, 2012, 11:22:49 AM, you wrote:

PV> On Wed, Jan 18, 2012 at 2:57 PM, Robert Klemme
PV> <shortcutter / googlemail.com>wrote:

>> On Tue, Jan 17, 2012 at 12:08 PM, Peter Vandenabeele
>> <peter / vandenabeele.com> wrote:
>> > I assumed (incorrectly turns out for Ruby 1.9.3) that for large data sets
>> > there
>> > would be a significant performance difference. Because, upon determining
>> the
>> > "bins" of uniq values, that is essentially a form of "sorting", which can
>> > be O(n^2)
>> > if not careful.
>>
>> There is no sorting going on.


PV> I beg to differ ...

I agree with you.  Obviously there is sorting going on.

PV> I would assume the only task with an order higher than O(n) is the
PV> "binning".

PV> That is
PV> * finding the hash key for which the value needs to be incremented
PV> * or finding the set to which the entry needs to be added

PV> I would expect those searches to go dramatically faster if the "keys"
PV> are indexed, so the algorithm can find the key in O(log(n)). So there
PV> the fundamental of a sorting algorithm is used (where to insert the
PV> new element, or add it to the bin if it already exists).

I'm looking at the C code for uniq.  I think I understand what it is doing.

The C code makes a call to ary_make_hash.  I can guess what it does.  I am wondering if there is an equivalent Ruby call.

uniq and what we are doing here are closely related.

The basic idea for uniq (in the case where there is no block given) appears to be
 1) Create a hash from an array (i.e., ary_make_hash). I'm guessing that duplicates are ignored by the nature of creating a hash.
 2) For each element of the original array, attempt to delete that key from the created hash.  If you are able to do it, add that key to the uniq array.

Clever. I'm guessing doing it that way is faster than manipulating the output from Hash#to_a.
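
If I translate that into Ruby (this is only my reading of the C, not what MRI actually does; uniq_sketch and the local names are mine), it comes out roughly like:

  def uniq_sketch(ary)
    hash = {}
    ary.each { |e| hash[e] = e }      # stands in for ary_make_hash; later duplicates just overwrite
    result = []
    ary.each do |e|
      # a distinct element is still in the hash the first time we reach it,
      # so it gets appended exactly once, in original order
      result << e if hash.key?(e)
      hash.delete(e)
    end
    result
  end

  uniq_sketch([4, 5, 6, 4, 5, 6, 6, 7])   # => [4, 5, 6, 7]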

PV> But maybe, because I choose a _constant_ and low number of only
PV> 100 bins, this effect does not come out clearly ...

PV> I changed the number of "bins" from 100 to 1_000_000 and for certain
PV> algorithms it has a larger impact than others (but none are dramatic
PV> in Ruby 1.9.3).

PV>   https://gist.github.com/1634618

PV> The initial "Ralph Shneiver" algorithm is least affected.

It's Shnelvar, not Shneiver!   :-)


>> Some tests explicitly use a Hash, others
>> use Enumerable#group_by which also uses a Hash (you can see it from
>> the return).
>>
>> > Questions:
>> > ========
>> >
>> > Why would ruby 1.9.3. be so much better at this than ruby 1.8.7 ?
>> > Could it be because the Hash is now "ordered" so it can do an efficient
>> > algorithm when adding an entry to a bin?
>>
>> No.  Hashes in 1.9.* only maintain insertion order - nothing else.
>>

PV> Do they have a sort of b_tree index on the keys internally ?


>>
>> > Or is it object creation?
>>
>> The difference is more likely to do with the dramatically changed
>> implementation of MRI between 1.8.* and 1.9.*.  Maybe there were also
>> some Hash implementation tweaks but the ordering is certainly not
>> responsible.


PV> OK, understood.

>>
>> > Why are certain methods leading to non-linear behavior?
>>
>> I assume that this might have to do with memory allocation and GC.
>> Choosing n=1 for repetition is too low to yield proper results IMHO
>> since GC for the previous test might kick in etc.
>>




"Ralph Shneiver:" =>> { result = Hash.new(0); ar.each { |x| result[x] += 1 }

PV> <speculation>
PV> Could it be that the reason that the "Ralph Shneiver" method remains
PV> stable and linear in all tests, is that it only assigns max 100 entries in
PV> the
PV> hash and then does nothing more than _increase a number_ (an "in place"
PV> action that does not assign new memory and triggers no GC).
PV> </speculation>

I suspect your speculation is quite correct.
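
A quick-and-dirty way to check it (just a sketch of a measurement I might run, not something from Peter's gist) is to count GC runs around the counting loop; GC.count is available in 1.9:

  ar = Array.new(1_000_000) { rand(100) }   # ~100 distinct values, like the original test

  gc_before = GC.count
  result = Hash.new(0)
  ar.each { |x| result[x] += 1 }            # only re-binds small integers; Fixnums are
                                            # immediate values in MRI, so the loop itself
                                            # allocates essentially nothing on the heap
  gc_after = GC.count
  puts "GC runs during counting: #{gc_after - gc_before}"

The same check run against the other algorithms should show whether they trigger noticeably more collections.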

I am going to paste in stuff from another post for the reader's convenience.  I am quoting AJ

AJ> n = 100_000
AJ> Benchmark.bm(15) do |b|
AJ>   b.report("Ralph Shneiver:") { n.times { a = [4,5,6,4,5,6,6,7]; result = Hash.new(0); a.each { |x| result[x] += 1 }; result } }
AJ>   b.report("Sigurd:")        { n.times { [4,5,6,4,5,6,6,7].inject(Hash.new(0)) { |res, x| res[x] += 1; res } } }
AJ>   b.report("Keinich #1")     { n.times { Hash[a.group_by { |n| n }.map { |k, v| [k, v.size] }] } }
AJ>   b.report("Keinich #2")     { n.times { Hash.new(0).tap { |h| a.each { |n| h[n] += 1 } } } }
AJ>   b.report("Magnus Holm:")   { n.times { [4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1 } } }
AJ>   b.report("Abinoam #1:")    { n.times { Hash[ [4,5,6,4,5,6,6,7].sort.chunk { |n| n }.map { |ix, els| [ix, els.size] } ] } }
AJ> end

"Sigurd" creates a new object, "res", for each iteration of each

"Keinich #1"  appends stuff to a hash table value every time a new value is detected.  This has got to be more work.

"Keinich #2" is quite close to "Shnelvar" and the timings show this.

"Abinoam #1"  seems to be doing a lot of work. At least two sorts.  The first an explicit one and the second an implied one in the creation of a hash table.




>> Btw. with Benchmark.bmbm you can have a warmup phase before the real
>> measurement.
>>

PV> Thanks.


>>
>> I tried to tweak a bit but not much gain (~1% at most):
>> https://gist.github.com/1633008
>>
>>
PV> If I understand correctly, you tested for MRI Ruby 1.9.3 and find
PV> similar results ?

PV> Thanks,

PV> Peter



-- 
Best regards,
 Ralph                            mailto:ralphs / dos32.com