On Wed, 19 Oct 2005, Ryan Leavengood wrote:

> Here are the latest results, with Park's solution and Sam's first solution
> added to the mix:

a couple of observations: zach's method destroys the array at each iteration
and is unique in this respect.  to be fair a 'dup' must be added to his
approach.  you arrays are composed in a quite regular fashion which give the
algoritm's that favour linear search a huge advantage - a random distribution
of dups makes the task much harder.  i implemented changes for the above and
now:

     harp:~ > ruby a.rb
                     user     system      total        real
     david       5.330000   0.000000   5.330000 (  5.350960)
     jeff        0.370000   0.000000   0.370000 (  0.373027)
     jegII       0.320000   0.010000   0.330000 (  0.324357)
     markvh      0.610000   0.000000   0.610000 (  0.606635)
     martin      0.300000   0.000000   0.300000 (  0.308640)
     paolo       0.540000   0.000000   0.540000 (  0.537689)
     park        0.400000   0.000000   0.400000 (  0.403531)
     paul        0.660000   0.000000   0.660000 (  0.662007)
     robert      0.640000   0.000000   0.640000 (  0.642736)
     samk1       0.410000   0.000000   0.410000 (  0.418733)
     samk2       4.730000   0.000000   4.730000 (  5.043174)
     simonk      1.060000   0.000000   1.060000 (  1.052031)
     simons      1.150000   0.000000   1.150000 (  1.160381)
     zach        0.640000   0.000000   0.640000 (  0.641282)
                     user     system      total        real
     david      12.600000   0.000000  12.600000 ( 12.637510)
     jeff        0.200000   0.000000   0.200000 (  0.203639)
     jegII       0.010000   0.000000   0.010000 (  0.007662)
     markvh      0.880000   0.000000   0.880000 (  0.872928)
     martin      0.010000   0.000000   0.010000 (  0.007404)
     paolo       0.030000   0.000000   0.030000 (  0.028461)
     park        0.010000   0.000000   0.010000 (  0.009939)
     paul        0.020000   0.000000   0.020000 (  0.019959)
     robert      0.010000   0.010000   0.020000 (  0.015718)
     samk1       0.010000   0.000000   0.010000 (  0.009769)
     samk2      10.050000   0.000000  10.050000 ( 10.088609)
     simonk      1.950000   0.000000   1.950000 (  1.949996)
     simons      1.420000   0.000000   1.420000 (  1.415801)
     zach        0.890000   0.000000   0.890000 (  0.890465)

jegII, martin, and samk1 are two orders of magnitude faster than the delete_if
(destroys 'a') approach!

also, your concept of 'big' doesn't seem that, er, big - running with larger
numbers is also revealing - i started a process with

   harp:~ > ruby a.rb 424 42424  (factor of 10 increase in size)

1 hour ago and it's still not complete so many of the algorithms obviously
scale horribly.  i'll post the results when (if) it finishes.


here's the code:

   harp:~ > cat a.rb
   require 'benchmark'

   def run_benchmark(bx, name, iterations, &block)
     GC.start
     fork{ bx.report(name) { iterations.times &block }; exit! }
     Process::wait
   end

   def dupd_array size, min_percent_dups
     a = Array::new(size).map{ rand size }
     n = ((size * min_percent_dups) / 2).ceil
     n.times{ i = rand(size); a[i] = a[[i + 1, size - 1].min]}
     a
   end

   min = Integer(ARGV.shift || 42)
   max = Integer(ARGV.shift || 4242)
   min_percent_dups = Float(ARGV.shift || 0.15)

   small = dupd_array min, min_percent_dups
   big = dupd_array max, min_percent_dups

   [
     [small, 4242],
     [big, 1],
   ].each do |a, iterations|
     Benchmark.bm(10) do |bx|
       run_benchmark(bx, 'david', iterations) { a.uniq.find_all {|x| a.find_all {|y| y == x }.size > 1 } }
       run_benchmark(bx, 'jeff', iterations) { a.uniq.inject(a.dup) { |b,i| b.delete_at(b.index(i)) ; b } }
       run_benchmark(bx, 'jegII', iterations) { seen = Hash.new(0);a.select { |e| (seen[e] += 1) > 1 }.uniq }
       run_benchmark(bx, 'markvh', iterations) { t=[]; a.select{ |e| r=t.include?e; t<<e; r }.uniq }
       run_benchmark(bx, 'martin', iterations) { h1 = {};h2 = {};a.each {|i| h2[i] = true if h1[i];h1[i] = true };h2.keys }
       run_benchmark(bx, 'paolo', iterations) { h = {};u = a.inject([]) {|res, x| h[x] ? res + [x] : h[x] = res}.uniq }
       run_benchmark(bx, 'park', iterations) { t = Hash.new(0); a.each{|e|t[e]+=1};t.delete_if{|k,v|v==1}.keys }
       run_benchmark(bx, 'paul', iterations) { a.sort.join(' ').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq }
       run_benchmark(bx, 'robert', iterations) { a.inject(Hash.new(0)) {|h,e| h[e]+=1;h}.select {|k,v|v>1}.map {|x,y|x} }
       run_benchmark(bx, 'samk1', iterations) { h = {};a.each do |i| if h[i] then h[i] += 1 else h[i] = 1 end;end;u = []; h.each do |k, v| if v > 1 then u << k end; end; u }
       run_benchmark(bx, 'samk2', iterations) { a.select{|e| a.grep(e).length > 1}.uniq }
       run_benchmark(bx, 'simonk', iterations) { a.select{|e| a.index(e) != a.rindex(e)}.uniq }
       run_benchmark(bx, 'simons', iterations) { uary = a.uniq; a.map.delete_if{|i| x=uary.member?(i); uary.delete(i); x} }
       run_benchmark(bx, 'zach', iterations) { t=[]; a.dup.delete_if{ |e| r=(not t.include? e); t.push(e); r }.uniq }
     end
   end


-a
-- 
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| anything that contradicts experience and logic should be abandoned.
| -- h.h. the 14th dalai lama
===============================================================================