Just a quick Excel job to show who won this horse race:

name	real (seconds)
martin	0.316044  <==Winner
jegII	0.332019
park	0.41347
samk1	0.428502
paolo	0.56615
jeff	0.576666
robert	0.658454
paul	0.681966
markvh	1.479563
zach	1.531747
simons	2.576182
simonk	3.002027
samk2	15.131783
david	17.98847


Probably as instructive as how to do it fast was how to do it slow.
The brute-force one-pass method seemed to blow everything else away.
Creating a hash or array up front isn't nearly as expensive as the
dup-N-sort, dup-N-uniq, or multi-pass approaches, which require a
search through the array/hash on every iteration.  My mediocre 'scan'
method was expensive in start-up terms (dup & uniq & sort... yuk!),
and only OK on a big array, where the start-up cost was proportionally
minimal.
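To illustrate why the linear-search entries lose, here is the pattern
the markvh entry in the benchmark below uses, run on a small made-up
array (the benchmark itself uses rand-generated arrays):

```ruby
# Hypothetical small input for illustration.
a = [1, 2, 3, 2, 4, 1, 5]

# t.include? does a linear search through everything seen so far on
# every iteration, so the whole scan is O(n^2).
t = []
slow = a.select { |e| seen = t.include?(e); t << e; seen }.uniq
p slow  # => [2, 1]
```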

Lesson learned for me:  Whether the array is big or small, look for a
one-pass method that puts its results into a newly created array/hash.
Avoid methods that dup, sort, uniq, scan/grep, find, or find_all, since
each of those re-walks the data.
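The winning entry follows exactly that shape.  Here is Martin's method
from the benchmark below, pulled out as a standalone sketch: one pass,
with h1 recording elements already seen and h2 recording elements seen
more than once.

```ruby
# Martin's one-pass duplicate finder from the benchmark below.
def dups(a)
  h1 = {}  # every element seen so far
  h2 = {}  # elements seen at least twice
  a.each { |i| h2[i] = true if h1[i]; h1[i] = true }
  h2.keys
end

p dups([1, 2, 3, 2, 4, 1, 5])  # => [2, 1]
```

Both hash lookups are O(1), so the whole thing is a single O(n) pass
with no searching, sorting, or duplication of the input.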

Good job Martin!

Ara.T.Howard wrote:
> On Wed, 19 Oct 2005, Ryan Leavengood wrote:
>
> > Here are the latest results, with Park's solution and Sam's first solution
> > added to the mix:
>
> a couple of observations: zach's method destroys the array at each iteration
> and is unique in this respect.  to be fair a 'dup' must be added to his
> approach.  your arrays are composed in quite a regular fashion, which gives
> the algorithms that favour linear search a huge advantage - a random
> distribution of dups makes the task much harder.  i implemented changes for
> the above and
> now:
>
>      harp:~ > ruby a.rb
>                      user     system      total        real
>      david       5.330000   0.000000   5.330000 (  5.350960)
>      jeff        0.370000   0.000000   0.370000 (  0.373027)
>      jegII       0.320000   0.010000   0.330000 (  0.324357)
>      markvh      0.610000   0.000000   0.610000 (  0.606635)
>      martin      0.300000   0.000000   0.300000 (  0.308640)
>      paolo       0.540000   0.000000   0.540000 (  0.537689)
>      park        0.400000   0.000000   0.400000 (  0.403531)
>      paul        0.660000   0.000000   0.660000 (  0.662007)
>      robert      0.640000   0.000000   0.640000 (  0.642736)
>      samk1       0.410000   0.000000   0.410000 (  0.418733)
>      samk2       4.730000   0.000000   4.730000 (  5.043174)
>      simonk      1.060000   0.000000   1.060000 (  1.052031)
>      simons      1.150000   0.000000   1.150000 (  1.160381)
>      zach        0.640000   0.000000   0.640000 (  0.641282)
>                      user     system      total        real
>      david      12.600000   0.000000  12.600000 ( 12.637510)
>      jeff        0.200000   0.000000   0.200000 (  0.203639)
>      jegII       0.010000   0.000000   0.010000 (  0.007662)
>      markvh      0.880000   0.000000   0.880000 (  0.872928)
>      martin      0.010000   0.000000   0.010000 (  0.007404)
>      paolo       0.030000   0.000000   0.030000 (  0.028461)
>      park        0.010000   0.000000   0.010000 (  0.009939)
>      paul        0.020000   0.000000   0.020000 (  0.019959)
>      robert      0.010000   0.010000   0.020000 (  0.015718)
>      samk1       0.010000   0.000000   0.010000 (  0.009769)
>      samk2      10.050000   0.000000  10.050000 ( 10.088609)
>      simonk      1.950000   0.000000   1.950000 (  1.949996)
>      simons      1.420000   0.000000   1.420000 (  1.415801)
>      zach        0.890000   0.000000   0.890000 (  0.890465)
>
> jegII, martin, and samk1 are two orders of magnitude faster than the delete_if
> (destroys 'a') approach!
>
> also, your concept of 'big' doesn't seem that, er, big - running with larger
> numbers is also revealing - i started a process with
>
>    harp:~ > ruby a.rb 424 42424  (factor of 10 increase in size)
>
> 1 hour ago and it's still not complete so many of the algorithms obviously
> scale horribly.  i'll post the results when (if) it finishes.
>
>
> here's the code:
>
>    harp:~ > cat a.rb
>    require 'benchmark'
>
>    def run_benchmark(bx, name, iterations, &block)
>      GC.start
>      fork{ bx.report(name) { iterations.times &block }; exit! }
>      Process::wait
>    end
>
>    def dupd_array size, min_percent_dups
>      a = Array::new(size).map{ rand size }
>      n = ((size * min_percent_dups) / 2).ceil
>      n.times{ i = rand(size); a[i] = a[[i + 1, size - 1].min]}
>      a
>    end
>
>    min = Integer(ARGV.shift || 42)
>    max = Integer(ARGV.shift || 4242)
>    min_percent_dups = Float(ARGV.shift || 0.15)
>
>    small = dupd_array min, min_percent_dups
>    big = dupd_array max, min_percent_dups
>
>    [
>      [small, 4242],
>      [big, 1],
>    ].each do |a, iterations|
>      Benchmark.bm(10) do |bx|
>        run_benchmark(bx, 'david', iterations) { a.uniq.find_all {|x| a.find_all {|y| y == x }.size > 1 } }
>        run_benchmark(bx, 'jeff', iterations) { a.uniq.inject(a.dup) { |b,i| b.delete_at(b.index(i)) ; b } }
>        run_benchmark(bx, 'jegII', iterations) { seen = Hash.new(0);a.select { |e| (seen[e] += 1) > 1 }.uniq }
>        run_benchmark(bx, 'markvh', iterations) { t=[]; a.select{ |e| r=t.include?e; t<<e; r }.uniq }
>        run_benchmark(bx, 'martin', iterations) { h1 = {};h2 = {};a.each {|i| h2[i] = true if h1[i];h1[i] = true };h2.keys }
>        run_benchmark(bx, 'paolo', iterations) { h = {};u = a.inject([]) {|res, x| h[x] ? res + [x] : h[x] = res}.uniq }
>        run_benchmark(bx, 'park', iterations) { t = Hash.new(0); a.each{|e|t[e]+=1};t.delete_if{|k,v|v==1}.keys }
>        run_benchmark(bx, 'paul', iterations) { a.sort.join(' ').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq }
>        run_benchmark(bx, 'robert', iterations) { a.inject(Hash.new(0)) {|h,e| h[e]+=1;h}.select {|k,v|v>1}.map {|x,y|x} }
>        run_benchmark(bx, 'samk1', iterations) { h = {};a.each do |i| if h[i] then h[i] += 1 else h[i] = 1 end;end;u = []; h.each do |k, v| if v > 1 then u << k end; end; u }
>        run_benchmark(bx, 'samk2', iterations) { a.select{|e| a.grep(e).length > 1}.uniq }
>        run_benchmark(bx, 'simonk', iterations) { a.select{|e| a.index(e) != a.rindex(e)}.uniq }
>        run_benchmark(bx, 'simons', iterations) { uary = a.uniq; a.map.delete_if{|i| x=uary.member?(i); uary.delete(i); x} }
>        run_benchmark(bx, 'zach', iterations) { t=[]; a.dup.delete_if{ |e| r=(not t.include? e); t.push(e); r }.uniq }
>      end
>    end
>
>
> -a
> --
> ===============================================================================
> | email :: ara [dot] t [dot] howard [at] noaa [dot] gov
> | phone :: 303.497.6469
> | anything that contradicts experience and logic should be abandoned.
> | -- h.h. the 14th dalai lama
> ===============================================================================