Just a quick Excel job to show who won this horse race: name real martin 0.316044 <==Winner jegII 0.332019 park 0.41347 samk1 0.428502 paolo 0.56615 jeff 0.576666 robert 0.658454 paul 0.681966 markvh 1.479563 zach 1.531747 simons 2.576182 simonk 3.002027 samk2 15.131783 david 17.98847 Probably as instructive of how to do it fast, was how to do it slow. The brute-force one-pass method seemed to blow all else away. Creating a hash or array isn't nearly as expensive as the dup-N-sort, dup-N-uniq or multi-pass approaches which require a search through the array/hash on every iteration. My mediocre 'scan' method was expensive in start-up terms (dup & uniq & sort... yuk!), and just OK on a big array, where start up was minimal. Lesson learned for me: When dealing with big or small arrays, look for a one pass method, putting results into a newly created array/hash. Avoid methods that create new dup's, sort, uniq, or scan/grep or find or find_all. Good job Martin! Ara.T.Howard wrote: > On Wed, 19 Oct 2005, Ryan Leavengood wrote: > > > Here are the latest results, with Park's solution and Sam's first solution > > added to the mix: > > a couple of observations: zach's method destroys the array at each iteration > and is unique in this respect. to be fair a 'dup' must be added to his > approach. you arrays are composed in a quite regular fashion which give the > algoritm's that favour linear search a huge advantage - a random distribution > of dups makes the task much harder. i implemented changes for the above and > now: > > harp:~ > ruby a.rb > user system total real > david 5.330000 0.000000 5.330000 ( 5.350960) > jeff 0.370000 0.000000 0.370000 ( 0.373027) > jegII 0.320000 0.010000 0.330000 ( 0.324357) > markvh 0.610000 0.000000 0.610000 ( 0.606635) > martin 0.300000 0.000000 0.300000 ( 0.308640) > paolo 0.540000 0.000000 0.540000 ( 0.537689) > park 0.400000 0.000000 0.400000 ( 0.403531) > paul 0.660000 0.000000 0.660000 ( 0.662007) > robert 0.640000 0.000000 0.640000 ( 0.642736) > samk1 0.410000 0.000000 0.410000 ( 0.418733) > samk2 4.730000 0.000000 4.730000 ( 5.043174) > simonk 1.060000 0.000000 1.060000 ( 1.052031) > simons 1.150000 0.000000 1.150000 ( 1.160381) > zach 0.640000 0.000000 0.640000 ( 0.641282) > user system total real > david 12.600000 0.000000 12.600000 ( 12.637510) > jeff 0.200000 0.000000 0.200000 ( 0.203639) > jegII 0.010000 0.000000 0.010000 ( 0.007662) > markvh 0.880000 0.000000 0.880000 ( 0.872928) > martin 0.010000 0.000000 0.010000 ( 0.007404) > paolo 0.030000 0.000000 0.030000 ( 0.028461) > park 0.010000 0.000000 0.010000 ( 0.009939) > paul 0.020000 0.000000 0.020000 ( 0.019959) > robert 0.010000 0.010000 0.020000 ( 0.015718) > samk1 0.010000 0.000000 0.010000 ( 0.009769) > samk2 10.050000 0.000000 10.050000 ( 10.088609) > simonk 1.950000 0.000000 1.950000 ( 1.949996) > simons 1.420000 0.000000 1.420000 ( 1.415801) > zach 0.890000 0.000000 0.890000 ( 0.890465) > > jegII, martin, and samk1 are two orders of magnitude faster than the delete_if > (destroys 'a') approach! > > also, your concept of 'big' doesn't seem that, er, big - running with larger > numbers is also revealing - i started a process with > > harp:~ > ruby a.rb 424 42424 (factor of 10 increase in size) > > 1 hour ago and it's still not complete so many of the algorithms obviously > scale horribly. i'll post the results when (if) it finishes. > > > here's the code: > > harp:~ > cat a.rb > require 'benchmark' > > def run_benchmark(bx, name, iterations, &block) > GC.start > fork{ bx.report(name) { iterations.times &block }; exit! } > Process::wait > end > > def dupd_array size, min_percent_dups > a = Array::new(size).map{ rand size } > n = ((size * min_percent_dups) / 2).ceil > n.times{ i = rand(size); a[i] = a[[i + 1, size - 1].min]} > a > end > > min = Integer(ARGV.shift || 42) > max = Integer(ARGV.shift || 4242) > min_percent_dups = Float(ARGV.shift || 0.15) > > small = dupd_array min, min_percent_dups > big = dupd_array max, min_percent_dups > > [ > [small, 4242], > [big, 1], > ].each do |a, iterations| > Benchmark.bm(10) do |bx| > run_benchmark(bx, 'david', iterations) { a.uniq.find_all {|x| a.find_all {|y| y == x }.size > 1 } } > run_benchmark(bx, 'jeff', iterations) { a.uniq.inject(a.dup) { |b,i| b.delete_at(b.index(i)) ; b } } > run_benchmark(bx, 'jegII', iterations) { seen = Hash.new(0);a.select { |e| (seen[e] += 1) > 1 }.uniq } > run_benchmark(bx, 'markvh', iterations) { t=[]; a.select{ |e| r=t.include?e; t<<e; r }.uniq } > run_benchmark(bx, 'martin', iterations) { h1 = {};h2 = {};a.each {|i| h2[i] = true if h1[i];h1[i] = true };h2.keys } > run_benchmark(bx, 'paolo', iterations) { h = {};u = a.inject([]) {|res, x| h[x] ? res + [x] : h[x] = res}.uniq } > run_benchmark(bx, 'park', iterations) { t = Hash.new(0); a.each{|e|t[e]+=1};t.delete_if{|k,v|v==1}.keys } > run_benchmark(bx, 'paul', iterations) { a.sort.join(' ').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq } > run_benchmark(bx, 'robert', iterations) { a.inject(Hash.new(0)) {|h,e| h[e]+=1;h}.select {|k,v|v>1}.map {|x,y|x} } > run_benchmark(bx, 'samk1', iterations) { h = {};a.each do |i| if h[i] then h[i] += 1 else h[i] = 1 end;end;u = []; h.each do |k, v| if v > 1 then u << k end; end; u } > run_benchmark(bx, 'samk2', iterations) { a.select{|e| a.grep(e).length > 1}.uniq } > run_benchmark(bx, 'simonk', iterations) { a.select{|e| a.index(e) != a.rindex(e)}.uniq } > run_benchmark(bx, 'simons', iterations) { uary = a.uniq; a.map.delete_if{|i| x=uary.member?(i); uary.delete(i); x} } > run_benchmark(bx, 'zach', iterations) { t=[]; a.dup.delete_if{ |e| r=(not t.include? e); t.push(e); r }.uniq } > end > end > > > -a > -- > =============================================================================== > | email :: ara [dot] t [dot] howard [at] noaa [dot] gov > | phone :: 303.497.6469 > | anything that contradicts experience and logic should be abandoned. > | -- h.h. the 14th dalai lama > ===============================================================================