On 10/17/05, pauldacus / gmail.com <pauldacus / gmail.com> wrote: > > And so a test: > > (REGEX) a.sort.join(' ').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq > vs. > (INDEX) a.select{|e| a.index(e) != a.rindex(e)}.uniq > > I ran these on an array with 11,000 elements, 2,000 of which were > duplicates, as follows: > > (1..10000).to_a + (2000..3000).to_a > > The scan (REGEX) script finished in 0.21 seconds > The (INDEX) script ran in 28.77 seconds > > The regex was over 100X faster! Thanks for this. I was wondering if my method of benchmarking might be flawed, and it sort of is in this case. But it makes sense here because the scan just has to make one run through the string, whereas each iteration of the select has multiple calls to an index which might potentially traverse the entire array. As my benchmark shows though, a lot of searches in a smaller array is faster using the index. Also another problem with the string scanning is that it depends on the array elements being numbers (at least in the case above.) I decided to benchmark again using your method and adding some more implementations. Here we go (beware of wrapped lines): require 'benchmark' include Benchmark [[[0,1,2,3,4,5,2,3], 50000],[(1..5000).to_a + (2000..2500).to_a, 1]].each do |a, iterations| bm(12) do |bx| bx.report('simonk') { iterations.times { a.select{|e| a.index(e) != a.rindex(e)}.uniq } } bx.report('samk ') { iterations.times { a.select{|e| a.grep(e).length > 1}.uniq } } bx.report('paul ') { iterations.times { a.sort.join(' ').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq } } bx.report('jeff ') { iterations.times { a.uniq.inject(a.dup) { |b,i| b.delete_at(b.index(i)) ; b } } } bx.report('paolo ') { iterations.times { h = {};u = a.inject([]) {|res, x| h[x] ? res + [x] : h[x] = res}.uniq } } bx.report('jegII ') { iterations.times { seen = Hash.new(0);a.select { |e| (seen[e] += 1) > 1 }.uniq } } bx.report('simons') { iterations.times { uary = a.uniq; a.map.delete_if{|i| x=uary.member?(i); uary.delete(i); x} } } bx.report('robert') { iterations.times { a.inject(Hash.new(0)) {|h,e| h[e]+=1;h}.select {|k,v|v>1}.map {|x,y|x} } } bx.report('david ') { iterations.times { a.uniq.find_all {|x| a.find_all {|y| y == x }.size > 1 } } } bx.report('zach ') { iterations.times { t=[]; a.delete_if{ |e| r=(not t.include? e); t.push(e); r }.uniq } } bx.report('markvh') { iterations.times { t=[]; a.select{ |e| r=t.include?e; t<<e; r }.uniq } } bx.report('martin') { iterations.times { h1 = {};h2 = {};a.each {|i| h2[i] = true if h1[i];h1[i] = true };h2.keys } } end end The results: Small array, lots of iterations: user system total real simonk 1.344000 0.000000 1.344000 ( 1.344000) samk 3.859000 0.000000 3.859000 ( 3.876000) paul 3.532000 0.000000 3.532000 ( 3.547000) jeff 1.640000 0.000000 1.640000 ( 1.641000) paolo 2.656000 0.000000 2.656000 ( 2.672000) jegII 1.704000 0.016000 1.720000 ( 1.719000) simons 2.109000 0.000000 2.109000 ( 2.110000) robert 3.172000 0.015000 3.187000 ( 3.203000) david 5.078000 0.000000 5.078000 ( 5.079000) zach 0.328000 0.000000 0.328000 ( 0.328000) markvh 0.391000 0.000000 0.391000 ( 0.391000) martin 0.390000 0.000000 0.390000 ( 0.484000) Big array, one iteration: user system total real simonk 4.125000 0.000000 4.125000 ( 4.453000) samk 17.125000 0.000000 17.125000 ( 17.283000) paul 0.063000 0.000000 0.063000 ( 0.062000) jeff 0.109000 0.000000 0.109000 ( 0.110000) paolo 0.031000 0.000000 0.031000 ( 0.031000) jegII 0.016000 0.000000 0.016000 ( 0.016000) simons 2.219000 0.000000 2.219000 ( 2.265000) robert 0.031000 0.000000 0.031000 ( 0.032000) david 34.188000 0.016000 34.204000 ( 35.127000) zach 2.156000 0.000000 2.156000 ( 2.203000) markvh 0.015000 0.000000 0.015000 ( 0.016000) martin 0.000000 0.000000 0.000000 ( 0.000000) Overall Martin's was the best. As he suspected, David's was quite slow. Sam's as well. But it is interesting to see the varying results for each algorithm based on how they are used. Ryan