On 10/17/05, pauldacus / gmail.com <pauldacus / gmail.com> wrote:
>
> And so a test:
>
> (REGEX) a.sort.join(' ').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq
>    vs.
> (INDEX) a.select{|e| a.index(e) != a.rindex(e)}.uniq
>
> I ran these on an array with 11,000 elements, 2,000 of which were
> duplicates, as follows:
>
>  (1..10000).to_a + (2000..3000).to_a
>
> The scan (REGEX) script finished in 0.21 seconds
> The (INDEX) script ran in 28.77 seconds
>
> The regex was over 100X faster!

Thanks for this. I was wondering if my method of benchmarking might be
flawed, and it sort of is in this case. The result makes sense here:
the scan only has to make one pass through the string, whereas each
iteration of the select makes two calls, index and rindex, each of
which may traverse the entire array, so the select is roughly O(n^2)
overall. As my benchmark shows, though, doing many searches over a
smaller array is still faster with the index approach.
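
The quadratic behaviour is easy to see in isolation. A rough sketch
(the sizes and the 10% duplicate ratio are arbitrary choices of mine):

```ruby
require 'benchmark'

# index scans from the front and rindex from the back, so the select
# does O(n) work per element, O(n^2) overall: doubling the array size
# should roughly quadruple the time.
[2_000, 4_000].each do |n|
  a = (1..n).to_a + (1..n / 10).to_a   # n elements plus 10% duplicates
  dups = nil
  t = Benchmark.realtime do
    dups = a.select { |e| a.index(e) != a.rindex(e) }.uniq
  end
  printf "n = %5d: %d duplicates found in %.3fs\n", n, dups.size, t
end
```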

Another problem with the string scanning is that it depends on the
array elements being numbers (at least as written above).
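
It can also misfire even on numbers, because the backreference can
match a prefix of the next element. A quick sketch (this example is
mine, not from the thread):

```ruby
a = [1, 11]
# sort.join(' ') gives "1 11", and /(\d+)\s(\1)+/ matches the whole
# string because "11" is just "1" repeated, so 1 gets reported as a
# duplicate even though the array contains none.
p a.sort.join(' ').scan(/(\d+)\s(\1)+/).flatten.uniq   # => ["1"]
```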

I decided to benchmark again using your method, adding some more
implementations. Here we go:

require 'benchmark'
include Benchmark

[[[0, 1, 2, 3, 4, 5, 2, 3], 50000],
 [(1..5000).to_a + (2000..2500).to_a, 1]].each do |a, iterations|
  bm(12) do |bx|
    bx.report('simonk') { iterations.times { a.select { |e| a.index(e) != a.rindex(e) }.uniq } }
    bx.report('samk  ') { iterations.times { a.select { |e| a.grep(e).length > 1 }.uniq } }
    bx.report('paul  ') { iterations.times { a.sort.join(' ').to_s.scan(/(\d+)\s(\1)+/).flatten.uniq } }
    bx.report('jeff  ') { iterations.times { a.uniq.inject(a.dup) { |b, i| b.delete_at(b.index(i)); b } } }
    bx.report('paolo ') { iterations.times { h = {}; u = a.inject([]) { |res, x| h[x] ? res + [x] : h[x] = res }.uniq } }
    bx.report('jegII ') { iterations.times { seen = Hash.new(0); a.select { |e| (seen[e] += 1) > 1 }.uniq } }
    bx.report('simons') { iterations.times { uary = a.uniq; a.map.delete_if { |i| x = uary.member?(i); uary.delete(i); x } } }
    bx.report('robert') { iterations.times { a.inject(Hash.new(0)) { |h, e| h[e] += 1; h }.select { |k, v| v > 1 }.map { |x, y| x } } }
    bx.report('david ') { iterations.times { a.uniq.find_all { |x| a.find_all { |y| y == x }.size > 1 } } }
    # NB: delete_if mutates a in place, which affects the reports below it
    bx.report('zach  ') { iterations.times { t = []; a.delete_if { |e| r = (not t.include?(e)); t.push(e); r }.uniq } }
    bx.report('markvh') { iterations.times { t = []; a.select { |e| r = t.include?(e); t << e; r }.uniq } }
    bx.report('martin') { iterations.times { h1 = {}; h2 = {}; a.each { |i| h2[i] = true if h1[i]; h1[i] = true }; h2.keys } }
  end
end

The results:

Small array, lots of iterations:
                  user     system      total        real
simonk        1.344000   0.000000   1.344000 (  1.344000)
samk          3.859000   0.000000   3.859000 (  3.876000)
paul          3.532000   0.000000   3.532000 (  3.547000)
jeff          1.640000   0.000000   1.640000 (  1.641000)
paolo         2.656000   0.000000   2.656000 (  2.672000)
jegII         1.704000   0.016000   1.720000 (  1.719000)
simons        2.109000   0.000000   2.109000 (  2.110000)
robert        3.172000   0.015000   3.187000 (  3.203000)
david         5.078000   0.000000   5.078000 (  5.079000)
zach          0.328000   0.000000   0.328000 (  0.328000)
markvh        0.391000   0.000000   0.391000 (  0.391000)
martin        0.390000   0.000000   0.390000 (  0.484000)

Big array, one iteration:
                  user     system      total        real
simonk        4.125000   0.000000   4.125000 (  4.453000)
samk         17.125000   0.000000  17.125000 ( 17.283000)
paul          0.063000   0.000000   0.063000 (  0.062000)
jeff          0.109000   0.000000   0.109000 (  0.110000)
paolo         0.031000   0.000000   0.031000 (  0.031000)
jegII         0.016000   0.000000   0.016000 (  0.016000)
simons        2.219000   0.000000   2.219000 (  2.265000)
robert        0.031000   0.000000   0.031000 (  0.032000)
david        34.188000   0.016000  34.204000 ( 35.127000)
zach          2.156000   0.000000   2.156000 (  2.203000)
markvh        0.015000   0.000000   0.015000 (  0.016000)
martin        0.000000   0.000000   0.000000 (  0.000000)

Overall, Martin's was the best. As he suspected, David's was quite
slow, and Sam's was as well.
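
For what it's worth, Martin's two-hash approach reads nicely when
pulled out into a method. A sketch (the name "duplicates" is mine):

```ruby
# Return the elements of ary that appear more than once.
# One pass, with constant-time hash lookups, so O(n) overall,
# which is why it wins on the big array.
def duplicates(ary)
  seen = {}   # elements encountered at least once
  dups = {}   # elements encountered at least twice
  ary.each do |e|
    dups[e] = true if seen[e]
    seen[e] = true
  end
  dups.keys
end

p duplicates([0, 1, 2, 3, 4, 5, 2, 3]).sort   # => [2, 3]
```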

But it is interesting to see the varying results for each algorithm
based on how they are used.

Ryan