
On Tue, Jan 17, 2012 at 2:44 AM, Abinoam Jr. <abinoam / gmail.com> wrote:

> On Mon, Jan 16, 2012 at 10:05 PM, Abinoam Jr. <abinoam / gmail.com> wrote:
> >                     user     system      total        real
> > Ralph Shneiver:   0.290000   0.000000   0.290000 (  0.259640)
> > Sigurd:          0.320000   0.000000   0.320000 (  0.289873)
> > Keinich #1       0.560000   0.000000   0.560000 (  0.497736)
> > Keinich #2       0.280000   0.000000   0.280000 (  0.250843)
> > Magnus Holm:     0.310000   0.000000   0.310000 (  0.283344)
> > Abinoam #1:      1.140000   0.000000   1.140000 (  1.042744)
> >
> > Abinoam Jr.
> >
>
> Sorry for the mess... (some error in the bench code + horrible text
> wrapping)
>
> Apologizing with the gist...
> https://gist.github.com/1624016
>


Thanks. Very interesting.

I assumed (incorrectly, as it turns out for Ruby 1.9.3) that for large data sets
there would be a significant performance difference between the solutions.
Determining the "bins" of unique values is essentially a form of "sorting",
which can degrade to O(n^2) if one is not careful.
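
As a side note (a minimal sketch, not one of the benchmarked solutions), the
quadratic danger I had in mind looks like this: one full scan of the array per
unique value, versus the single pass over the array that the hash-based
solutions below use.

ar = [1, 2, 3, 4, 5, 6, 4, 5, 6, 6, 7]

# One full scan per unique value => O(n * u), which approaches O(n^2)
# when most values are distinct.
naive = Hash[ar.uniq.map { |v| [v, ar.count(v)] }]

# Single pass with a default-0 hash => O(n).
fast = Hash.new(0)
ar.each { |x| fast[x] += 1 }

naive == fast  # => true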

Turns out I was wrong for ruby 1.9.3 (but right for some other rubies).

I rewrote the code to create large datasets (arrays of up to 10_000_000
elements), but with the values drawn from the range 0...100.

require 'benchmark'

n = 1  # number of benchmark iterations
#ar = [1,2,3,4,5,6,4,5,6,6,7]
ar = [].tap{|a| 10_000_000.times {a << rand(100)}} # also 1_000_000, 2_000_000, ...
puts "SAMPLE of ar"
puts ar[0...20]
puts "SIZE"
puts ar.size
Benchmark.bm(15) do |b|
  b.report("Ralph Shneiver:"){ n.times { result = Hash.new(0); ar.each { |x| result[x] += 1 }; result } }
  b.report("Sigurd:")        { n.times { ar.inject(Hash.new(0)) {|res, x| res[x] += 1; res } } }
  b.report("Keinich #1")     { n.times { Hash[ar.group_by{|n| n}.map{|k,v| [k, v.size]}] } }
  b.report("Keinich #2")     { n.times { Hash.new(0).tap{|h| ar.each{|n| h[n] += 1}} } }
  b.report("Magnus Holm:")   { n.times { ar.each_with_object(Hash.new(0)) { |x, res| res[x] += 1 } } }
  b.report("Abinoam #1:")    { n.times { Hash[ar.sort.chunk {|n| n}.map {|ix, els| [ix, els.size] } ] } }
end



RUBY 1.9.3:
For ruby 1.9.3-p0 all solutions perform approximately the same (at least the
same order of magnitude) and seem to stay quasi-linear.

SIZE
1000000  # 1_000_000 that is; n = 1
                      user     system      total        real
Ralph Shneiver:   0.180000   0.000000   0.180000 (  0.174283)
Sigurd:           0.200000   0.000000   0.200000 (  0.203652)
Keinich #1        0.140000   0.000000   0.140000 (  0.142833)
Keinich #2        0.180000   0.000000   0.180000 (  0.177456)
Magnus Holm:      0.200000   0.000000   0.200000 (  0.205895)
Abinoam #1:       0.260000   0.000000   0.260000 (  0.254554)

SIZE
2000000  # 2_000_000 that is; n = 1
                      user     system      total        real
Ralph Shneiver:   0.340000   0.010000   0.350000 (  0.350032)
Sigurd:           0.410000   0.000000   0.410000 (  0.406483)
Keinich #1        0.280000   0.000000   0.280000 (  0.285213)
Keinich #2        0.350000   0.010000   0.360000 (  0.354640)
Magnus Holm:      0.410000   0.000000   0.410000 (  0.411010)
Abinoam #1:       0.470000   0.030000   0.500000 (  0.498782)


SIZE
10000000  # 10_000_000 that is; n = 1 here

                      user     system      total        real
Ralph Shneiver:   1.710000   0.040000   1.750000 (  1.748137)
Sigurd:           2.000000   0.010000   2.010000 (  2.012496)
Keinich #1        1.380000   0.030000   1.410000 (  1.409462)
Keinich #2        1.750000   0.010000   1.760000 (  1.760997)
Magnus Holm:      1.990000   0.020000   2.010000 (  2.014282)
Abinoam #1:       2.500000   0.060000   2.560000 (  2.562646)

All solutions in Ruby 1.9.3 are relatively "linear".

Keinich #1 is the fastest.

I believe that, because I limited the dataset to 100 different values, the
number of bins is fixed and O(n) behavior can be achieved (not the
O(n.log(n)) that I had, incorrectly, expected).
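
To make that concrete (a minimal sketch, not part of the benchmark): when the
values are known to lie in 0...100, as in the generated dataset above, the
"bins" can even be a fixed-size array, so every element costs a single index
update and the total work stays O(n).

counts = Array.new(100, 0)       # one bin per possible value
ar.each { |x| counts[x] += 1 }   # O(n): one update per element
result = Hash[(0...100).map { |v| [v, counts[v]] }.reject { |_, c| c.zero? }]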

RUBY 1.8.7:
For ruby 1.8.7 I could not test the last 2 solutions (methods not
implemented).
For the first 4, _significant_ differences arise.

SIZE
1000000  # 1_000_000 that is; n = 1

                     user     system      total        real
Ralph Shneiver:  0.370000   0.010000   0.380000 (  0.369785)
Sigurd:          2.320000   0.030000   2.350000 (  2.360403)
Keinich #1       1.520000   0.040000   1.560000 (  1.562627)
Keinich #2      16.520000   0.100000  16.620000 ( 16.623032)

SIZE
2000000  # 2_000_000 that is; n = 1
                     user     system      total        real
Ralph Shneiver:  0.720000   0.010000   0.730000 (  0.737673)
Sigurd:          8.040000   0.110000   8.150000 (  8.142827)
Keinich #1       5.670000   0.040000   5.710000 (  5.716364)
Keinich #2      52.600000   0.480000  53.080000 ( 53.100823)

SIZE
10000000  # 10_000_000 that is; n = 1

                     user     system      total        real
Ralph Shneiver:  3.680000   0.000000   3.680000 (  3.680230)

Striking: only the solution of Ralph Shneiver remains linear in MRI 1.8.7.


JRUBY (1.6.5.1)
$ ruby -v
jruby 1.6.5.1 (ruby-1.8.7-p330) (2011-12-27 1bf37c2) (OpenJDK Server VM
1.6.0_23) [linux-i386-java]

I also tested JRuby out of curiosity.

SIZE
1000000  # 1_000_000 that is; n = 1
                     user     system      total        real
Ralph Shneiver:  0.244000   0.000000   0.244000 (  0.220000)
Sigurd:          0.264000   0.000000   0.264000 (  0.264000)
Keinich #1       0.112000   0.000000   0.112000 (  0.112000)
Keinich #2      11.256000   0.000000  11.256000 ( 11.256000)

SIZE
2000000  # 2_000_000 that is; n = 1
                     user     system      total        real
Ralph Shneiver:  0.392000   0.000000   0.392000 (  0.368000)
Sigurd:          0.439000   0.000000   0.439000 (  0.438000)
Keinich #1       0.196000   0.000000   0.196000 (  0.196000)
Keinich #2      14.462000   0.000000  14.462000 ( 14.462000)

SIZE
10000000  # 10_000_000 that is; n = 1

                     user     system      total        real
Ralph Shneiver:  1.604000   0.000000   1.604000 (  1.581000)
Sigurd:          1.769000   0.000000   1.769000 (  1.769000)
Keinich #1       0.967000   0.000000   0.967000 (  0.967000)
Keinich #2       8.978000   0.000000   8.978000 (  8.978000)


Interesting again.

These 2 solutions were not yet available on JRuby 1.6.5.1:

Magnus Holm:   NoMethodError: undefined method `each_with_object' for
#<Array:0x18eb7b8>
Abinoam #1:    NoMethodError: undefined method `chunk' for
#<Array:0x1719f30>

Ralph Shneiver remains linear again.

But then again ... Keinich #1 is significantly faster than the others at
10_000_000 elements.
Keinich #2 seems rather unpredictable.

JRUBY HEAD (in rvm):


$ ruby -v
jruby 1.7.0.dev (ruby-1.8.7-p357) (2012-01-17 7de254f) (Java HotSpot(TM)
Server VM 1.6.0_26) [linux-i386-java]

SIZE
1000000  # 1_000_000 that is; n = 1

                     user     system      total        real
Ralph Shneiver:  0.238000   0.000000   0.238000 (  0.223000)
Sigurd:          0.286000   0.000000   0.286000 (  0.286000)
Keinich #1       0.120000   0.000000   0.120000 (  0.120000)
Keinich #2       7.841000   0.000000   7.841000 (  7.841000)

SIZE
2000000  # 2_000_000 that is; n = 1

                     user     system      total        real
Ralph Shneiver:  0.426000   0.000000   0.426000 (  0.410000)
Sigurd:          0.478000   0.000000   0.478000 (  0.478000)
Keinich #1       0.213000   0.000000   0.213000 (  0.213000)
Keinich #2      21.928000   0.000000  21.928000 ( 21.928000)

SIZE
10000000  # 10_000_000 that is; n = 1

                     user     system      total        real
Ralph Shneiver:  1.550000   0.000000   1.550000 (  1.535000)
Sigurd:          1.795000   0.000000   1.795000 (  1.795000)
Keinich #1       1.060000   0.000000   1.060000 (  1.060000)
Keinich #2     116.826000   0.000000 116.826000 (116.826000)

Similar results to JRuby 1.6.5.1

Caveat:
I did not check the correctness of the results, only the timing.

Questions:

Why would ruby 1.9.3 be so much better at this than ruby 1.8.7?
Could it be because the Hash is now "ordered", so an efficient algorithm can
be used when adding an entry to a bin? Or is it object creation?

Why are certain methods leading to non-linear behavior?

My conclusions (?):


* Be careful about performance with large data sets.

* Fastest overall: Keinich #1 on JRuby 1.6.5.1

* CRuby 1.9.3 seems to be the only one that remains linear for all
  solutions on large data sets.

* "Ralph Shneiver" is the only solution that remains linear for
  all tested rubies (MRI 1.9.3, MRI 1.8.7, JRuby 1.6.5.1 JRuby head)

HTH,

Peter
