--20cf30334e5b2a68db04b6b75956 Content-Type: text/plain; charset=UTF-8 On Tue, Jan 17, 2012 at 2:44 AM, Abinoam Jr. <abinoam / gmail.com> wrote: > On Mon, Jan 16, 2012 at 10:05 PM, Abinoam Jr. <abinoam / gmail.com> wrote: > > user system total real > > Ralph Shneiver: 0.290000 0.000000 0.290000 ( 0.259640) > > Sigurd: 0.320000 0.000000 0.320000 ( 0.289873) > > Keinich #1 0.560000 0.000000 0.560000 ( 0.497736) > > Keinich #2 0.280000 0.000000 0.280000 ( 0.250843) > > Magnus Holm: 0.310000 0.000000 0.310000 ( 0.283344) > > Abinoam #1: 1.140000 0.000000 1.140000 ( 1.042744) > > > > Abinoam Jr. > > > > Sorry for the mess... (some error in the bench code + horrible text > wrapping) > > Apologizing with the gist... > https://gist.github.com/1624016 > Thanks. Very interesting. I assumed (incorrectly turns out for Ruby 1.9.3) that for large data sets there would be a significant performance difference. Because, upon determining the "bins" of uniq values, that is essentially a form of "sorting", which can be O(n^2) if not careful. Turns out I was wrong for ruby 1.9.3 (but right for some other rubies). I rewrote the code to create large datasets (array up to 10_000_000), but with the data inside a set of 0...100. require 'benchmark' n #ar 4,5,6,4,5,6,6,7] ar ].tap{|a| 10_000_000.times {a << rand(100)}} #1_000_000, 2_000_000, ... puts "SAMPLE of ar" puts ar[0...20] puts "SIZE" puts ar.size Benchmark.bm(15) do |b| b.report("Ralph Shneiver:"){ n.times { result ash.new(0); ar.each { |x| result[x] + }; result} } b.report("Sigurd:") { n.times { ar.inject(Hash.new(0)) {|res, x| res[x] +1; res } } } b.report("Keinich #1") { n.times { Hash[ar.group_by{|n|n}.map{|k,v|[k, v.size]}] } } b.report("Keinich #2") { n.times { Hash.new(0).tap{|h|ar.each{|n|h[n] +1}} } } b.report("Magnus Holm:") { n.times { ar.each_with_object(Hash.new(0)) { |x, res| res[x] + } } } b.report("Abinoam #1:") { n.times { Hash[ar.sort.chunk {|n| n}.map {|ix, els| [ix, els.size] } ] } } end RUBY 1.9.3: or ruby 1.9.3-p0 all solutions perform approx. the same (at least same order) and seem to stay quasi linear. SIZE 1000000 #[1_000_000 that is ; n ] user system total real Ralph Shneiver: 0.180000 0.000000 0.180000 ( 0.174283) Sigurd: 0.200000 0.000000 0.200000 ( 0.203652) Keinich #1 0.140000 0.000000 0.140000 ( 0.142833) Keinich #2 0.180000 0.000000 0.180000 ( 0.177456) Magnus Holm: 0.200000 0.000000 0.200000 ( 0.205895) Abinoam #1: 0.260000 0.000000 0.260000 ( 0.254554) SIZE 2000000 #[2_000_000 that is ; n ] user system total real Ralph Shneiver: 0.340000 0.010000 0.350000 ( 0.350032) Sigurd: 0.410000 0.000000 0.410000 ( 0.406483) Keinich #1 0.280000 0.000000 0.280000 ( 0.285213) Keinich #2 0.350000 0.010000 0.360000 ( 0.354640) Magnus Holm: 0.410000 0.000000 0.410000 ( 0.411010) Abinoam #1: 0.470000 0.030000 0.500000 ( 0.498782) SIZE 10000000 #[10_000_000 that is; n here] user system total real Ralph Shneiver: 1.710000 0.040000 1.750000 ( 1.748137) Sigurd: 2.000000 0.010000 2.010000 ( 2.012496) Keinich #1 1.380000 0.030000 1.410000 ( 1.409462) Keinich #2 1.750000 0.010000 1.760000 ( 1.760997) Magnus Holm: 1.990000 0.020000 2.010000 ( 2.014282) Abinoam #1: 2.500000 0.060000 2.560000 ( 2.562646) All solutions in Ruby 1.9.3 are relatively "linear". Keinich #1 is the fastest. I believe it is because I limited the dataset to 100 different values, the number of bins is fixed and order O(n) can be achieved (not the O(n.log(n)) that I had expected, incorrectly). RUBY 1.8.7: or ruby 1.8.7-p I could not test the last 2 solutions (methods not implemented). For the first 3, _significant_ differences arise. SIZE 1000000 #[1_000_000 that is ; n ] user system total real Ralph Shneiver: 0.370000 0.010000 0.380000 ( 0.369785) Sigurd: 2.320000 0.030000 2.350000 ( 2.360403) Keinich #1 1.520000 0.040000 1.560000 ( 1.562627) Keinich #2 16.520000 0.100000 16.620000 ( 16.623032) SIZE 2000000 #[2_000_000 that is ; n ] user system total real Ralph Shneiver: 0.720000 0.010000 0.730000 ( 0.737673) Sigurd: 8.040000 0.110000 8.150000 ( 8.142827) Keinich #1 5.670000 0.040000 5.710000 ( 5.716364) Keinich #2 52.600000 0.480000 53.080000 ( 53.100823) SIZE 10000000 #[10_000_000 that is ; n ] user system total real Ralph Shneiver: 3.680000 0.000000 3.680000 ( 3.680230) Striking: Only the solution of Ralph Shneiver remains linear in MRI 1.8.7 JRUBY (1.6.5.1) uby -v jruby 1.6.5.1 (ruby-1.8.7-p330) (2011-12-27 1bf37c2) (OpenJDK Server VM 1.6.0_23) [linux-i386-java] I also tested JRuby out of curiosity. SIZE 1000000 #[1_000_000 that is ; n ] user system total real Ralph Shneiver: 0.244000 0.000000 0.244000 ( 0.220000) Sigurd: 0.264000 0.000000 0.264000 ( 0.264000) Keinich #1 0.112000 0.000000 0.112000 ( 0.112000) Keinich #2 11.256000 0.000000 11.256000 ( 11.256000) SIZE 2000000 #[2_000_000 that is ; n ] user system total real Ralph Shneiver: 0.392000 0.000000 0.392000 ( 0.368000) Sigurd: 0.439000 0.000000 0.439000 ( 0.438000) Keinich #1 0.196000 0.000000 0.196000 ( 0.196000) Keinich #2 14.462000 0.000000 14.462000 ( 14.462000) SIZE 10000000 #[10_000_000 that is ; n ]] user system total real Ralph Shneiver: 1.604000 0.000000 1.604000 ( 1.581000) Sigurd: 1.769000 0.000000 1.769000 ( 1.769000) Keinich #1 0.967000 0.000000 0.967000 ( 0.967000) Keinich #2 8.978000 0.000000 8.978000 ( 8.978000) Interesting again. Those 2 solutions where not yet available on JRuby 1.6.5.1: Magnus Holm: NoMethodError: undefined method `each_with_object' for #<Array:0x18eb7b8> Abinoam #1: NoMethodError: undefined method `chunk' for #<Array:0x1719f30> Ralph Shneiver remains linear again. But then again ... Keinich #1 is significantly faster on 10_000_000 than the others. Keinich #2 seems to be not very predictable? JRUBY HEAD (in rvm): $ ruby -v jruby 1.7.0.dev (ruby-1.8.7-p357) (2012-01-17 7de254f) (Java HotSpot(TM) Server VM 1.6.0_26) [linux-i386-java] 1000000 #[1_000_000 that is ; n ] user system total real Ralph Shneiver: 0.238000 0.000000 0.238000 ( 0.223000) Sigurd: 0.286000 0.000000 0.286000 ( 0.286000) Keinich #1 0.120000 0.000000 0.120000 ( 0.120000) Keinich #2 7.841000 0.000000 7.841000 ( 7.841000) SIZE 2000000 #[2_000_000 that is ; n ] user system total real Ralph Shneiver: 0.426000 0.000000 0.426000 ( 0.410000) Sigurd: 0.478000 0.000000 0.478000 ( 0.478000) Keinich #1 0.213000 0.000000 0.213000 ( 0.213000) Keinich #2 21.928000 0.000000 21.928000 ( 21.928000) SIZE 10000000 #[10_000_000 that is ; n ] user system total real Ralph Shneiver: 1.550000 0.000000 1.550000 ( 1.535000) Sigurd: 1.795000 0.000000 1.795000 ( 1.795000) Keinich #1 1.060000 0.000000 1.060000 ( 1.060000) Keinich #2 116.826000 0.000000 116.826000 (116.826000) Similar results to JRuby 1.6.5.1 Caveat: id not check the correctness of the result, only the timing. Questions: Why would ruby 1.9.3. be so much better at this than ruby 1.8.7 ? Could it be because the Hash is now "ordered" so it can do an efficient algorithm when adding an entry to a bin? Or is it object creation? Why are certain methods leading to non-linear behavior? My conclusions (?): * be careful about performance with large data sets * Fastest overall: Keinich #1 on JRuby 1.6.5.1 * CRuby 1.9.3 seems to be the only on the remains linear for all solutions on large data sets. * "Ralph Shneiver" is the only solution that remains linear for all tested rubies (MRI 1.9.3, MRI 1.8.7, JRuby 1.6.5.1 JRuby head) HTH, Peter --20cf30334e5b2a68db04b6b75956--