Dominik Bathon wrote:
> Here is my solution:
> 
> The result:
> 
> $ time ruby sample.rb 5_000_000 1_000_000_000 > big_sample
> 
> real    0m30.493s
> user    0m29.435s
> sys     0m0.681s
> 
> on a Pentium M 1500MHz, 512MB RAM

$ time ruby Sampling.rb 5_000_000 1_000_000_000 > big_sample

real    5m20.404s
user    5m18.600s
sys     0m1.480s

on a 1.2 GHz Duron with 512 MB RAM.

The odd thing is, that I could not see where to optimise to get near the
below 1 minute border.

In the end I tried 5 approaches (one being inspired by Dominiks idea
with the Hash, one by Joost).

I wrote a simple class to hold the different methods I used, where
@count is the number of members to be reported, @limit is the upper
bound of the range. @numbers is an Array and @numberHash a - guess - Hash.

The obvious method (at least for me) was to insert a random number into
an Array, checking whether the Array did not include the number and
redo-ing if it was already included.

def generate_with_array_include
  @numbers.clear
  @count.times do
    newNumber = rand(@limit)
    if @numbers.include?(newNumber)
      redo
    else
      @numbers << newNumber
    end
  end
end

A different way to do the same is

def generate_with_uniq_first
  @numbers.clear
  @count.times do
    @numbers << rand(@limit)
  end
  @numbers.uniq!
  while @numbers.length < @count
    @numbers << rand(@limit)
    @numbers.uniq!
  end
end

The idea behind this is, that with a big difference between @count and
@limit there is a slight chance, that no duplicate numbers will be
returned by Kernel#rand. That needs to be checked - so call Array#uniq!
to delete the duplicates. After that simply loop until the Array is full
(means, it has the required number of elements), calling uniq! in each
iteration. That performed better than 'generate_with_array_include' (30
times faster).

def generate_with_uniq_second
  @numbers.clear
  while @numbers.length < @count
    (@count - @numbers.length).times do
      @numbers << rand(@limit)
    end
    @numbers.uniq!
  end
end

This is even faster (takes only half the time as
'generate_with_uniq_first') and looks better to me.

The 4th method uses the Hash approach Dominik chose.

def generate_with_hash_has_key
  @numberHash.clear
  @count.times do
    newNumber = rand(@limit)
    if @numberHash.has_key?(newNumber)
      redo
    else
      @numberHash[newNumber] = nil
    end
  end
  @numbers = @numberHash.keys
end

This is an adaptation to using hash instead of Array from my initial
algorithm.

This method is slightly faster than my second try with uniq! - with
@count being 5_000 and @limit 1_000_000. With bigger numbers the
difference grows.

The last line stores the keys in the @numbers-Array, so I didn't need to
change my output method - which looks like that:

def to_s
  @numbers.sort.join("\n")
end

The 5th method is inspired by Joosts Hash#length idea. After I read it I
thought "This is so simple - why did you spend half the day thinking
about how to reach their 'below 1 minute' timings?".

def generate_with_hash_length
  while @numberHash.length < @count
    @numberHash[rand(@limit)] = nil
  end
  @numbers = @numberHash.keys
end

And this is the winner (from what I tried).

Here are some timings (without output) from a smaller sample (look at
'generate_with_array_include', I really didn't want to use this with
bigger numbers)

ruby -v Sampling.rb 50_000 10_000_000
ruby 1.8.2 (2005-04-11) [i386-linux]
       user     system      total        real
generate_with_uniq_first 17.790000   0.020000  17.810000 ( 17.806102)
generate_with_uniq_second  0.950000   0.010000   0.960000 (  0.956298)
generate_with_array_include 243.300000   0.090000 243.390000 (243.969158)
generate_with_hash_has_key  0.310000   0.000000   0.310000 (  0.308837)
generate_with_hash_length  0.020000   0.000000   0.020000 (  0.019657)

And here some timings (without output as well) from the top3 methods
(using the original parameters)

ruby -v Sampling.rb 5_000_000 1_000_000_000
ruby 1.8.2 (2005-04-11) [i386-linux]
      user     system      total        real
generate_with_uniq_second 95.240000   1.070000  96.310000 ( 96.511207)
generate_with_hash_has_key 63.570000   0.970000  64.540000 ( 64.641186)
generate_with_hash_length  5.720000   0.000000   5.720000 (  5.718732)

So, even my best method using Array and uniq! turned out to be not
suited. Well, at the end I discovered the bmbm method of Benchmark. It
gives different results for the second run:

ruby -v Sampling.rb 5_000_000 1_000_000_000
ruby 1.8.2 (2005-04-11) [i386-linux]
Rehearsal --------------------------------------------------------------
generate_with_uniq_second   95.640000   0.850000  96.490000 ( 96.563648)
generate_with_hash_has_key  63.280000   1.270000  64.550000 ( 64.687427)
generate_with_hash_length    5.720000   0.000000   5.720000 (  5.718150)
--------------------------------------------------- total: 166.760000sec

                                 user     system      total        real
generate_with_uniq_second   99.790000   0.740000 100.530000 (100.740020)
generate_with_hash_has_key  45.050000   0.830000  45.880000 ( 45.987040)
generate_with_hash_length    3.130000   0.000000   3.130000 (  3.132459)


I would have never thought about using a Hash in this quiz. I really do
not need the value of any stored key. It is an Array-problem. List of
numbers. Sorted. (unique)

Why use a Hash?

Some of you did use it. Do you mind telling me why?

Anyway, it was a great quiz.

Final timings, now using 'generate_with_hash_length':

time ruby -d Sampling.rb 5_000_000 1_000_000_000 > big_sample_new

real    1m17.537s
user    0m58.200s
sys     0m1.180s

Stefan Mahlitz