On Jun 20, 2006, at 4:08, Daniel Martin wrote:

> Quoting Matthew Smillie <M.B.Smillie / sms.ed.ac.uk>:
>
>> On Jun 19, 2006, at 9:08, Kroeger, Simon (ext) wrote:
>>> I think it's a little bit too simple to be put in the standard lib.
>>>
>>> a.sort_by{rand}.each {|e| print e , ' '}
>>
>> But it turns out to not be that simple.  This can give a biased  
>> sort depending on the underlying sort algorithm,
>>
>> The linked Haskell implementation also gives a very thorough  
>> discussion, including a demonstration of how a stable sorting  
>> algorithm will yield biased results.
>
> Note that the discussion there does not apply to the implementation  
> quoted, but rather to this implementation:
>
> a.sort_by{rand(a.size)}.each {|e| print e , ' '}
>
> which I think we can all see the problem with.  Absent any  
> collisions in the values chosen by rand each time through the  
> block, the short implementation quoted above is unbiased.

You don't provide anything to back up this assertion.  What makes  
this implementation unbiased?  What makes the other implementation  
biased?  What's the crucial difference between the two?

(technical note - #sort_by calculates the key assigned to each  
element only once.  In other words, each element of 'a' is assigned  
one and only one random key, and is then sorted according to that  
key.  See the docs for more info.)
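This is easy to check directly. A tiny sketch (hypothetical array contents) showing that #sort_by evaluates its block exactly once per element:

```ruby
calls = 0
a = [3, 1, 2]
# sort_by computes each element's key once, then sorts by the keys
sorted = a.sort_by { |e| calls += 1; e }
# calls is 3: one block invocation per element, regardless of how many
# comparisons the underlying sort performs
```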

The only difference between the two implementations is the range of  
the random keys - the first implementation uses a float in the range  
[0, 1.0), the second an integer in the range (0...a.size).  For  
sorting purposes, though, the particular range of the keys is  
irrelevant; all that matters is that the keys are comparable (you  
could use strings if you felt like it, too).

This is proven very simply: given a finite set of N keys in the float  
range [0, 1.0), they can be mapped to a finite set of keys in the  
integer range (0...N) without changing their sort order: simply sort  
the float keys, then enumerate them.  The result is a set of integer  
keys with exactly the same sort order as the float keys.  The reverse  
mapping is just as simple.
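A concrete sketch of that mapping (the particular float keys are made up for illustration): replace each float key with its rank, and the sort order is unchanged.

```ruby
float_keys = [0.72, 0.05, 0.33, 0.91]
# map each float key to its rank among the sorted keys
ranks = float_keys.sort.each_with_index.to_h   # float => integer rank
int_keys = float_keys.map { |f| ranks[f] }     # => [2, 0, 1, 3]

# sorting by the float keys and by the integer ranks orders the
# elements identically
data = %w[a b c d]
by_float = data.zip(float_keys).sort_by { |_, k| k }.map(&:first)
by_int   = data.zip(int_keys).sort_by { |_, k| k }.map(&:first)
# by_float == by_int
```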

In other words: the implementations are identical with respect to the  
sort.  The sort is the only re-ordering performed, and so is  
responsible for any bias in the resulting shuffle.  Therefore, if  
there's a bias in one implementation, there's a bias in the other as  
well.

Regarding whether or not bias actually exists, I didn't say that the  
sorting implementation is definitely going to be biased, merely that  
implementations which rely on a sort are potentially biased  
*depending on the underlying sorting algorithm*.

This was precisely the problem in the Haskell paper - it assumed a  
stable sort, which introduces bias.  I believe Ruby uses quicksort,  
which isn't stable, and so the method isn't biased.  However, Ruby  
makes no guarantees about the underlying sorting algorithm; were it  
ever to change, it could introduce bias into the shuffle.  One such  
change might be a sort that is stable under some conditions, such as  
a quicksort that optimises by switching to insertion sort on small  
partitions.
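The stable-sort bias is easy to see in a small simulation.  This is not Ruby's actual sort_by (which isn't guaranteed stable either way); stability is forced here by breaking key ties on the original index, which is exactly what a stable sort does implicitly:

```ruby
srand(1)  # seed chosen arbitrarily, for reproducibility
counts = Hash.new(0)
10_000.times do
  # keys from rand(3) collide often; breaking ties on the original
  # index simulates a stable sort
  shuffled = [0, 1, 2].each_with_index
                      .sort_by { |_, i| [rand(3), i] }
                      .map(&:first)
  counts[shuffled] += 1
end
# an unbiased shuffle would give each of the 6 permutations ~1667
# occurrences; the stable tie-breaking makes the identity permutation
# [0, 1, 2] show up far more often than that (about 10/27 of the time)
```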

If (as we're assuming) an unbiased shuffle is important, then it's  
dangerous to rely on non-guaranteed implementation details to ensure  
that.

> It's still less effecient [sic] than a Fisher-Yates shuffle (at  
> least in big-O notation; as a practical matter, I'd need to see  
> some benchmarks)

Acceptable benchmarks always depend on your intended use, but here  
are some simple wall-clock ones on some large arrays (large arrays  
because that's the case where performance might matter).  The results  
are that f_y takes ~75% of the time of sort in the first case, and  
~60% in the second, which is nice, but not nearly the gap the  
complexity predicts (is it ever?).  Makes me wonder what the  
difference would be if they were in C.

matthew smillie


# a = (1..100000).map { rand(100000) }
                 user     system      total        real
in-place f_y   1.230000   0.010000   1.240000 (  1.320484)
  f_y_shuffle   1.370000   0.010000   1.380000 (  1.471718)
sort_shuffle   1.870000   0.040000   1.910000 (  2.004320)

# a = (1..1000000).map { rand(1000000) }
                 user     system      total        real
in-place f_y  12.670000   0.080000  12.750000 ( 13.202975)
  f_y_shuffle  13.980000   0.150000  14.130000 ( 15.091745)
sort_shuffle  23.260000   0.410000  23.670000 ( 24.704380)

require 'benchmark'
include Benchmark

class Array
   # in-place Fisher-Yates shuffle
   def shuffle!
     each_index { |i|
       # pick a random index in i...size and swap it into place
       r = rand(size - i) + i
       self[i], self[r] = self[r], self[i]
     }
   end
   # non-destructive shuffle: reorder via a shuffled index array
   def shuffle
     values_at(*(0...size).to_a.shuffle!)
   end
   # shuffle by sorting on random keys
   def sort_shuffle
     sort_by { rand }
   end
end

# test array
a = (1..100000).map { rand(100000) }
bm(10) { |b|
   b.report("in-place f_y") { a.shuffle! }
   b.report(" f_y_shuffle") { a.shuffle }
   b.report("sort_shuffle") { a.sort_shuffle }
}
a = (1..1000000).map { rand(1000000) }
bm(10) { |b|
   b.report("in-place f_y") { a.shuffle! }
   b.report(" f_y_shuffle") { a.shuffle }
   b.report("sort_shuffle") { a.sort_shuffle }
}
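For completeness, a quick uniformity check on the Fisher-Yates swap used above (repeated here as a standalone method so the check is self-contained): over many runs, every permutation of a 3-element array should appear about equally often.

```ruby
srand(2)  # seed chosen arbitrarily, for reproducibility

# same swap loop as Array#shuffle! above, as a plain method
def fy_shuffle!(a)
  a.each_index do |i|
    r = rand(a.size - i) + i
    a[i], a[r] = a[r], a[i]
  end
  a
end

counts = Hash.new(0)
12_000.times { counts[fy_shuffle!([0, 1, 2])] += 1 }
# all 6 permutations appear, each roughly 2000 times
```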