On Jun 20, 2006, at 4:08, Daniel Martin wrote:

> Quoting Matthew Smillie <M.B.Smillie / sms.ed.ac.uk>:
>
>> On Jun 19, 2006, at 9:08, Kroeger, Simon (ext) wrote:
>>> I think it's a little bit too simple to be put in the standard lib.
>>>
>>> a.sort_by{rand}.each {|e| print e , ' '}
>>
>> But it turns out to not be that simple.  This can give a biased
>> sort depending on the underlying sort algorithm,
>>
>> The linked Haskell implementation also gives a very thorough
>> discussion, including a demonstration of how a stable sorting
>> algorithm will yield biased results.
>
> Note that the discussion there does not apply to the implementation
> quoted, but rather to this implementation:
>
> a.sort_by{rand(a.size)}.each {|e| print e , ' '}
>
> which I think we can all see the problem with.  Absent any
> collisions in the values chosen by rand each time through the
> block, the short implementation quoted above is unbiased.

You don't provide anything to back up this assertion.  What makes
this implementation unbiased?  What makes the other implementation
biased?  What's the crucial difference between the two?

(technical note - #sort_by calculates the key assigned to each
element only once.  In other words, each element of 'a' is assigned
one and only one random key, and is then sorted according to that
key.  See the docs for more info.)

The only difference between the two implementations is the range of
the random keys: the first implementation uses a float in the range
[0, 1.0), the second an integer in the range (0...a.size).  For
sorting purposes, though, the particular range of the keys is
irrelevant; all that matters is that the keys are comparable (you
could use strings if you felt like it, too).

This is proven very simply: given a finite set of N distinct keys in
the float range [0, 1.0), they can be mapped to a finite set of keys
in the integer range (0...N) without changing their sort order: simply
sort the float keys, and then enumerate them.  The result is a set of
integer keys with a sort order identical to that of the float keys.
The reverse mapping is just as simple.
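The rank mapping can be spelled out in a few lines of Ruby (a quick sketch; the variable names are mine):

```ruby
# Replace each float key by its integer rank; sorting by the ranks
# must order the elements exactly as sorting by the floats did.
float_keys = Array.new(10) { rand }
ranks = {}
float_keys.sort.each_with_index { |k, i| ranks[k] = i }
int_keys = float_keys.map { |k| ranks[k] }

by_float = (0...10).sort_by { |i| float_keys[i] }
by_rank  = (0...10).sort_by { |i| int_keys[i] }
by_float == by_rank  # => true
```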

In other words: the implementations are identical with respect to the
sort.  The sort is the only re-ordering performed, and so is
responsible for any bias in the resulting shuffle.  And therefore, if
there's a bias in one implementation, there's a bias with the other
one as well.

Regarding whether or not bias actually exists, I didn't say that the
sorting implementation is definitely going to be biased, merely that
implementations which rely on a sort are potentially biased
*depending on the underlying sorting algorithm*.

This was precisely the problem in the Haskell paper - it assumed a
stable sort, which introduces bias.  I believe Ruby uses quicksort,
which isn't stable, and so the method isn't biased.  However, Ruby
makes no guarantees about the underlying sorting algorithm; were it
ever to change, it would potentially introduce bias into the
shuffle.  One such change might be to introduce a sort which has
locally-stable conditions, such as a quicksort that optimises using
insertion sort on small partitions.
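To see how stability plus key collisions skews the result, here is a small simulation (my own sketch, not from the thread; since Ruby's sort_by makes no stability guarantee, stability is simulated by tie-breaking on the original index):

```ruby
# Shuffle [:a, :b, :c] by sorting on keys drawn from 0...3, with a
# stable tie-break, and count the resulting permutations.  Under an
# unbiased shuffle every permutation appears with probability 1/6;
# with a stable sort the identity ordering has probability 10/27
# (all non-decreasing key sequences) while the full reversal has
# only 1/27 (strictly decreasing keys).
srand(42)  # fixed seed for reproducibility
counts = Hash.new(0)
trials = 27_000
a = [:a, :b, :c]
trials.times do
  perm = a.each_with_index
          .sort_by { |_, i| [rand(a.size), i] }  # stable: ties keep order
          .map { |e, _| e }
  counts[perm] += 1
end
# counts[[:a, :b, :c]] comes out roughly ten times counts[[:c, :b, :a]]
```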

If (as we're assuming) an unbiased shuffle is important, then it's
dangerous to rely on non-guaranteed implementation details to ensure
that.

> It's still less effecient than a Fisher-Yates
> shuffle (at least in big-O notation; as a practical matter, I'd
> need to see
> some benchmarks)

Acceptable benchmarks always depend on your intended use, but here
are some simple wall-clock ones on some large arrays (large arrays
because that's the case where performance might matter).  The results
are that f_y takes ~75% of the time of sort in the first case, and
~60% of the time in the second case, which is nice, but not nearly
what the complexity predicts (is it ever?).  Makes me wonder what the
difference would be if they were in C.

matthew smillie

# a = (1..100000).map { rand(100000) }
user     system      total        real
in-place f_y   1.230000   0.010000   1.240000 (  1.320484)
f_y_shuffle   1.370000   0.010000   1.380000 (  1.471718)
sort_shuffle   1.870000   0.040000   1.910000 (  2.004320)

# a = (1..1000000).map { rand(1000000) }
user     system      total        real
in-place f_y  12.670000   0.080000  12.750000 ( 13.202975)
f_y_shuffle  13.980000   0.150000  14.130000 ( 15.091745)
sort_shuffle  23.260000   0.410000  23.670000 ( 24.704380)

require 'benchmark'
include Benchmark

class Array
  # in-place Fisher-Yates shuffle
  def shuffle!
    each_index do |i|
      r = rand(size - i) + i
      self[i], self[r] = self[r], self[i]
    end
  end

  # non-destructive shuffle via a shuffled index list
  def shuffle
    values_at(*(0...size).to_a.shuffle!)
  end

  # shuffle by sorting on random keys
  def sort_shuffle
    sort_by { rand }
  end
end

# test array
a = (1..100000).map { rand(100000) }
bm(10) { |b|
  b.report("in-place f_y") { a.shuffle! }
  b.report(" f_y_shuffle") { a.shuffle }
  b.report("sort_shuffle") { a.sort_shuffle }
}
a = (1..1000000).map { rand(1000000) }
bm(10) { |b|
  b.report("in-place f_y") { a.shuffle! }
  b.report(" f_y_shuffle") { a.shuffle }
  b.report("sort_shuffle") { a.sort_shuffle }
}
