Quoting Matthew Smillie <M.B.Smillie / sms.ed.ac.uk>:

> On Jun 20, 2006, at 4:08, Daniel Martin wrote:
>> Note that the discussion there does not apply to the implementation  
>> quoted, but rather to this implementation:
>>
>> a.sort_by{rand(a.size)}.each {|e| print e , ' '}
>>
>> which I think we can all see the problem with.  Absent any  
>> collisions in the values chosen by rand each time through the  
>> block, the short implementation quoted above is unbiased.
>
> You don't provide anything to back up this assertion.  What makes
> this implementation unbiased?  What makes the other implementation
> biased?  What's the crucial difference between the two?

I'm sorry; I thought it would be obvious to anyone who had read the Haskell
text.  In any case, the problem is the likelihood of collisions.  When you use
just {rand}, the chance of any two keys colliding is very low, on the order of
2**(-b), where b is the number of bits of precision in the random floats.  When
you use {rand(a.size)}, the chance of a collision is comparatively high.
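
If you want to see that concretely, here's a rough sketch of mine (the helper
name is made up) that just counts how often at least two keys collide:

  def collision_rate(n, trials = 10_000)
    hits = 0
    trials.times do
      keys = Array.new(n) { yield }
      hits += 1 if keys.uniq.size < n
    end
    hits.to_f / trials
  end

  n = 3
  p collision_rate(n) { rand }      # float keys: collisions essentially never
  p collision_rate(n) { rand(n) }   # integer keys 0...n: collisions most runs

For n = 3 the second rate should come out near 1 - 3!/3**3, i.e. around 0.78.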

When keys collide, the order you get depends on the implementation of the
sorting algorithm; note that it doesn't matter whether or not the sort is a
stable sort - you get whatever order the sort function ends up with when given
identical keys.  (Note that it appears that ruby's sort_by is stable, but that's
irrelevant; you could replace "sort_by" with "reverse.sort_by" and you'd still
get bias, just bias in a different way.)
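
Here's a quick sketch of mine, if you want to see that for yourself: tally the
first element both ways and note that both tallies come out lopsided, just
lopsided in different directions.

  a = Hash.new(0); b = Hash.new(0)
  99999.times { a[(1..3).sort_by{rand(3)}[0]] += 1 }
  99999.times { b[(1..3).to_a.reverse.sort_by{rand(3)}[0]] += 1 }
  p a   # lopsided one way
  p b   # lopsided the other way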

When keys don't collide, the sorting algorithm is irrelevant (well, so long as
the algorithm works at all).  All sorting algorithms will end up putting the
list in the same order given that collection of keys.
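
For instance, with any fixed set of distinct keys, sorting forwards or sorting
the reversed list gives the same result (a trivial sketch, but it makes the
point):

  keys = { "x" => 0.3, "y" => 0.1, "z" => 0.7 }
  p %w(x y z).sort_by { |e| keys[e] }            # ["y", "x", "z"]
  p %w(x y z).reverse.sort_by { |e| keys[e] }    # ["y", "x", "z"] again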

Now, some empirical numbers as evidence.  First let's try the algorithm I claim
is unbiased.  Let's see what the first number is when shuffling (1..3):

irb(main):024:0> a = Hash.new(0); 99999.times { a[(1..3).sort_by{rand}[0]] += 1 }; a
=> {1=>33348, 2=>33393, 3=>33258}

Note that each value has approximately the same likelihood, as we would expect.
I suppose we could go and do a chi-squared test on the results, but I'm willing
to just eyeball it and say that that's a good result.  I'll note that the
average initial value is 1.99909999099991, which is really close to 2.
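
For anyone who would rather have the chi-squared number than the eyeball test,
it's a one-liner on those counts (my throwaway code, nothing to do with the
shuffle itself):

  counts   = {1=>33348, 2=>33393, 3=>33258}
  expected = 99999 / 3.0
  chi2 = counts.values.inject(0.0) { |sum, o| sum + (o - expected)**2 / expected }
  p chi2   # about 0.28 here, well under 5.99, the 5% cutoff for 2 d.f.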

Now let's try the algorithm which the Haskell paper was talking about, and
which I said was biased:

irb(main):028:0> a = Hash.new(0); 99999.times { a[(1..3).sort_by{rand(3)}[0]] += 1 }; a
=> {1=>52017, 2=>18490, 3=>29492}

Note how comparatively unlikely it is to get a 2 first.  The average initial
value is around 1.77.

Note that this bias lessens as the array (and therefore the pool of keys) grows
larger.  The Haskell paper, when demonstrating bias, used an array of size two.

>         For sorting purposes, though, the particular range of
> the keys is irrelevant, all that matters is that the keys are
> comparable (you could use strings if you felt like it, too).

Except, as I've pointed out, no.  It's all about how likely you are to get key
collisions.  With a limited integer range to choose from, the chance is
comparatively high.

> Regarding whether or not bias actually exists, I didn't say that the
> sorting implementation is definitely going to be biased, merely that
> implementations which rely on a sort are potentially biased
> *depending on the underlying sorting algorithm*.

Actually, the sort_by shuffling algorithm with a high chance of key collision is
biased regardless of the details of the underlying sorting algorithm.  Let's
take a two-element list, as the Haskell paper did, and consider two possible
sorting algorithms for sorting two-element lists:

ALGORITHM A: if a[0].key >= a[1].key then a.reverse else a end
ALGORITHM B: if a[0].key >  a[1].key then a.reverse else a end

In fact, every sorting algorithm is going to be equivalent to one of these when
applied to a two-element list.  (Because, given equal keys, there are only two
choices for what to do with the list.)

Now, let's look at what happens when we use %w(x y).sort_by{rand(2)}:

1/4 of the time: The keys chosen are [0,1] - both algorithms produce %w(x y)
1/4 of the time: The keys chosen are [1,0] - both algorithms produce %w(y x)
1/2 of the time: The keys chosen are the same.
                  Algorithm A produces %w(y x)
                  Algorithm B produces %w(x y)

That is, a system with an algorithm A-type sort will produce %w(y x)
three-fourths of the time.  A system with an algorithm B-type sort will produce
%w(x y) three-fourths of the time.  Both systems are biased, and only algorithm
B-type systems have a stable sort.
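
If you'd rather watch that happen than take the case analysis on faith, here's
a sketch - sort_a and sort_b are just my stand-ins for the two algorithms above,
each taking a list of [key, element] pairs:

  def sort_a(pairs)   # ALGORITHM A: reverses on tied keys
    pairs[0][0] >= pairs[1][0] ? pairs.reverse : pairs
  end

  def sort_b(pairs)   # ALGORITHM B: keeps order on tied keys (stable)
    pairs[0][0] > pairs[1][0] ? pairs.reverse : pairs
  end

  a = Hash.new(0); b = Hash.new(0)
  99999.times do
    pairs = %w(x y).map { |e| [rand(2), e] }
    a[sort_a(pairs).map { |k, e| e }.join] += 1
    b[sort_b(pairs).map { |k, e| e }.join] += 1
  end
  p a   # "yx" should account for about three-fourths of the trials
  p b   # "xy" should account for about three-fourths of the trials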

Note that if we threw out and re-shuffled all the cases where there's any key
collision, we'd have an unbiased shuffle regardless of sorting algorithms.
We'd also have to reshuffle a two-element list an average of two times each
time we wanted a proper shuffle, a three-element list 4.5 times on average, a
four-element list 10.67 times, and an n-element list n**n / n! times.  Not
something you want to do.
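
Just to make concrete what I mean by throwing out and re-shuffling, here's a
sketch (the name is mine, and this is not something I'm recommending):

  def rejection_shuffle(arr)
    n = arr.size
    loop do
      keyed = arr.map { |e| [rand(n), e] }
      keys  = keyed.map { |k, e| k }
      # Only accept a draw with no key collisions; expect n**n / n! passes.
      return keyed.sort.map { |k, e| e } if keys.uniq.size == n
    end
  end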

>> It's still less effecient than a Fisher-Yates
>> shuffle (at least in big-O notation; as a practical matter, I'd  need to see
>> some benchmarks)
>
> Acceptable benchmarks always depend on your intended use, but here
> are some simple wall-clock ones one some large arrays (large arrays
> because that's the case where performance might matter).  The results
> are that f_y takes ~75% of the time of sort in the first case, and
> ~60% of the time in the second case, which is nice, but not nearly
> what the complexity predicts (is it ever?)  Makes me wonder what the
> difference would be if they were in C.

Ew.  I meant "efficient".  In any case, it's nice to see results that agree
with the theory.  I had suspected that the sort_by method might prove faster
because a large proportion of its computation is done in C (the actual sort) as
compared to the all-ruby F-Y shuffle.
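
For reference, the all-ruby F-Y shuffle I have in mind looks something like the
following - a sketch of mine, not Matthew's benchmark code, and the wall-clock
numbers will obviously vary from machine to machine:

  require 'benchmark'

  # In-place Fisher-Yates shuffle, written entirely in ruby.
  def fisher_yates!(arr)
    (arr.size - 1).downto(1) do |i|
      j = rand(i + 1)                  # pick from the not-yet-fixed prefix
      arr[i], arr[j] = arr[j], arr[i]
    end
    arr
  end

  a = (1..100_000).to_a
  Benchmark.bm(12) do |x|
    x.report("sort_by")      { a.sort_by { rand } }
    x.report("fisher_yates") { fisher_yates!(a.dup) }
  end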

I gotta get a ruby translation of my sig.

--
@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/;
map{y/X_/\n /;print}map{pop@$_}@/for@/