Quoting Matthew Smillie <M.B.Smillie / sms.ed.ac.uk>: > On Jun 20, 2006, at 4:08, Daniel Martin wrote: >> Note that the discussion there does not apply to the implementation >> quoted, but rather to this implementation: >> >> a.sort_by{rand(a.size)}.each {|e| print e , ' '} >> >> which I think we can all see the problem with. Absent any >> collisions in the values chosen by rand each time through the >> block, the short implementation quoted above is unbiased. > > You don't provide anything to back up this assertion. What makes > this implementation unbiased? What makes the other implementation > biased? What's the crucial difference between the two? I'm sorry; I thought it would be obvious to anyone who had read the Haskell text. In any case, the problem is the likelihood of collisions. When you use just {rand}, the chance of collisions is very low, on the order of 2**(- how many bits are in the random number implementation). When you use {rand(a.size)} the chance of collision is comparatively high. When keys collide, then the search order you get depends on the implementation of the sorting algorithm; note that it doesn't matter whether or not the sort is a stable sort - you get whatever order the sort function ends up with when given identical keys. (note that it appears that ruby's sort_by is stable, but that's irrelevant; you could replace "sort_by" with "reverse.sort_by" and you'd still get bias, just bias in a different way) When keys don't collide, the sorting algorithm is irrelevant. (well, so long as the algorithm works at all) All sorting algorithms will end up putting the list in the same order given that collection of keys. Now, some empirical numbers as evidence. First let's try the algorithm I claim is unbiased. Let's see what the first number is when shuffling (1..3): irb(main):024:0> a = Hash.new(0); 99999.times { a[(1..3).sort_by{rand}[0]] += 1}; a => {1=>33348, 2=>33393, 3=>33258} Note that each value has approximately the same likelihood, as we would expect. I suppose we could go and do a chi-squared test on the results, but I'm willing to just eyeball it and say that that's a good result. I'll note that the average initial value is 1.99909999099991, which is really close to 2. Now let's try the algorithm which the Haskell paper was talking about, and which I said was biased: irb(main):028:0> a = Hash.new(0); 99999.times { a[(1..3).sort_by{rand(3)}[0]] += 1}; a => {1=>52017, 2=>18490, 3=>29492} Note how comparatively unlikely it is to get a 2 first. The average initial value is around 1.77. Note that this bias lessens as the array (and therefore, the pool of keys) grows larger. The Haskell paper, when demonstrating bias, used an array of size two. > For sorting purposes, though, the particular range of > the keys is irrelevant, all that matters is that the keys are > comparable (you could use strings if you felt like it, too). Except, as I've pointed out, no. It's all about how likely you are to get key collisions. With a limited integer range to choose from, the chance is comparatively high. > Regarding whether or not bias actually exists, I didn't say that the > sorting implementation is definitely going to be biased, merely that > implementations which rely on a sort are potentially biased > *depending on the underlying sorting algorithm*. Actually, the sort_by shuffling algorithm with a high chance of key collision is biased regardless of the details of the underlying sorting algorithm. Let's take a two element list, as the haskell paper did, and consider two possible sorting algorithms for sorting two-element lists: ALGORITM A: if a[0].key >= a[1].key then a.reverse else a end ALGORITM B: if a[0].key > a[1].key then a.reverse else a end In fact, every sorting algorithm is going to be equivalent to one of these when applied to a two element list. (because, given equal keys, there are only two choices for what to do with the list) Now, let's look at what happens when we use %w(x y).sort_by{rand(2)}: 1/4 of the time: The keys chosen are [0,1] - both algorithms produce %w(x y) 1/4 of the time: The keys chosen are [1,0] - both algorithms produce %w(y x) 1/2 of the time: The keys chosen are the same. Algorithm A produces %w(y x) Algorithm B produces %w(x y) That is, a system with an algorithm A-type sort will produce %w(y x) three-fourths of the time. A system with an algorithm B-type sort will produce %w(x y) three-fourths of the time. Both systems are biased, and only algorithm B-type systems have a stable sort. Note that if we threw out and re-shuffled all the cases where there's any key collision, we'd have an unbiased shuffle regardless of sorting algorithms. We'd also have to reshuffle a two-element list an average of two times each time we wanted a proper shuffle, a three-element list 4.5 times on average, a four element list 10.67 times, and an n-element list n**n / n! times. Not something you want to do. >> It's still less effecient than a Fisher-Yates >> shuffle (at least in big-O notation; as a practical matter, I'd need to see >> some benchmarks) > > Acceptable benchmarks always depend on your intended use, but here > are some simple wall-clock ones one some large arrays (large arrays > because that's the case where performance might matter). The results > are that f_y takes ~75% of the time of sort in the first case, and > ~60% of the time in the second case, which is nice, but not nearly > what the complexity predicts (is it ever?) Makes me wonder what the > difference would be if they were in C. Ew. I meant "efficient". In any case, it's nice to see results that agree with the theory. I has suspected that the sort_by method might prove faster because a large proportion of its computation is done in C (the actual sort) as compared to the all-ruby F-Y shuffle. I gotta get a ruby translation of my sig. -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/