Jacob Fugal <lukfugl / gmail.com> writes:

> On 7/16/05, Olaf Klischat <klischat / cs.tu-berlin.de> wrote:
>> > ezra:~/Sites ez$ head big_sample.txt
>> > 168
>> > 285
>> > 566
>> > 604
>> > 912
>> > 1183
>> > 1335
>> > 1473
>> > 1728
>> > 1919
>> > ezra:~/Sites ez$ tail big_sample.txt
>> > 999998155
>> > 999998313
>> > 999998484
>> > 999998680
>> > 999998825
>> > 999999151
>> > 999999330
>> > 999999465
>> > 999999621
>> > 999999877
>> > ezra:~/Sites ez$
>> 
>> Umm... I'm not sure, but that looks a bit too "equidistant" to be
>> truly random, doesn't it?
>> 
>> The sample being truly random means that the sample should be a truly
>> "drawing without putting back" (e.g. lottery) sample, so each possible
>> sample occurs with equal probability. So a sample like
>> 
>> 0
>> 1
>> 2
>> ..
>> ..
>> ..
>> 4999999
>> 
>> should occur with the same probability as any other more "likely" one.
>
> See my other posts in this thread about the actual probabilities. In
> short, since it's fully random, each possible sampling is as likely as
> any other possible sampling, but the number of samplings including at
> least ten numbers in the 99999xxxx range and at least ten numbers in
> the (00000)xxxx range is *much* higher than the number of samplings
> without numbers in those ranges. So the probability of getting a
> sampling that looks evenly spread out is much more likely than getting
> a sampling that's clustered.

Of course. I didn't mean to say that a really "clustered" sample like
0...4999999 (or any other specific sample) has any significant
probability.

But if you look at Ezra's output:

0
> 168
200
> 285
400
> 566
600
> 604
800
> 912
1000
> 1183
1200
> 1335
1400
> 1473
1600
> 1728
1800
> 1919


> 
> 
> 999998155
999998200
> 999998313
999998400
> 999998484
999998600
> 999998680
999998800
> 999998825
999999000
> 999999151
999999200
> 999999330
999999400
> 999999465
999999600
> 999999621
999999800
> 999999877
1000000000

See? One number per 200-number interval. Every time[1]. This hints at
a flawed implementation.
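
The interleaving above can also be checked mechanically. A minimal
sketch, using only the 20 values Ezra posted; the names HEAD, TAIL and
bucket_counts are mine, purely for illustration:

```ruby
# The 20 values from the head and tail of big_sample.txt, as posted.
HEAD = [168, 285, 566, 604, 912, 1183, 1335, 1473, 1728, 1919]
TAIL = [999998155, 999998313, 999998484, 999998680, 999998825,
        999999151, 999999330, 999999465, 999999621, 999999877]

# Count how many values fall into each interval [width*k, width*(k+1)).
# Returns a hash mapping the interval index k to its count.
def bucket_counts(values, width = 200)
  counts = Hash.new(0)
  values.each { |v| counts[v / width] += 1 }  # integer division picks the interval
  counts
end

puts bucket_counts(HEAD).inspect
puts bucket_counts(TAIL).inspect
```

Every interval index shows up with a count of exactly 1, with no index
skipped and none hit twice.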

I caught this because I had the same idea first :)

[1]

Unless I'm mistaken, in the 5e6-from-1e9 sampling, the probability
that a sample contains exactly one number from a given 200-number
interval is approximately 200.0*(1.0/200)*(199.0/200)**199 = 0.3688
(treating each number as being picked independently with probability
1/200). The probability that this happens for 20 such 200-number
intervals is 0.3688**20 = 2.2e-09.
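
For anyone who wants to re-run the arithmetic, here is a small sketch.
It assumes the binomial approximation (each number picked
independently with probability 1/200) rather than the exact
without-replacement figure, and it treats the 20 intervals as
independent; p_exactly_one is just a name I made up:

```ruby
# Probability that an interval of `width` consecutive numbers contains
# exactly one sampled value, if each number is picked independently with
# probability `rate` (binomial approximation, an assumption).
def p_exactly_one(width, rate)
  width * rate * (1.0 - rate)**(width - 1)
end

rate  = 5e6 / 1e9                 # 5e6 picks out of 1e9 numbers, i.e. 1/200
p_one = p_exactly_one(200, rate)  # one given interval
p_all = p_one**20                 # 20 intervals, treated as independent

printf("one interval: %.4f\n", p_one)
printf("20 intervals: %.1e\n", p_all)
```

It should reproduce the two figures above to within rounding.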