Olaf Klischat <klischat / cs.tu-berlin.de> writes:

> Paolo Capriotti <p.capriotti / gmail.com> writes:
>
>> It's wrong anyways. You both are assuming that all these events are
>> independent, and multiply their probability, but they are not. One
>> neat way to see this is remarking that you didn't use the fact that
>> the intervals are 5000000. If there were only one such interval (that
>> is, if you were doing an 1 out of 200 sampling), the probability would
>> have been obviously 1. If your formula were correct, I could apply it
>> to this case, obtaining a contradiction.
>
> Yeah. You have a point there ;).
>
> I still think that 0.3688 and 0.3688**20 are not far off because the
> interval size (200) and number (20) is much smaller than MEMBERS and
> LIMIT, so the occurance of an event like "the sampling contains one
> number from a given (preselected) interval" does not influence the
> probability of another such event (for another, disjunct interval)
> very much. Right?

I've pondered some more and I think, considering "event dependency"
(i.e. changed probabilities once you've selected some numbers), the
actual probability that exactly one number occurs in a given
200-numbers interval isn't (199/200)**199=0.36880183, but
(1-4_999_999/999_999_999)*
(1-4_999_999/999_999_998)*
(1-4_999_999/999_999_997)*
..
..
.. *
(1-4_999_999/999_999_801)
=  0.368801868

(so the approximation has an error of only 1e-7), or in the general
case:

# probability that exactly one number occurs in a given interval of
# size intervalsize in a sampling with parameters members and limit
def p(members,limit,intervalsize)
  (1...intervalsize).inject(intervalsize * members/limit){|x,i|
    x * (1-(members-1)/(limit-i))
  }
end


I've now written a "check" script that gets a sample on stdin and
determines this value (all consecutive intervals of size limit/members
are checked), and compares it to the theoretically expected one:


--------------------------------
#!/usr/bin/ruby

# probability that exactly one number occurs in a given interval of
# size intervalsize in a sampling with parameters members and limit
def p(members,limit,intervalsize)
  (1...intervalsize).inject(intervalsize * members/limit){|x,i|
    x * (1-(members-1)/(limit-i))
  }
end

members,limit = ARGV.map{|x|x.to_i}
warn "#{limit} mod #{members} != 0 -- inaccurate results" if limit%members!=0

interval = limit/members

puts "expected: #{p(members.to_f,limit.to_f,interval)}"

found=0
begin
  last=0; count=0;
  loop do
    num=STDIN.readline.to_i
    new=num/interval
    if new==last then
      count+=1
    else
      last=new
      found+=1 if count==1
      count=1
    end
  end
rescue EOFError
end
found+=1 if count==1

puts "actual:   #{found.to_f/(limit/interval)}"
--------------------------------



For the big_sample.txt here, this gives:

olaf@tack:~/doc/mydocs/ruby/quiz$ ./check 5_000_000 1_000_000_000 <big_sample.txt
expected: 0.368801867760757
actual:   0.368865
olaf@tack:~/doc/mydocs/ruby/quiz$ 


Other runs:

olaf@tack:~/doc/mydocs/ruby/quiz$ ./sample 5000 70000 | ./check 5000 70000
expected: 0.381630036177205
actual:   0.386
olaf@tack:~/doc/mydocs/ruby/quiz$ 

olaf@tack:~/doc/mydocs/ruby/quiz$ ./sample 1 100 | ./check 1 100
expected: 1.0
actual:   1.0
olaf@tack:~/doc/mydocs/ruby/quiz$ 


The "cheated" samples that have one value in each interval will always
yield actual: 1.0.

All things considered, this seems to be quite a good indicator to tell
the really random solutions from the not-so-random ones :)

Olaf