Bug #1993: IO.select fails when called in multiple threads on 1.8.7p174
http://redmine.ruby-lang.org/issues/show/1993

Author: Daniel Azuma
Status: Open, Priority: Normal
Category: core
ruby -v: ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-darwin9.8.0]

IO.select (Kernel#select) fails when run on different sets of IO objects in different threads. This affects release versions 1.8.7p160, 1.8.7p173, and 1.8.7p174. It does NOT seem to affect the recent versions of 1.9.1 that I have tested, and it does NOT affect release version 1.8.7p72. I have not tested any 1.8.6 versions. I tested the repro steps mostly on Mac OS X 10.5.8 on an Intel-based MacBook Pro, but I have seen similar behavior on a recent Fedora Linux i686 system.

To reproduce, run the following script. (Replace the two filenames with distinct known readable files on your system.)

 # Begin code
 
 FILENAME1 = "Rakefile"
 FILENAME2 = "README"
 TWO_THREADS = true
 
 f1 = File.open(FILENAME1)
 f2 = File.open(FILENAME2)
 t1 = Thread.new do
   c1 = 0
   loop do
     c1 += 1
     s1 = IO.select([f1], nil, nil, 0)
     n1 = s1 ? s1.first.size : 0
     puts "t1: num=#{n1} iter=#{c1}"
   end
 end
 t2 = Thread.new do
   c2 = 0
   loop do
     c2 += 1
     s2 = IO.select([f2], nil, nil, 0)
     n2 = s2 ? s2.first.size : 0
     puts "t2: num=#{n2} iter=#{c2}"
   end
 end if TWO_THREADS
 t1.join
 
 # End code

The code simply calls IO.select repeatedly on IO objects known to have readable bytes, in either one thread or two. When run in one thread (TWO_THREADS = false), it behaves as expected, printing "num=1" to indicate that select has detected the readable stream. However, when run in two threads (TWO_THREADS = true), both threads print "num=0", indicating that neither thread detects readable data on its stream.
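
For a quicker pass/fail check, a bounded variant of the script can tally results instead of printing forever. This is only a sketch: the helper names (count_select_hits, run_select_check) and the use of Tempfile to create known-readable files are assumptions for illustration, not part of the original repro. On an unaffected interpreter, every iteration should count as a hit.

```ruby
require "tempfile"

# Hypothetical helper: polls each IO from its own thread and returns
# the number of iterations in which IO.select reported it readable.
def count_select_hits(ios, iterations)
  threads = ios.map do |io|
    Thread.new do
      hits = 0
      iterations.times do
        ready = IO.select([io], nil, nil, 0)
        hits += 1 if ready && !ready.first.empty?
      end
      hits
    end
  end
  threads.map { |t| t.value }
end

# Hypothetical driver: creates two distinct readable temp files
# (an assumption, standing in for FILENAME1 and FILENAME2) and
# polls them from two threads.
def run_select_check(iterations = 1000)
  files = Array.new(2) do |i|
    tf = Tempfile.new("select_check_#{i}")
    tf.write("readable bytes\n")
    tf.flush
    tf.rewind
    tf
  end
  count_select_hits(files, iterations)
ensure
  files.each { |tf| tf.close! } if files
end
```

Calling run_select_check(1000) returns one hit count per thread; any count below the iteration count would suggest that select results are being dropped.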

The relevant code appears to be the function rb_thread_schedule() in eval.c, and I believe this issue is related to revision 21165. I haven't been able to untangle everything in this code yet, but here's what I've been able to determine:

* The code that collects file descriptors for the system select() call (lines 11063-11073 of the 1.8.7 branch as of revision 24104) DOES NOT RUN for a given thread unless the thread has THREAD_STOPPED status at that time (because of line 11051). Therefore, any thread with THREAD_RUNNABLE status at that time is effectively shut out of receiving select() results unless its fd list overlaps that of another thread.

* Given the sample code above, the next qualifying thread (that is, the thread that will later be assigned to the "next" variable) tends to be in the THREAD_RUNNABLE state at this time. Since such threads are shut out of the select() call, they can never be assigned to "th_found" (see lines 11208-11212). As a result, "th_found" is assigned a later thread in the list rather than, as appears to be the intent, the first qualifying thread in the list (note the break on line 11214).

* Unfortunately, this conflicts with lines 11230ff. Those lines, which choose the "next" thread, always prefer the first thread given equal priority (line 11231). Since "th_found" tends not to be the first qualifying thread, the conditions on lines 11231 and 11232 are never both true; as a result, th->select_value is never set, and the select calls never succeed.

* The code appeared to work pre-revision-21165 (e.g. 1.8.7p72) because that version of the code set select_value on every qualifying thread, whereas the current code sets it on only one thread.

Here's where I'm unsure about how to proceed with a patch. I would like to move lines 11058 through 11073 to immediately above line 11051. This would add each thread's file descriptors to the select call, regardless of whether the thread has status THREAD_STOPPED or THREAD_RUNNABLE. This change appears to fix the test case above. And I believe it is the correct behavior; however, I'm new to this part of the code and do not have enough understanding of the intent of thread->status to assert that this is correct. I was hoping someone with more knowledge of this area could use this analysis as a starting point.
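
To make the reasoning above easier to follow, here is a toy model of a single scheduling pass as I understand it from the analysis. It is not the eval.c code: ToyThread, toy_schedule, and the collect_runnable flag are hypothetical names, priorities are assumed equal, and "next" is simplified to the first thread in the list. The flag stands in for the proposed change of collecting fds regardless of thread status.

```ruby
# Toy model only: ToyThread and toy_schedule are hypothetical names,
# not taken from eval.c. All threads are assumed to have equal priority.
ToyThread = Struct.new(:name, :status, :fd)

def toy_schedule(threads, ready_fds, collect_runnable)
  # Only stopped threads contribute fds, unless the proposed change
  # (collect_runnable) is switched on.
  watched = threads.select { |t| collect_runnable || t.status == :stopped }

  # th_found: the first watched thread whose fd came back ready.
  th_found = watched.find { |t| ready_fds.include?(t.fd) }

  # next: with equal priorities, the first thread in the list wins
  # (a simplification of the real selection logic).
  next_thread = threads.first

  # The select result is delivered only when both choices agree.
  delivered = !th_found.nil? && th_found.equal?(next_thread)

  { :th_found  => th_found && th_found.name,
    :next      => next_thread && next_thread.name,
    :delivered => delivered }
end

threads = [ToyThread.new("t1", :runnable, 3),
           ToyThread.new("t2", :stopped, 4)]
ready = [3, 4]

p toy_schedule(threads, ready, false)  # current behavior as described
p toy_schedule(threads, ready, true)   # with fds collected for all threads
```

In the first call the runnable thread t1 is never watched, so th_found is t2 while next is t1, and nothing is delivered. In the second call both choices land on t1. This matches the behavior I described, but only within the limits of the model.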

