i've got 30 processes running on 30 machines, each running jobs taken from an
nfs-mounted queue.  recently i started seeing random core dumps from them.
i've isolated the bit of code that triggers the core dumps - it's this

   class JobRunner
#{{{
     attr :job
     attr :jid
     attr :cid
     attr :shell
     attr :command
     def initialize job
#{{{
       @job = job
       @jid = job['jid']
       @command = job['command']
       @shell = job['shell'] || 'bash'
       @r,@w = IO.pipe
       @cid =
         Util::fork do
           @w.close
           STDIN.reopen @r

            if $want_to_core_dump
              keep = [STDIN, STDOUT, STDERR, @r].map{|io| io.fileno}
              256.times do |fd|
                next if keep.include? fd
                begin
                  IO::new(fd).close
                rescue Errno::EINVAL, Errno::EBADF
                end
              end
            end

           if File::basename(@shell) == 'bash' || File::basename(@shell) == 'sh'
             exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '--login'
           else
             exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '-l'
           end
         end
       @r.close
#}}}
     end
     def run
#{{{
       @w.puts @command
       @w.close
#}}}
     end
#}}}
   end


now here's the tricky bit.  the core dump doesn't happen here - it happens at
some random time later, and then again sometimes it doesn't.  the context this
code executes in is complex, but here's the gist of it


   sqlite database transaction started - this opens some files like
   db-journal, etc.

   a job is selected from the database

     fork job runner - this closes the open files except stdin, stdout,
     stderr, and the comm pipe

   the job pid and other accounting is committed to the database
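the inheritance problem in that sequence can be demonstrated in a few lines -
a child forked while the parent has a file open really does hold that fd, so
a later unlink by the parent leaves nfs holding a .nfsXXXX sillyname until the
child lets go.  a minimal sketch (here __FILE__ just stands in for sqlite's
db-journal; the pipe is only there so the child can report back):

```ruby
# a child forked while a file is open inherits the parent's fd
f = File.open(__FILE__)          # stand-in for sqlite's db-journal
r, w = IO.pipe
child = fork do
  r.close
  # prove the child holds the inherited fd by stat'ing it
  w.puts IO.new(f.fileno).stat.ftype
  w.close
  exit! 0
end
w.close
puts r.read                      # => file
Process.wait(child)
f.close
```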


the reason i'm trying to close all the files in the first place is that the
parent eventually unlinks some of them while the child still has them open -
this causes nfs sillynames (.nfsxxxxxxxx files) to appear when running on nfs.
this causes no harm, as the child never uses these fds - but with 30 machines
i end up with 90 or more .nfsxxxxxxxx files lying around looking ugly.  these
eventually go away when the child exits, but some of these children run for 4
or 5 or 10 days, so the ugliness is constantly in my face - sometimes growing
to be quite large.
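as an aside, since the child execs a shell anyway, one alternative to closing
fds by hand in the child is to mark the parent's sqlite fds close-on-exec, so
the kernel drops them at exec time and the child never holds the unlinked
files at all.  a minimal sketch, assuming Fcntl is available on the platform
and using an ordinary File as a stand-in for the db-journal handle:

```ruby
require 'fcntl'

# any open file stands in for sqlite's db-journal handle here
f = File.open(__FILE__)

# set FD_CLOEXEC without clobbering any other fd flags
f.fcntl(Fcntl::F_SETFD, f.fcntl(Fcntl::F_GETFD) | Fcntl::FD_CLOEXEC)

# the kernel now closes this fd automatically when a child execs,
# so no .nfsXXXX sillyname outlives the exec
puts((f.fcntl(Fcntl::F_GETFD) & Fcntl::FD_CLOEXEC) != 0)   # => true
f.close
```

this also sidesteps the 0..255 loop entirely, since only the fds you've
explicitly marked are affected.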

back to the core dump...

basically, if i DO close all the filehandles i'll, maybe, core dump sometime
later IN THE PARENT.  if i do NOT close them the parent never core dumps.  the
core dumps are totally random and show nothing in common except one thing -
they all show a signal received in the stack trace - i'm guessing this is
SIGCHLD.  i have some signal handlers set up for stopping/restarting that look
exactly like this:


       trap('SIGHUP') do
         $signaled = $sighup = true
         warn{ "signal <SIGHUP>" }
       end
       trap('SIGTERM') do
         $signaled = $sigterm = true
         warn{ "signal <SIGTERM>" }
       end
       trap('SIGINT') do
         $signaled = $sigint = true
         warn{ "signal <SIGINT>" }
       end

in my event loop i obviously take the appropriate steps for the $sigXXX flags.
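the flag handling boils down to something like this sketch - the names and
the handle_signals helper are made up for illustration, not the actual rq
loop:

```ruby
# sketch only: turn the $sigXXX flags set by the trap handlers into
# one action per pass through the event loop
def handle_signals
  return :continue unless $signaled
  $signaled = false
  if $sigterm || $sigint
    :shutdown                 # stop picking up jobs and exit
  elsif $sighup
    $sighup = false
    :restart                  # re-exec / reload
  else
    :continue
  end
end

$signaled = $sigterm = true
puts handle_signals           # => shutdown
```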

as i said, though, i don't think these handlers are responsible, since they
never actually run - those signals are not being sent.  i DO fork for every
job, so that's why i'm guessing the signal is SIGCHLD.
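one way to make that theory visible is to reap children explicitly in a
SIGCHLD handler, so every delivery of the signal is accounted for.  a minimal
sketch ($reaped is a made-up name, not from rq):

```ruby
# reap every exited child inside the SIGCHLD handler, non-blocking
$reaped = []
trap('SIGCHLD') do
  begin
    # WNOHANG: collect each child that has already exited, never block
    while pid = Process.waitpid(-1, Process::WNOHANG)
      $reaped << pid
    end
  rescue Errno::ECHILD
    # no children left to wait for
  end
end

child = fork { exit! 0 }
sleep 0.1 until $reaped.include?(child)
puts "reaped #{child}"
```

if the parent still dumps core with a handler like this installed, the
problem is likely elsewhere.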

so - here's the question: what kind of badness could closing fds in the child
be causing in the PARENT?  i'm utterly confused at this point and don't really
know where to look next...  could this be a ruby bug, or am i just breaking
some unix law and getting bitten?

thanks for any advice.

kind regards.

-a
--
===============================================================================
| EMAIL   :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE   :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it. 
|   --Dogen
===============================================================================