Ara --

    Random thoughts:

      * It could be a race condition of some sort
      * It could be that closing the file in the child closes it for the
        parent even though closing it for the parent does not close it
        for the child
      * It could be that you omitted a file from your keep list that the
        child actually needs.  It tries to access it, goes boom...
      * Can you make it happen in a simplified situation (e.g. one
        child, etc.)?
      * Is it possible to make NFS put the ugly files somewhere you
        can't see them?  I know much of the software I run has lots of
        ugly files (e.g. the web browser cache), but they don't bother
        me because I don't look at them.
      * Instead of specifying the files you want to keep (STDIN, etc)
        could you list the ones you want to close, and narrow the
        problem down that way?
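
    A rough sketch of how the "one child" and "list the ones to close"
ideas might be combined into a small test harness.  The names close_fds
and run_one_child are made up for illustration (they are not from your
code), Util::fork is replaced by plain fork, and the sqlite and NFS parts
are left out, so treat it as a starting point rather than a reproduction
of your setup:

```ruby
# Hypothetical helper (not from the original post): close only the listed
# descriptor numbers and return the ones that were actually open.
def close_fds(fds)
  fds.select do |fd|
    begin
      IO.for_fd(fd).close
      true
    rescue Errno::EINVAL, Errno::EBADF, ArgumentError
      false # not open, or reserved by the interpreter
    end
  end
end

# Hypothetical single-child harness: one pipe, one fork, one shell.
# The child closes only the descriptors named in `close`, so the list
# can be grown one entry at a time to narrow the problem down.
def run_one_child(command, shell: 'sh', close: [])
  r, w = IO.pipe
  pid = fork do
    w.close
    STDIN.reopen(r) # the shell reads its command from the pipe
    r.close
    close_fds(close)
    exec(shell)
  end
  r.close
  w.puts command
  w.close
  Process.wait(pid)
  $?.exitstatus
end
```

    Calling run_one_child('exit 7') should hand back the shell's exit
status; passing close: [some_fd] with one suspect descriptor at a time
might show which close, if any, upsets the parent.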

    I don't know if any of these will help, but I can't see that they
could hurt (I used to say that "ideas can't hurt you" but I'm older
now).

      -- MarkusQ



On Thu, 2004-09-16 at 11:54, Ara.T.Howard wrote:
> i've got 30 processes running on 30 machines running jobs taken from an nfs mounted
> queue.  recently i started seeing random core dumps from them.  i've isolated
> the bit of code that causes the core dumps to occur - it's this
> 
>    class JobRunner
> #{{{
>      attr :job
>      attr :jid
>      attr :cid
>      attr :shell
>      attr :command
>      def initialize job
> #{{{
>        @job = job
>        @jid = job['jid']
>        @command = job['command']
>        @shell = job['shell'] || 'bash'
>        @r,@w = IO.pipe
>        @cid =
>          Util::fork do
>            @w.close
>            STDIN.reopen @r
> 
>            if $want_to_core_dump
>              keep = [STDIN, STDOUT, STDERR, @r].map{|io| io.fileno}
>              256.times do |fd|
>                next if keep.include? fd
>                begin
>                  IO::new(fd).close
>                rescue Errno::EINVAL, Errno::EBADF
>                end
>              end
>            end
> 
>            if File::basename(@shell) == 'bash' || File::basename(@shell) == 'sh'
>              exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '--login'
>            else
>              exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '-l'
>            end
>          end
>        @r.close
> #}}}
>      end
>      def run
> #{{{
>        @w.puts @command
>        @w.close
> #}}}
>      end
> #}}}
>    end
> 
> 
> now here's the tricky bit.  the core dump doesn't happen here - it happens at
> some random time later, and then again sometimes it doesn't.  the context this
> code executes in is complex, but here's the gist of it
> 
> 
>    sqlite database transaction started - this opens some files like db-journal,
>    etc.
> 
>    a job is selected from database
> 
>      fork job runner - this closes open files except stdin, stdout, stderr, and
>      com pipe
> 
>    the job pid and other accounting is committed to database
> 
> 
> the reason i'm trying to close all the files in the first place is because the
> parent eventually unlinks some of them while the child still has them open -
> this causes nfs sillynames to appear when running on nfs (.nfsxxxxxxxxx).
> this causes no harm as the child never uses these fds - but with 30 machines i
> end up with 90 or more .nfsxxxxxxx files lying around looking ugly.  these
> eventually go away when the child exits but some of these children run for 4
> or 5 or 10 days so the ugliness is constantly in my face - sometimes growing
> to be quite large.
> 
> back to the core dump...
> 
> basically if i DO close all the filehandles i'll, maybe, core dump sometime
> later IN THE PARENT.  if i do NOT close them the parent never core dumps.  the
> core dumps are totally random and show nothing in common except one thing -
> they all show a signal received in the stack trace - i'm guessing this is
> SIGCHLD.  i have some signal handlers setup for stopping/restarting that look
> exactly like this:
> 
> 
>        trap('SIGHUP') do
>          $signaled = $sighup = true
>          warn{ "signal <SIGHUP>" }
>        end
>        trap('SIGTERM') do
>          $signaled = $sigterm = true
>          warn{ "signal <SIGTERM>" }
>        end
>        trap('SIGINT') do
>          $signaled = $sigint = true
>          warn{ "signal <SIGINT>" }
>        end
> 
> in my event loop i obviously take appropriate steps for the $sigXXX.
> 
> as i said, however, i don't think these are responsible since they don't
> actually get run as these signals are not being sent.  i DO fork for every job
> though so that's why i'm guessing the signal is SIGCHLD.
> 
> so - here's the question:  what kind of badness could closing fd's be causing
> in the PARENT?   i'm utterly confused at this point and don't really know
> where to look next...  could this be a ruby bug or am i just breaking some
> unix law and getting bitten?
> 
> thanks for any advice.
> 
> kind regards.
> 
> -a
> --
> ===============================================================================
> | EMAIL   :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
> | PHONE   :: 303.497.6469
> | A flower falls, even though we love it;
> | and a weed grows, even though we do not love it. 
> |   --Dogen
> ===============================================================================