Ara --
Random thoughts:
* It could be a race condition of some sort
* It could be that closing the file in the child closes it for the
parent even though closing it for the parent does not close it
for the child
* It could be that you omitted a file from your keep list that the
child actually needs. It tries to access it, goes boom,...
* Can you make it happen in a simplified situation (e.g. with just
one child)?
* Is it possible to make NFS put the ugly files somewhere you
can't see them? I know much of the software I run has lots of
ugly files (e.g. the web browser cache), but they don't bother
me because I don't look at them.
* Instead of specifying the files you want to keep (STDIN, etc.)
could you list the ones you want to close, and narrow the
problem down that way?
I don't know if any of these will help, but I can't see that they
could hurt (I used to say that "ideas can't hurt you" but I'm older
now).
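
For the last idea, here is a rough sketch (mine, not taken from your
code) of a helper that reports which descriptors the child would end up
closing, so you can close them a few at a time and see which one
matters. The name fds_to_close is made up, and it assumes a Ruby recent
enough that IO.for_fd accepts autoclose: false:

```ruby
# Hypothetical helper (not part of the original JobRunner): return the
# descriptors in `candidates` that are currently open and not in `keep`.
# Nothing is closed here; the IO wrappers are created with
# autoclose: false so garbage collection will not close them either.
def fds_to_close(candidates, keep)
  candidates.select do |fd|
    next false if keep.include?(fd)
    begin
      IO.for_fd(fd, autoclose: false)
      true
    rescue Errno::EBADF, Errno::EINVAL, ArgumentError
      # EBADF/EINVAL: the descriptor is not open.
      # ArgumentError: the interpreter reserves it for internal use.
      false
    end
  end
end

# Illustration: a pipe whose read end is kept and whose write end is not.
r, w = IO.pipe
keep = [STDIN, STDOUT, STDERR, r].map(&:fileno)
targets = fds_to_close(0...256, keep)
STDERR.puts "would close: #{targets.inspect}"
```

In the child you could then close only a slice of that list on each run
and see which descriptor the parent turns out to care about.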
-- MarkusQ
On Thu, 2004-09-16 at 11:54, Ara.T.Howard wrote:
> i've got 30 processes running on 30 machines running jobs taken from an nfs mounted
> queue. recently i started seeing random core dumps from them. i've isolated
> the bit of code that causes the core dumps to occur - it's this
>
> class JobRunner
> #{{{
>   attr :job
>   attr :jid
>   attr :cid
>   attr :shell
>   attr :command
>   def initialize job
>   #{{{
>     @job = job
>     @jid = job['jid']
>     @command = job['command']
>     @shell = job['shell'] || 'bash'
>     @r,@w = IO.pipe
>     @cid =
>       Util::fork do
>         @w.close
>         STDIN.reopen @r
>
>         if $want_to_core_dump
>
>           keep = [STDIN, STDOUT, STDERR, @r].map{|io| io.fileno}
>           256.times do |fd|
>             next if keep.include? fd
>             begin
>               IO::new(fd).close
>             rescue Errno::EINVAL, Errno::EBADF
>             end
>           end
>
>         end
>
>         if File::basename(@shell) == 'bash' || File::basename(@shell) == 'sh'
>           exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '--login'
>         else
>           exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '-l'
>         end
>       end
>     @r.close
>   #}}}
>   end
>   def run
>   #{{{
>     @w.puts @command
>     @w.close
>   #}}}
>   end
> #}}}
> end
>
>
> now here's the tricky bit. the core dump doesn't happen here - it happens at
> some random time later, and then again sometimes it doesn't. the context this
> code executes in is complex, but here's the gist of it
>
>
> sqlite database transaction started - this opens some files like db-journal,
> etc.
>
> a job is selected from database
>
> fork job runner - this closes open files except stdin, stdout, stderr, and
> com pipe
>
> the job pid and other accounting is committed to database
>
>
> the reason i'm trying to close all the files in the first place is because the
> parent eventually unlinks some of them while the child still has them open -
> this causes nfs sillynames to appear when running on nfs (.nfsxxxxxxxxx).
> this causes no harm as the child never uses these fds - but with 30 machines i
> end up with 90 or more .nfsxxxxxxx files lying around looking ugly. these
> eventually go away when the child exits but some of these children run for 4
> or 5 or 10 days so the ugliness is constantly in my face - sometimes growing
> to be quite large.
>
> back to the core dump...
>
> basically if i DO close all the filehandles i'll, maybe, core dump sometime
> later IN THE PARENT. if i do NOT close them the parent never core dumps. the
> core dumps are totally random and show nothing in common except one thing -
> they all show a signal received in the stack trace - i'm guessing this is
> SIGCHLD. i have some signal handlers set up for stopping/restarting that look
> exactly like this:
>
>
> trap('SIGHUP') do
>   $signaled = $sighup = true
>   warn{ "signal <SIGHUP>" }
> end
> trap('SIGTERM') do
>   $signaled = $sigterm = true
>   warn{ "signal <SIGTERM>" }
> end
> trap('SIGINT') do
>   $signaled = $sigint = true
>   warn{ "signal <SIGINT>" }
> end
>
> in my event loop i obviously take appropriate steps for the $sigXXX.
>
> as i said, however, i don't think these are responsible since they don't
> actually get run as these signals are not being sent. i DO fork for every job
> though so that's why i'm guessing the signal is SIGCHLD.
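
One way to check the SIGCHLD guess is to record every signal that
actually arrives rather than inferring it from the stack trace. A
minimal sketch, assuming nothing else in the program traps these
signals; $signals_seen is a hypothetical name, and the forked child
here simply exits instead of running a job:

```ruby
# Hypothetical global used only for this illustration.
$signals_seen = []

# Record each signal by name. The handler only appends to an array and
# leaves any real work to the event loop, like the flag-setting handlers.
%w[HUP TERM INT CHLD].each do |name|
  trap(name) { $signals_seen << name }
end

# Fork one short-lived child and reap it, as the parent does per job.
pid = fork { exit!(0) }
Process.wait(pid)

# Give the interpreter a moment to run the handler, then report.
sleep 0.1
STDERR.puts "signals seen: #{$signals_seen.inspect}"
```

If CHLD shows up in the log right before a crash, that would support
the guess; if it never does, the signal in the stack trace is something
else.
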
>
> so - here's the question: what kind of badness could closing fd's be causing
> in the PARENT? i'm utterly confused at this point and don't really know
> where to look next... could this be a ruby bug or am i just breaking some
> unix law and getting bitten?
>
> thanks for any advice.
>
> kind regards.
>
> -a
> --
> ===============================================================================
> | EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
> | PHONE :: 303.497.6469
> | A flower falls, even though we love it;
> | and a weed grows, even though we do not love it.
> | --Dogen
> ===============================================================================