On Thu, 18 May 2006, Ohad Lutzky wrote:

>> none of the db stuff is transactional, so it's quite possible for
>> strange things to happen.  for instance
>>
>> 1)
>>> delete_queue_entry queue_entry[:id]
>>   #
>>   # job removed.  process dies.  job lost.
>>   #
>
> I see. So I'd essentially have to remove the job from the queue (it's most
> definitely not queued anymore), check if it's running, and only if it is -
> add it to the running-jobs list? (And otherwise, into some sort of
> 'jobs-we-lost' list kept in the daemon)? Where do transactions come into
> play here?

deleting a job from the queue, forking the child, and adding the child's pid
to the running_jobs table need to be transaction protected.  problems:

  - you cannot fork with an open connection

      http://lists.mysql.com/mysql/1127

    you need a way to do this:

      - start a child process
      - get its pid
      - start a transaction
        - remove job from queue
        - add pid to running_jobs
        - start job running
      - commit transaction

    ruby queue does exactly this to avoid the db/fork issue by setting up a
    drb process manager.  the process manager starts a pipe to a bash login
    shell (so the user gets a nice environment) and returns the pid.  during
    the transaction the command to be run is sent down the pipe.  in this way
    we can get the pid of the child and yet not start it 'running'.  i think
    this is the only way to avoid forking in the transaction but, if you do
    not use transactions, you'll always have border problems.

  - your reap model is async so, in addition to the above, we have to be
    prepared for __any__ transaction to be interrupted by a signal handler.
    this is, imho, impossible for all but the most gifted of programmers and
    not worth the hassle.  i know i could not make this model work correctly.

fwiw my drb process manager in ruby queue is only about 100 lines of code and
completely eliminates the above issue.
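
the 'get the pid first, start the job later' trick can be sketched roughly as
below.  this is only a minimal illustration of the idea, not ruby queue's
actual code: the class name IdleChild and the method run_and_wait are made up
for the example, there is no drb server involved, and running and waiting are
folded into one call to keep it short.

```ruby
# Hypothetical sketch (names invented for illustration, not ruby queue's API).
class IdleChild
  attr_reader :pid

  # Start the shell right away.  It blocks reading its stdin, so we know
  # the child's pid even though no job is running yet.
  def initialize(shell = "bash --login")
    @pipe = IO.popen(shell, "w")
    @pid  = @pipe.pid
  end

  # Send the command down the pipe, then close the pipe.  The shell runs
  # the command, sees end of input and exits with the command's status.
  # Closing a popen'd IO waits for the child and sets $?.
  def run_and_wait(command)
    @pipe.puts command
    @pipe.close
    $?
  end
end
```

the pid is available between IdleChild.new and run_and_wait, which is the
window where it could be recorded in the db inside a transaction before the
job actually starts.
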
the process manager is used like this (pseudo code)

  job_manager = JobManager.new # drb server running on localhost in another process

  loop do
    while number_of_running_jobs < max_number_of_running_jobs
      child = job_manager.new_child
      db.transaction do
        remove_from_queue child.pid
        child.run command
        add_to_running_jobs child.pid
      end
    end

    status = job_manager.wait -1 # wait for anyone to finish

    db.transaction do
      job_is_finished status.pid, status.exit_status
    end
  end

note that all the forking and waiting occurs in another process!  this
__hugely__ simplifies the process of making the job queue service correct and
robust.

> However, what I'm more seriously concerned about is the apparent
> 'reparenting' of the job. I'll try and be a bit more clear about what
> happened: Two jobs are currently running on the server, and this is the
> output of pstree on the pid of the daemon:
>
> $ pstree -p 24788
> ruby(24788)─┬─ruby(4854)───bash(4855)───MATLAB(4856)───matlab_helper(4936)
>             └─ruby(24859)───bash(24861)───MATLAB(24862)───matlab_helper(24929)
>
> So 24788 is my daemon, 4854 and 24859 are the forks (those are the PIDs) in
> the database, and 4856 and 24862 are eating up my CPU :)
>
> Now, here's an analog of how it was in the broken state:
>
> bash(4855)───MATLAB(4856)───matlab_helper(4936)
>
> ruby(24859)───bash(24861)───MATLAB(24862)───matlab_helper(24929)
>
> Meaning - parent dead (but its orphans live, now reparented to init for some
> reason), and one of the bashes got reparented to init. Kinda looks like
> matlab was run by a user, but it wasn't (verified).

it looks like your daemon aborted and was restarted and, because it's munged
the nice child handling of ruby, children were orphaned.  when children are
orphaned they become children of init (i guess you knew that...).  anyhow,
matlab and idl tweak signals themselves, so i guess it can become very ugly to
manage children with signals.
ruby queue handles this by making sure all children are collected in the
normal way or, if that fails, killing them with -9 and then collecting them on
exit - in other words it simply refuses to exit until all children are
collected.  to me this really just looks like the daemon was killed, didn't
collect every child on exit, and therefore orphans were picked up by init.

> Unless I'm missing something, this doesn't have to do with my lack of
> transactions, or things going out of sync... any ideas?

yes - it's a separate issue.  regardless - making an external process manager
will greatly simplify the whole program wrt both transactions and nice child
waiting behaviour.

regards.

-a
--
be kind whenever possible... it is always possible.
- h.h. the 14th dalai lama