On Wed, 17 May 2006, Ohad Lutzky wrote:

> I'm a Unix admin at a research lab in the technion, and use (and loooove)
> Ruby for all kinds of useful applications I wrote for use in the lab. One of
> them is the Matlab Application Server, a queue/batch manager for Matlab
> jobs. Old version was perl, text files, and one file per job - new version
> is Rails for the UI (which is on another machine), vanilla Ruby for the
> daemon, and MySQL to communicate between them.
>
> Usually it works spectacularly, and people don't have to lock down the
> Windows workstations overnight (or over many nights) so they finish their
> jobs. The queue is also rather fair - 2 simultaneous jobs, 1 per each user.
>
> However, today something strange happened. The server was under heavy load
> and using a lot of swap, as usual when two Matlab jobs are running. Two jobs
> were indeed running: pids A and B (A < B). However, they were both by the
> same user (which shouldn't happen). Furthermore, only B was a son of the
> daemon, whereas A was a son of init.

so your daemon apparently exited, and it doesn't prevent itself from running
twice. it should obtain a lockfile or lock itself using posixlock (lockfile or
posixlock gems on rubyforge) to prevent two instances from ever running at the
same time. how is your daemon started/kept alive?

> The log file said that earlier, a son C died which wasn't in the SQL table
> of running jobs (dying children are removed from there, because it means
> they finished running). And weirdest, the SQL table of running jobs
> contained job D, which wasn't run (and belonged to a different user).

none of the db stuff is transactional, so it's quite possible for strange
things to happen. for instance

1)
> delete_queue_entry queue_entry[:id]
#
# job removed. process dies. job lost.
#

2)
>
> child_pid = fork do
>   run_worker_process queue_entry[:user][:username],
>                      queue_entry[:directory]
> end
#
# job dies instantly - before adding it to queue - sigchld is caught,
# handler triggers, db does not yet contain child_pid.
#
> add_running_job_entry queue_entry[:sent_at],
>                       queue_entry[:user],
>                       queue_entry[:directory],
>                       child_pid

etc.

it would be very hard to refactor this code to use transactions since you are
forking and forking is not supported with an open db handle (all writes, for
instance, get flushed on child/parent exit). in theory, though, you need code
like this all over the place

  db.transaction do
    child_pid = start_child job
    add_running_job_entry child_pid, job
  end

and then the signal handler must be written in such a way that it cannot start
another transaction - this way you'll never get sigchld and find that the
child's pid has not yet been entered. this gets really tricky. the way i did it
for ruby queue was to set up a drb daemon that does all the forking/waiting for
me. this way i can operate in the normal synchronous fashion instead of in
async mode using signal handlers. you can read about rq here

  http://www.linuxjournal.com/article/7922
  http://raa.ruby-lang.org/project/rq/

in particular i talk about a db/forking issue about halfway down the first
article - it's relevant if you have time to read it.

i'm not sure what your setup is, but i suspect rq could serve as your queue
manager quite well - if you aren't distributing jobs to many nodes and run it
only on one node it's a queue manager instead of a cluster manager.

kind regards.

-a
--
be kind whenever possible... it is always possible.
- h.h. the 14th dalai lama
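
for illustration, here is a rough sketch of the synchronous fork/record/wait idea mentioned above. it is not how rq does it (rq hands the forking to a drb daemon); it is a simplified single-process stand-in. JobRunner, its in-memory hash (standing in for the sql table of running jobs), and the method names are all hypothetical. the point is only that with no sigchld handler installed, a child cannot be reaped before its pid has been recorded.

```ruby
# Sketch only: JobRunner and its in-memory table are hypothetical stand-ins
# for the daemon and its SQL table of running jobs.
class JobRunner
  attr_reader :running

  def initialize(max_running: 2)
    @max_running = max_running
    @running = {} # pid => job name
  end

  # Fork a child and record its pid. No signal handler is installed, so
  # nothing can reap the child before its entry exists, even if the child
  # exits immediately.
  def start(name, work)
    pid = fork(&work)
    @running[pid] = name
    pid
  end

  # Block until any child of this process exits, then drop its entry.
  # Returns [job name, exit status].
  def reap_one
    pid, status = Process.wait2
    [@running.delete(pid), status.exitstatus]
  end

  # Run every job, keeping at most @max_running children alive at once.
  # jobs is an array of [name, callable] pairs.
  def run_all(jobs)
    pending = jobs.dup
    finished = []
    until pending.empty? && @running.empty?
      start(*pending.shift) while @running.size < @max_running && !pending.empty?
      finished << reap_one
    end
    finished
  end
end
```

usage would look something like `JobRunner.new.run_all([["a", proc { exit!(0) }]])`. note that Process.wait2 with no arguments reaps any child of the process, so this sketch assumes the runner is the only thing forking. in a real daemon the hash updates would be db writes, and the db handle would still need to be kept out of the children.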