On Wed, 17 May 2006, Ohad Lutzky wrote:

> I'm a Unix admin at a research lab in the technion, and use (and loooove)
> Ruby for all kinds of useful applications I wrote for use in the lab. One of
> them is the Matlab Application Server, a queue/batch manager for Matlab
> jobs.  Old version was perl, text files, and one file per job - new version
> is Rails for the UI (which is on another machine), vanilla Ruby for the
> daemon, and MySQL to communicate between them.
>
> Usually it works spectacularly, and people don't have to lock down the
> Windows workstations overnight (or over many nights) so they finish their
> jobs.  The queue is also rather fair - 2 simultaneous jobs, 1 per each user.
>
> However, today something strange happened. The server was under heavy load
> and using a lot of swap, as usual when two Matlab jobs are running. Two jobs
> were indeed running: pids A and B (A < B). However, they were both by the
> same user (which shouldn't happen). Furthermore, only B was a son of the
> daemon, whereas A was a son of init.

so your daemon apparently exited (which is why A got reparented to init) and a
second copy was started - nothing prevents it from running twice.  it should
obtain a lockfile or lock itself using posixlock (lockfile or posixlock gems on
rubyforge) to prevent two instances from ever running at the same time.  how is
your daemon started/kept alive?
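
to make the single-instance idea concrete, here is a minimal sketch using only
File#flock from the standard library rather than the lockfile or posixlock
gems.  the lock path and the method name are made up for illustration, and
flock is generally not reliable over nfs, so this assumes the lock file lives
on a local disk.

```ruby
require "tmpdir"

# hypothetical lock path - any path on a local (non-nfs) filesystem will do
LOCK_PATH = File.join(Dir.tmpdir, "matlab_daemon.lock")

# returns the open lock file if we got the lock, nil if someone else holds it.
# the lock lasts as long as the returned file stays open, and the kernel drops
# it when the process exits, so a crash cannot leave a stale lock behind.
def acquire_single_instance_lock(path)
  file = File.open(path, File::RDWR | File::CREAT, 0644)
  if file.flock(File::LOCK_EX | File::LOCK_NB)   # false when already locked
    file.truncate(0)
    file.write("#{Process.pid}\n")               # pid is informational only
    file.flush
    file
  else
    file.close
    nil
  end
end

# at daemon startup (commented out so the sketch has no side effects):
#   lock = acquire_single_instance_lock(LOCK_PATH) or abort "already running"
```

one thing to watch: forked children inherit the open descriptor, and a flock
is shared with them, so it is probably wise to close the lock file inside the
child before it runs the job.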

> The log file said that earlier, a son C died which wasn't in the SQL table
> of running jobs (dying children are removed from there, because it means
> they finished running).  And weirdest, the SQL table of running jobs
> contained job D, which wasn't run (and belonged to a different user).

none of the db stuff is transactional, so it's quite possible for strange
things to happen.  for instance

1)
>              delete_queue_entry queue_entry[:id]
   #
   # job removed.  process dies.  job lost.
   #


2)
>
>              child_pid = fork do
>                run_worker_process queue_entry[:user][:username],
>                                   queue_entry[:directory]
>              end
   #
   # child dies instantly - before it is added to the running jobs table -
   # sigchld is caught, handler triggers, db does not yet contain child_pid.
   #
>              add_running_job_entry queue_entry[:sent_at],
>                                    queue_entry[:user],
>                                    queue_entry[:directory],
>                                    child_pid


etc.

it would be very hard to refactor this code to use transactions since you are
forking and forking is not supported with an open db handle (all writes, for
instance, get flushed on child/parent exit).  in theory, though, you need code
like this all over the place

   db.transaction do
     child_pid = start_child job
     add_running_job_entry child_pid, job
   end

and then the signal handler must be written in such a way that it cannot start
another transaction - this way you'll never get sigchld and find that the
child's pid has not been entered yet.  this gets really tricky.  the way i did
it for ruby queue was to set up a drb daemon that does all the forking/waiting
for me.  this way i can operate in the normal synchronous fashion instead of in
async mode using signal handlers.  you can read about rq here

   http://www.linuxjournal.com/article/7922
   http://raa.ruby-lang.org/project/rq/

in particular i talk about a db/forking issue about halfway down the first
article - it's relevant if you have time to read it.
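
to illustrate what i mean by synchronous, here is a rough sketch that is not
rq's actual code and leaves out drb entirely.  it only shows the ordering:
record the child's pid first, then reap with a blocking Process.wait2, with no
sigchld handler anywhere.  the `running` hash stands in for the sql table of
running jobs, and the method name and the trivial worker are invented for the
example.

```ruby
# runs each job in a forked child and reaps the children synchronously.
# `running` is a hypothetical in-memory stand-in for the running jobs table.
def run_jobs_synchronously(jobs)
  running  = {}
  finished = []

  jobs.each do |job|
    pid = fork { exit 0 }    # placeholder worker that exits immediately
    running[pid] = job       # recorded before anything can reap the child
  end

  until running.empty?
    pid, status = Process.wait2      # blocks until some child exits
    job = running.delete(pid)        # the pid is always known by now
    finished << [job, status.exitstatus] if job
  end

  finished
end
```

a child that dies instantly just sits as a zombie until wait2 collects it, so
there is no window where a death is reported for a pid that has not been
recorded yet.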

i'm not sure what your setup is, but i suspect rq could serve as your queue
manager quite well - if you aren't distributing jobs to many nodes and run it
only on one node it's a queue manager instead of a cluster manager.

kind regards.

-a
-- 
be kind whenever possible... it is always possible.
- h.h. the 14th dalai lama