On Thu, 18 May 2006, Ohad Lutzky wrote:

>> none of the db stuff is transactional, so it's quite possible for
>> strange
>> things to happen.  for instance
>>
>> 1)
>>>              delete_queue_entry queue_entry[:id]
>>    #
>>    # job removed.  process dies.  job lost.
>>    #
>
> I see. So I'd essentially have to remove the job from the queue (it's most
> definitely not queued anymore), check if it's running, and only if it is -
> add it to the running-jobs list? (And otherwise, into some sort of
> 'jobs-we-lost' list kept in the daemon)? Where do transactions come into
> play here?

deleting a job from the queue, forking the child, and adding the child's pid to
the running_jobs table need to be transaction protected.
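for illustration only - here's a minimal sketch of what "transaction
protected" buys you, using ruby's stdlib PStore as a stand-in for the real
database (every name here is made up):

```ruby
require "pstore"

# PStore stands in for the real db; it gives us a real transaction, so
# "remove job from queue" and "record the pid" either both happen or
# neither does - no window where the job is simply lost.
store = PStore.new "jobs.pstore"
store.transaction do
  store[:queue]        ||= ["echo hi > /dev/null"]
  store[:running_jobs] ||= {}
end

pid = fork { sleep }       # child exists but hasn't been told to run yet

store.transaction do       # if we die anywhere in here, nothing is lost
  job = store[:queue].shift
  store[:running_jobs][pid] = job
end

# in real life the command would now be sent down a pipe to the child
Process.kill "KILL", pid
Process.wait pid
```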

problems:

   - you cannot fork with an open connection

       http://lists.mysql.com/mysql/1127

       you need a way to do this:

         - start a child process
         - get its pid
         - start a transaction
         - remove job from queue
         - add pid to running_jobs
         - start job running
         - commit transaction

       ruby queue does exactly this to avoid the db/fork issue by setting up a
       drb process manager.  the process manager starts a pipe to a bash login
       shell (so the user gets a nice environment) and returns the pid.  during
       the transaction the command to be run is sent down the pipe.  in this
       way we can get the pid of the child and yet not start it 'running'.  i
       think this is the only way to avoid forking inside the transaction and,
       if you do not use transactions, you'll always have border problems.

   - your reap model is async too.  in addition to the above, we have to be
     prepared for __any__ transaction to be interrupted by a signal handler.
     this is, imho, impossible for all but the most gifted of programmers and
     not worth the hassle.  i know i could not make this model work correctly.

fwiw my drb process manager in ruby queue is only about 100 lines of code and
completely eliminates the above issues.  it's used like this (pseudo code)

   job_manager = JobManager.new  # drb server running on localhost in another process

   loop do

     while number_of_running_jobs < max_number_of_running_jobs
       job = next_queued_job
       child = job_manager.new_child

       db.transaction do
         remove_from_queue job
         child.run job.command
         add_to_running_jobs child.pid
       end
     end

     status = job_manager.wait -1  # wait for any child to finish

     db.transaction do
       job_is_finished status.pid, status.exit_status
     end

   end

note that all the forking and waiting occurs in another process!  this
__hugely__ simplifies the process of making the job queue service correct and
robust.
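a stripped-down sketch of such a manager - hypothetical names, and the real
rq code differs - might look like this.  the key trick is that new_child
forks a shell that blocks reading its command from a pipe, so we learn the
pid *before* the job starts running:

```ruby
require "drb"

# hypothetical sketch of a tiny process manager in the spirit described
# above.  the forked shell blocks in read(2) until run() sends it a
# command and closes the pipe, so the pid exists before the job does.
class JobManager
  Child = Struct.new(:pid, :pipe) do
    def run(command)
      pipe.write command     # send the job down the pipe...
      pipe.close             # ...eof tells the shell to start executing
    end
  end

  def new_child
    r, w = IO.pipe
    pid = fork do
      w.close
      exec "/bin/sh", "-c", r.read   # block until the command arrives
    end
    r.close
    Child.new pid, w
  end

  def wait(which = -1)       # reap any finished child
    pid = Process.wait which
    [pid, $?.exitstatus]
  end
end

# exported over drb this becomes the out-of-process fork/wait service:
#
#   DRb.start_service "druby://localhost:9000", JobManager.new
```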


> However, what I'm more seriously concerned about is the apparent
> 'reparenting' of the job. I'll try and be a bit more clear about what
> happened: Two jobs are currently running on the server, and this is the
> output of pstree on the pid of the daemon:
>
> $ pstree -p 24788
> ruby(24788)─┬─ruby(4854)───bash(4855)───MATLAB(4856)───matlab_helper(4936)
>            └─ruby(24859)───bash(24861)───MATLAB(24862)───matlab_helper(24929)
>
> So 24788 is my daemon, 4854 and 24859 are the forks (those are the PIDs) in
> the database, and 4856 and 24862 are eating up my CPU :)
>
> Now, here's an analog of how it was in the broken state:
>
> bash(4855)───MATLAB(4856)───matlab_helper(4936)
>
> ruby(24859)───bash(24861)───MATLAB(24862)───matlab_helper(24929)
>
> Meaning - parent dead (but its orphans live, now reparented to init for some
> reason), and one of the bashes got reparented to init. Kinda looks like
> matlab was run by a user, but it wasn't (verified).

it looks like your daemon aborted and was restarted and, because that munged
ruby's nice child handling, children were orphaned.  when children are
orphaned they become children of init (i guess you knew that...).  anyhow,
matlab and idl tweak signals themselves, so it can become very ugly to
manage children with signals.  ruby queue handles this by making sure all
children are collected in the normal way or, if that fails, killing them with
-9 and then collecting them on exit - in other words it simply refuses to exit
until all children are collected.

to me this really just looks like the daemon was killed, didn't collect every
child on exit, and therefore orphans were picked up by init.
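that collect-or-kill exit behaviour can be sketched roughly like this (the
helper name is made up):

```ruby
# refuse to exit until every child is collected: ask nicely with TERM,
# escalate to KILL (-9) after a grace period, and reap them as they die.
def collect_children(pids, grace = 2)
  pids.each { |pid| Process.kill("TERM", pid) rescue nil }
  deadline = Time.now + grace
  until pids.empty?
    pids.reject! do |pid|
      begin
        Process.wait(pid, Process::WNOHANG)   # pid if reaped, nil if alive
      rescue Errno::ECHILD
        true                                  # already gone - drop it
      end
    end
    if Time.now > deadline
      pids.each { |pid| Process.kill("KILL", pid) rescue nil }
    end
    sleep 0.1 unless pids.empty?
  end
end
```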

> Unless I'm missing something, this doesn't have to do with my lack of
> transactions, or things going out of sync... any ideas?

yes - it's a separate issue.  regardless - making an external process manager
will greatly simplify the whole program wrt both transactions and nice
child-waiting behaviour.

regards.

-a
-- 
be kind whenever possible... it is always possible.
- h.h. the 14th dalai lama