> so your daemon apparently exited and it doesn't prevent itself from running
> twice.  it should obtain a lockfile or lock itself using posixlock (lockfile
> or posixlock gems on rubyforge) to prevent two instances from ever running
> at the same time.  how is your daemon started/kept alive?

hehe... manually, at this point, so that's not it. Excellent point though, and
I will make sure restarting the daemon is lock-protected once automatic restart
is in place.
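
For reference, a minimal sketch of that kind of lock protection. It uses the
stdlib File#flock rather than the lockfile or posixlock gems mentioned above,
and the lock path and the method name acquire_daemon_lock are made up for
illustration:

```ruby
require 'tmpdir'

# Hypothetical lock path (an assumption); any location the daemon can write to works.
LOCK_PATH = File.join(Dir.tmpdir, "queue_daemon.lock")

# Try to take an exclusive, non-blocking lock on the lock file.
# Returns the open File on success (keep it open for the daemon's lifetime),
# or nil if another instance already holds the lock.
def acquire_daemon_lock(path = LOCK_PATH)
  file = File.open(path, File::RDWR | File::CREAT, 0644)
  if file.flock(File::LOCK_EX | File::LOCK_NB)
    # Record our PID in the file, purely as a debugging aid.
    file.truncate(0)
    file.write("#{Process.pid}\n")
    file.flush
    file
  else
    # flock returned false: somebody else holds the lock.
    file.close
    nil
  end
end
```

The daemon would call acquire_daemon_lock once at startup and exit if it gets
nil back. The kernel drops the lock when the process dies, so a crashed daemon
should not leave a stale lock behind.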

> none of the db stuff is transactional, so it's quite possible for strange
> things to happen.  for instance
> 
> 1)
>>              delete_queue_entry queue_entry[:id]
>    #
>    # job removed.  process dies.  job lost.
>    #

I see. So I'd essentially have to remove the job from the queue (it's most
definitely not queued anymore), check if it's running, and only if it is, add
it to the running-jobs list? (And otherwise, put it into some sort of
'jobs-we-lost' list kept in the daemon?) Where do transactions come into play
here?

The main issue you're talking about is what happens when a job dies before
getting into the 'running jobs' list, and then a ghost comes into the list.
This would indeed stall the queue, but rather than add two mechanisms
(transactions and DRb), I'd rather just have the daemon periodically check
that jobs it thinks are running actually are. Some very useful info though,
much thanks!
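
A rough sketch of the periodic check I have in mind, assuming the daemon keeps
a plain list of PIDs for the jobs it believes are running. The method names
process_alive? and partition_running_jobs are hypothetical, not anything in the
current code:

```ruby
# True if a process with this PID currently exists.
# Signal 0 delivers nothing; it only performs the existence/permission check.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false  # no such process: the job is a ghost
rescue Errno::EPERM
  true   # the process exists but belongs to another user
end

# Split the PIDs the daemon believes are running into
# [still_alive, ghosts] so the ghosts can be dropped from the list.
def partition_running_jobs(pids)
  pids.partition { |pid| process_alive?(pid) }
end
```

Two caveats: a child that has exited but has not been waited on (a zombie)
still counts as alive here, and PIDs can be reused, so this only tells me that
some process has that PID, not that it is still my job.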

However, what I'm more seriously concerned about is the apparent 'reparenting'
of the job. I'll try to be a bit clearer about what happened: two jobs
are currently running on the server, and this is the output of pstree on
the pid of the daemon:

$ pstree -p 24788
ruby(24788)─┬─ruby(4854)───bash(4855)───MATLAB(4856)───matlab_helper(4936)
            └─ruby(24859)───bash(24861)───MATLAB(24862)───matlab_helper(24929)

So 24788 is my daemon, 4854 and 24859 are the forks (those are the PIDs in the
database), and 4856 and 24862 are eating up my CPU :)

Now, here's an analog of how it was in the broken state:

bash(4855)───MATLAB(4856)───matlab_helper(4936)

ruby(24859)───bash(24861)───MATLAB(24862)───matlab_helper(24929)

Meaning: the parent is dead (but its orphans live, now reparented to init for
some reason), and one of the bashes got reparented to init. It kinda looks like
matlab was run by a user, but it wasn't (verified).
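
To show what I mean by 'reparented', here is a small standalone sketch (not
taken from the daemon; the method name orphan_new_ppid is made up). A child
forks a grandchild and exits right away, and the grandchild reports which
process is its parent afterwards:

```ruby
# Returns [child_pid, new_ppid], where new_ppid is the parent PID the
# grandchild sees once the child that forked it has exited.
def orphan_new_ppid(timeout = 5)
  reader, writer = IO.pipe
  child = fork do
    reader.close
    original_parent = Process.pid
    fork do
      # Grandchild: poll until our parent PID changes (or we give up).
      deadline = Time.now + timeout
      sleep 0.05 while Process.ppid == original_parent && Time.now < deadline
      writer.puts Process.ppid
      writer.close
      exit!(0)
    end
    exit!(0)  # the child dies immediately, orphaning the grandchild
  end
  writer.close
  Process.wait(child)
  new_ppid = reader.gets.to_i
  reader.close
  [child, new_ppid]
end
```

On a typical system new_ppid comes back as 1 (init), though some Linux setups
hand orphans to a 'subreaper' process instead. Either way the orphan keeps
running under its new parent, which matches the picture above.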

Unless I'm missing something, this doesn't have to do with my lack of 
transactions, or things going out of sync... any ideas?
