On Wed, 25 Aug 2004, Martin DeMello wrote:

> Ara.T.Howard / noaa.gov wrote:
>> i have an application which does alot of byte range locking on an nfs
>> mounted file.  twice now, i've seen a 'dead' lock appear on our nfs
>> server - a lock held by a non-existent pid.  none of the processes in
>> question have been dying
>
> When you figure it out, could you post a followup here? Sounds like
> something it'd be useful to know about.
>
> martin

it's figured out.  i have it coded in a nice and ugly fashion and have been
testing it for the last 24 hours with a bit of code that 'breaks' the lock
every few seconds and forces recovery in the clients.  note that the entire
thing is an emergency-only procedure that takes place ONLY when the system is
already broken (hung locks), so the solution doesn't have to be 'perfect'.
i'm not talking about race conditions; i mean that it's difficult for one
system to recover the locking atomically in the presence of broken locks (for
obvious reasons) and, therefore, to guarantee that no remote client gets
killed.  in my case this is fine: my remote clients are immortal daemons that
restart if killed and, in fact, i want this to happen.  what i don't want is
for my entire system to hang.  i think it's acceptable to say that iff your
nfs locking implementation breaks, my code may break too, but it recovers
automatically afterwards.  so, in that context, here is the basic solution.


here is the basic algorithm:

   - an empty monitor file is used in conjunction with the file in
     question.  the reason we need an extra file is so we can make guarantees
     about the way in which it is updated (its mtime).  this file is kept in a
     directory alongside the file in question, which makes recovery much
     easier: the whole directory can be moved aside atomically.

   - apply your lock type (write/read) to the monitor file in non-blocking
     fashion.

       if lock succeeded

         proceed to apply the same lock type, also in non-blocking fashion, to
         the file in question.

         if the lock on the file in question succeeds, start a thread which
         loops touching the monitor file to keep it 'fresh', say every ten
         seconds, and proceed to use the file.  note that it's critical to use
         non-blocking locks since blocking locks (in ruby) will stop this
         thread!  if you must use blocking locks then a process must be forked
         to keep the monitor file fresh; you cannot use a thread.  make sure
         this thread/process is killed when you release the locks.

         if the lock on the file in question fails, raise an error: the
         protocol has failed for some reason.

       if lock failed

         if the monitor file is fresh, simply sleep a bit and retry.  this
         will be the normal execution path if lockd is working.

         else if the monitor file is stale one of two things must be true:

           - a process holding the lock has died and the nfs client/server have
             not managed to clean things up (hung lock - lockd bug)

           - a process holding the lock is partitioned on a slow network,
             running on a frozen cpu, or somehow has the lock but cannot keep
             the monitor file fresh.  there is a chance that recovery may kill
             this process - but it is already sick and hanging the system.

           in either case we attempt lockd recovery, taking the risk of
           killing a sick remote client.  note that MANY remote clients might
           detect the hung lock at once, so they themselves need to serialize
           recovery without using fcntl locks, since those locks may be hung!

           lockd recovery involves:

             - mark recovery start time

             - create an nfs-safe lockfile (my lockfile lib); this is a
               'blocking' (poll/sleep) operation that ensures only one remote
               process is attempting recovery at a time.

             - see if someone else has already recovered (flag file exists with
               a timestamp greater than the recovery start time).  if so, quit
               recovery and go back to attempting to get the first lock.

             - since we have the lockfile and no one else has recovered:

               #
               # prevent new processes from using either file
               #
                 mv directory dir.bak
               #
               # clear lockd locks (force new inode info)
               #
                 for file in monitor, file in question (now under dir.bak)
                   cp file tmp
                   rm file
                   mv tmp file
                 end
               #
               # mark recovery time (directory is still moved aside)
               #
                 touch dir.bak/lockd_recovery
               #
               # restore system
               #
                 mv dir.bak directory

             - retry to acquire the lock
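a rough ruby sketch of the acquire path described above (non-blocking lock on the monitor file, then on the file in question, a keepalive thread, and the fresh/stale test).  this is not the original implementation: the method names, the 10s/60s timings and the use of File#flock in place of fcntl byte-range locks are all assumptions for illustration (packing a struct flock for File#fcntl is platform specific).

```ruby
require "fileutils"
require "tmpdir"

# hypothetical tuning knobs (assumptions, not from the post's code)
MONITOR_REFRESH = 10 # seconds between touches of the monitor file
STALE_AFTER     = 60 # a monitor untouched this long is treated as stale

# non-blocking lock; File#flock stands in for fcntl byte-range locks.
# returns true if the lock was taken, false if someone else holds it.
def try_lock(file, kind)
  mode = (kind == :write ? File::LOCK_EX : File::LOCK_SH) | File::LOCK_NB
  file.flock(mode) != false
end

# keep the monitor file 'fresh' by bumping its mtime periodically
def start_keepalive(monitor_path, interval = MONITOR_REFRESH)
  Thread.new do
    loop do
      now = Time.now
      File.utime(now, now, monitor_path)
      sleep interval
    end
  end
end

# after a failed monitor lock: :retry if the holder looks alive, else :recover
def monitor_state(monitor_path, stale_after = STALE_AFTER)
  age = Time.now - File.mtime(monitor_path)
  age > stale_after ? :recover : :retry
end

# one attempt at the protocol; returns [monitor, data, keepalive] on
# success, or :retry / :recover when the monitor lock is held elsewhere
def attempt_lock(monitor_path, data_path, kind)
  monitor = File.open(monitor_path, "r+")
  unless try_lock(monitor, kind)
    monitor.close
    return monitor_state(monitor_path)
  end
  data = File.open(data_path, "r+")
  unless try_lock(data, kind)
    data.close
    monitor.close # closing also drops the flock on the monitor
    raise "protocol failure: monitor locked but #{data_path} is not"
  end
  [monitor, data, start_keepalive(monitor_path)]
end

# stop the keepalive first, then drop both locks
def release(monitor, data, keepalive)
  keepalive.kill
  keepalive.join
  [data, monitor].each do |f|
    f.flock(File::LOCK_UN)
    f.close
  end
end
```

a caller would loop on attempt_lock, sleeping on :retry and running lockd recovery on :recover, and call release once the transaction is done.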


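the recovery steps can be sketched the same way.  assumptions: with_nfs_lockfile is a hypothetical stand-in for the lockfile lib mentioned above, using the classic link(2)/link-count trick and doing no stale-lockfile handling; the lockfile must live outside the directory being renamed; and comparing the flag file's mtime with a local Time assumes client and server clocks roughly agree.

```ruby
require "fileutils"
require "socket"
require "tmpdir"

# hypothetical stand-in for an nfs-safe lockfile library: hard-link a
# unique file to lock_path and trust the link count, not link's return
# value.  no stale-lockfile detection: a dead holder blocks everyone.
def with_nfs_lockfile(lock_path, poll = 1)
  tmp = "#{lock_path}.#{Socket.gethostname}.#{Process.pid}"
  FileUtils.touch(tmp)
  until File.stat(tmp).nlink == 2
    begin
      File.link(tmp, lock_path)
    rescue Errno::EEXIST
      sleep poll # someone else is recovering; poll
    end
  end
  yield
ensure
  if tmp && File.exist?(tmp)
    File.unlink(lock_path) if File.stat(tmp).nlink == 2 # only if we hold it
    File.unlink(tmp)
  end
end

# rebuild the monitor file and the file in question with fresh inodes so
# lockd forgets the hung locks.  returns false if someone else already
# recovered after started_at, true if this call did the recovery.
def lockd_recover(dir, files, lock_path, started_at)
  with_nfs_lockfile(lock_path) do
    flag = File.join(dir, "lockd_recovery")
    return false if File.exist?(flag) && File.mtime(flag) > started_at

    bak = "#{dir}.bak"
    File.rename(dir, bak) # hide both files from new processes
    begin
      files.each do |name|
        path = File.join(bak, name)
        tmp  = "#{path}.tmp"
        FileUtils.cp(path, tmp) # same bytes, new inode
        File.unlink(path)
        File.rename(tmp, path)
      end
      FileUtils.touch(File.join(bak, "lockd_recovery")) # mark recovery time
    ensure
      File.rename(bak, dir) # restore the system
    end
    true
  end
end
```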
in my case all remote clients are prepared to get a suite of errors during
transactions, such as Errno::ENOENT, Errno::ESTALE, etc.  they allow quite a
few of these by sleeping and retrying, but eventually give up and die.  the
retry is o.k. because all access to the file must be in a transaction (so by
definition it is o.k. to re-execute on failure), and even if many retries are
made and the transaction is lost, the daemon will exit and restart (logging
this info).  this last situation would be bad, but not compared to having the
entire system freeze; again, this is ONLY an emergency effort.
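the sleep/retry/give-up policy above might look like this in a client; with_retries, the try count and the delay are hypothetical, and the block passed in is assumed to be a whole transaction that is safe to re-execute.

```ruby
# errors a client tolerates while a lockd recovery is shuffling files
TRANSIENT = [Errno::ENOENT, Errno::ESTALE].freeze

# run a transaction, retrying on transient nfs errors; after max_tries the
# error is re-raised so the daemon exits (and is restarted elsewhere)
def with_retries(max_tries: 5, delay: 1)
  tries = 0
  begin
    tries += 1
    yield
  rescue *TRANSIENT => e
    raise if tries >= max_tries
    warn "transient #{e.class} (try #{tries}/#{max_tries}); retrying"
    sleep delay
    retry
  end
end
```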


regards.



-a
--
===============================================================================
| EMAIL   :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE   :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it. 
|   --Dogen
===============================================================================