On Fri, 20 Aug 2004, Lennon Day-Reynolds wrote:

> Have you considered using DRb, instead of raw pipes, to coordinate the
> work on Windows? Assuming this is more of a load-balancing system
> than, say, a massively-parallel cluster environment, the overhead of
> DRb marshalling and unmarshalling shouldn't be a big deal, and you
> could probably just make your "front" object a simple job controller,
> which could spawn processes, pass them input, and send output back to
> the client.
>
> Just a thought.


my system has n feeding processes 'competing' to process jobs from an nfs
mounted priority queue.  this obviously involves some sort of nfs db and/or
locking.  this is all provided by sqlite and some other classes i've written
(lockfile on raa).  the advantages this has are

   - no single point of failure.  if one node stays up the system continues

   - no networking needed (well NFS but that hardly counts).  this is important
     because it means no ports - and that means no sysads.  one thing i am
     striving for is that a user should be able to set a cluster up by simply
     running a piece of userland code pointing at an nfs mounted directory in
     under five minutes.  the niche this aims for is something less complicated
     than sun grid engine, or other systems which use daemons to communicate
     jobs and to schedule them, and something more than simply spawning jobs by
     spawning ssh sessions all over the place.  to my knowledge there is no
     such system.  and it's tremendously useful in a scientific setting where
     one often just wants to throw 30 nodes at a list of jobs right NOW.
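
a minimal sketch of workers competing for jobs through nothing but a shared
directory.  note this is NOT the sqlite backed queue described above: it swaps
in a directory of job files and relies on File.rename being atomic on the
shared filesystem, which is an assumption.  the queue/running layout, the
"<priority>-<name>" file naming and claim_next_job are all made up for
illustration.

```ruby
require 'fileutils'
require 'tmpdir'

# hypothetical layout: <root>/queue holds pending jobs, <root>/running holds
# claimed ones.  job files are named "<zero padded priority>-<name>" so a
# plain lexical sort yields priority order.
def claim_next_job(root, worker)
  Dir.children(File.join(root, 'queue')).sort.each do |name|
    src = File.join(root, 'queue', name)
    dst = File.join(root, 'running', "#{name}.#{worker}")
    begin
      File.rename(src, dst)  # assumed atomic: only one competing worker wins
      return dst
    rescue Errno::ENOENT
      next                   # another worker claimed it first, try the next job
    end
  end
  nil                        # nothing left to claim
end

# usage: a throwaway queue with two jobs and one hypothetical worker name
Dir.mktmpdir do |root|
  %w[queue running].each { |d| FileUtils.mkdir_p(File.join(root, d)) }
  File.write(File.join(root, 'queue', '01-urgent'), "echo urgent\n")
  File.write(File.join(root, 'queue', '05-later'), "echo later\n")
  while (job = claim_next_job(root, 'node1'))
    puts "node1 claimed #{File.basename(job)}"
  end
end
```

every node runs the same loop against the same directory, so there is no
scheduler process to lose; a node that dies simply stops claiming jobs.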

i considered drb for a really long time and it has the following disadvantages
that i can see

   - must open ports.  since sept. 11th. we have only ssh.  period.  ssh
     tunneling is an option but absolutely crazy when one starts considering
     how to keep ssh-agent running across reboots (we must use passphrases)
     without embedding passwords (forbidden and checked for here).  plus the
     number of ssh tunnels needed is n^2 - this gets ridiculous when you have
     30 nodes!

   - if you have a scheduler you have a single point of failure.  if all nodes
     can operate as the scheduler you need some sort of distributed locking
     protocol.  you could use the filesystem and nfs safe locks here.  if you
     have nfs safe locks you do not need drb and can simply put the queue in an
     nfs safe db (sqlite) and coordinate all actions via the filesystem.

     of course, you could start using something like a tuple space to
     coordinate - but again you have a single point of failure...

     i cannot see how one can either

       - eliminate a single point of failure using drb

       - make the system decentralized (all daemons are servants) without
         requiring some form of locking - thereby eliminating the need for drb
         in the first place

   - code like this is already written - condor, sge (sun grid engine) - and
     they have LOTS of problems.  scheduling is tough.
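
for reference, a minimal sketch of what an "nfs safe lock" can look like,
using the well known hard link trick: write a unique temp file, link() it to
the lock name, then trust the temp file's link count rather than link()'s
return value.  this is only an illustration under those assumptions, not a
description of how the lockfile library on raa is implemented; try_lock,
with_lock and their retries/delay parameters are hypothetical names.

```ruby
require 'socket'
require 'tmpdir'

# try to take the lock once; returns true if we now hold it.
def try_lock(lockfile)
  tmp = "#{lockfile}.#{Socket.gethostname}.#{Process.pid}"
  File.write(tmp, "#{Socket.gethostname} #{Process.pid}\n")
  begin
    File.link(tmp, lockfile)   # creating a hard link is atomic
  rescue Errno::EEXIST
    # fall through: the link count below says what really happened
  end
  File.stat(tmp).nlink == 2    # two links means our link() took effect
ensure
  File.unlink(tmp) if tmp && File.exist?(tmp)
end

# run the block while holding the lock, retrying with a jittered delay.
def with_lock(lockfile, retries: 50, delay: 0.1)
  retries.times do
    if try_lock(lockfile)
      begin
        return yield
      ensure
        File.unlink(lockfile)  # release the lock even if the block raises
      end
    end
    sleep(delay + rand * delay)
  end
  raise "could not acquire #{lockfile}"
end

# usage: guard a critical section on a (here: temporary) shared directory
Dir.mktmpdir do |dir|
  with_lock(File.join(dir, 'queue.lock')) { puts 'holding the lock' }
end
```

a real implementation would also need to deal with stale locks left behind by
crashed nodes, which this sketch ignores.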

if you have suggestions i'm all ears.

also, i should point out that virtually every scientific cluster in our
building already relies on nfs and locking to some degree so while it's true
that the nfs server itself is a single point of failure (and the network of
course) these things are already inherent in the system and my code adds no
MORE points of failure.

the system must run in the face of problems or i come in on weekends!  ;-(  so
i will not willingly introduce single points of failure into the system.  the
present setup requires only that sysads come in on the weekend - not me - so
i'd like to keep it that way.


in short i would LOVE to use drb for many reasons, but cannot come up with a
fault tolerant way to deal with ssh tunneling, scheduling, and locking that
does not make nfs mounted work queues a simpler solution in the process.

thoughts?

-a
--
===============================================================================
| EMAIL   :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE   :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it. 
|   --Dogen
===============================================================================