On Fri, 20 Aug 2004, Lennon Day-Reynolds wrote:

> Have you considered using DRb, instead of raw pipes, to coordinate the
> work on Windows? Assuming this is more of a load-balancing system
> than, say, a massively-parallel cluster environment, the overhead of
> DRb marshalling and unmarshalling shouldn't be a big deal, and you
> could probably just make your "front" object a simple job controller,
> which could spawn processes, pass them input, and send output back to
> the client.
>
> Just a thought.

my system has n feeding processes 'competing' to process jobs from an nfs
mounted priority queue. this obviously involves some sort of nfs db and/or
locking. this is all provided by sqlite and some other classes i've written
(lockfile on raa). the advantages this has are

  - no single point of failure. if one node stays up the system continues

  - no networking needed (well NFS but that hardly counts). this is important
    because it means no ports - and that means no sysads.

one thing i am striving for is that a user should be able to set a cluster up
in under five minutes by simply running a piece of userland code pointed at an
nfs mounted directory. the niche this aims for is something less complicated
than sun grid engine, or other systems which use daemons to communicate jobs
and to schedule them, and something more than simply spawning jobs via ssh
sessions all over the place. to my knowledge there is no such system. and it's
tremendously useful in a scientific setting where one often just wants to
throw 30 nodes at a list of jobs right NOW.

i considered drb for a really long time and it has the following disadvantages
that i can see

  - must open ports. since sept. 11th we have only ssh. period. ssh tunneling
    is an option but absolutely crazy when one starts considering how to keep
    ssh-agent running across reboots (we must use passphrases) without
    embedding passwords (forbidden and checked for here).
    plus the number of ssh tunnels needed is on the order of n^2 - this gets
    ridiculous when you have 30 nodes!

  - if you have a scheduler you have a single point of failure. if all nodes
    can operate as the scheduler you need some sort of distributed locking
    protocol. you could use the filesystem and nfs safe locks here. if you
    have nfs safe locks you do not need drb and can simply put the queue in an
    nfs safe db (sqlite) and coordinate all actions via the filesystem. of
    course, you could start using something like a tuple space to coordinate -
    but again you have a single point of failure... i cannot see how one can
    either

      - eliminate a single point of failure using drb

      - make the system decentralized (all daemons are servants) without
        requiring some form of locking - thereby eliminating the need for drb
        in the first place

  - code like it is already written - condor, sge (sun grid engine) and they
    have LOTS of problems. scheduling is tough.

if you have suggestions i'm all ears.

also, i should point out that virtually every scientific cluster in our
building already relies on nfs and locking to some degree. so while it's true
that the nfs server itself is a single point of failure (and the network of
course), these things are already inherent in the system and my code adds no
MORE points of failure. the system must run in the face of problems or i come
in on weekends! ;-( so i will not willingly introduce single points of failure
into the system. the present ones require only that sysads come in on the
weekend - not me - so i'd like to keep it that way.

in short i would LOVE to use drb for many reasons, but cannot come up with a
fault tolerant way to deal with ssh tunneling, scheduling, and locking that
does not make nfs mounted work queues a simpler solution in the process.

thoughts?
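
for concreteness, here is a minimal sketch of the "feeders competing for jobs
on a shared directory" idea. it is NOT the lockfile or sqlite code mentioned
above: it assumes plain files in place of the sqlite queue, and every name in
it (with_link_lock, submit, take_job, queue.lock, the pending/ and running/
layout) is hypothetical. the lock uses the hard-link trick commonly recommended
for nfs, and it makes no attempt to detect stale locks left by a crashed node.

```ruby
require 'fileutils'
require 'socket'
require 'tmpdir'

# hypothetical helper: take a lock by hard-linking a unique file to a shared
# name. link(2) is the usual workaround on nfs, where O_EXCL creates have been
# unreliable on older protocol versions. no stale-lock detection here.
def with_link_lock(dir, retries: 50, delay: 0.1)
  host   = Socket.gethostname
  lock   = File.join(dir, 'queue.lock')
  unique = File.join(dir, "lock.#{host}.#{Process.pid}.#{rand(1 << 32)}")
  File.open(unique, 'w') { |f| f.puts "#{host} #{Process.pid}" }
  begin
    retries.times do
      begin
        File.link(unique, lock)
      rescue SystemCallError
        # the reply can get lost over nfs, so the link count below decides
      end
      if File.stat(unique).nlink == 2 # our file now has two names: we own the lock
        begin
          return yield
        ensure
          File.unlink(lock)
        end
      end
      sleep delay
    end
    raise "could not lock #{dir}"
  ensure
    File.unlink(unique) if File.exist?(unique)
  end
end

# assumed layout: <root>/pending/<priority>-<name> holds the command to run.
# lower numbers sort first, so they are handed out first.
def submit(root, name, command, priority: 5)
  FileUtils.mkdir_p(File.join(root, 'pending'))
  FileUtils.mkdir_p(File.join(root, 'running'))
  with_link_lock(root) do
    File.write(File.join(root, 'pending', format('%03d-%s', priority, name)), command)
  end
end

# claim the highest priority pending job by moving it into running/ while
# holding the lock. returns the claimed path, or nil if the queue is empty.
def take_job(root)
  with_link_lock(root) do
    path = Dir.glob(File.join(root, 'pending', '*')).sort.first
    next nil unless path
    claimed = File.join(root, 'running',
                        "#{Socket.gethostname}.#{Process.pid}.#{File.basename(path)}")
    File.rename(path, claimed)
    claimed
  end
end

if $PROGRAM_NAME == __FILE__
  Dir.mktmpdir do |root| # stand-in for an nfs mounted directory
    submit(root, 'slow',   'sleep 1', priority: 9)
    submit(root, 'urgent', 'echo hi', priority: 1)
    while (job = take_job(root))
      puts "#{File.basename(job)}: #{File.read(job)}"
    end
  end
end
```

each node would run the take_job loop against the same directory. there is no
scheduler process and no open port: the only shared state is the filesystem,
which is the point being argued above.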
-a
--
===============================================================================
| EMAIL   :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE   :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
|   --Dogen
===============================================================================