Darrin Thompson wrote:
> I'm running a fairly complicated build and test system with DRb over
> Ruby 1.8.6. It involves 12 Linux machines running several different
> distro versions and one Windows machine.
> 
> Lately I've been having problems where once in a while the machines
> involved in this system just stop communicating, and I can't figure out
> why. I've found on occasion I can work around the problem by changing
> the order of the operations or the frequency of them. It's more or less
> random when it occurs.
> 
> The only thing I can think of is that this all started when I added SUSE
> 9.3 and 9.4 machines to this system.
> 
> The other possibility is that now I have 12 Linux machines and a Windows
> machine all more or less arbitrarily talking with each other, so there
> might be a slowly increasing probability of a deadlock that I'm suddenly
> noticing because it's more likely with more machines.
> 
> I'm sitting here thinking of exotic ways TCP could be misconfigured out
> of the box on SUSE 9. But deep in my soul I'm sure it's some stupid code
> I wrote.
> 
> Anyway, the idea here is that a Windows machine sends messages to
> several Linux machines and the Linux machines send back log messages and
> occasionally a series of messages that represent the contents of a file.
> 
> If anyone has insight, I'd appreciate it. I'm running out of good ideas
> here.
> 
> --
> Darrin

It might help to add

Thread.abort_on_exception = true

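in case a DRb thread is dying silently. With that set, an uncaught
exception in any thread is re-raised in the main thread instead of just
killing that one thread. (DRb might be smarter than that, though.)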
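
To see what the flag changes, here is a minimal sketch in plain Ruby, with no
DRb involved. The helper name failure_seen_by_main? is made up for
illustration, and the sleep only stands in for the main thread doing other
work without ever joining the worker.

```ruby
# Sketch: does an exception in a background thread ever reach the main thread?
# failure_seen_by_main? is a hypothetical helper, not part of Ruby or DRb.
def failure_seen_by_main?(abort_flag)
  previous = Thread.abort_on_exception
  Thread.abort_on_exception = abort_flag
  begin
    Thread.new do
      # Silence the per-thread stderr report on Rubies that have it (2.5+),
      # so the only signal left is the one this flag controls.
      if Thread.current.respond_to?(:report_on_exception=)
        Thread.current.report_on_exception = false
      end
      raise "worker failed"
    end
    sleep 0.2   # main thread carries on and never joins the worker
    false       # nothing reached the main thread: the worker died silently
  rescue RuntimeError
    true        # the worker's exception was re-raised in the main thread
  ensure
    Thread.abort_on_exception = previous   # restore the global setting
  end
end

puts "flag off: #{failure_seen_by_main?(false)}"
puts "flag on:  #{failure_seen_by_main?(true)}"
```

With the flag off, the failure would only show up if something called join
or value on the dead thread, which a long-running server loop may never do.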

-- 
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407