Lennon Day-Reynolds <rcoder / gmail.com> wrote in message news:<5d4c612404091310044fa47610 / mail.gmail.com>...
> Chris,
> 
> There are quite a few reasons this could be going wrong, both inside
> and outside of your Ruby test harness. Since you said that the problem
> was only reproducible on long (>1hr.) tests, though, I would suspect
> socket connection (or other system-level resource) timeouts.

This is what I'm suspecting too, but the weird thing is that I would
expect to get something like a zero-byte result out of the read, or a
'connection reset by peer' or some similar business, not an Invalid
Handle sort of error.

The other thing that bugs me about this is that we got the error on
the receive half of the connection; I don't want that!  If it's gonna
fail, I'd much rather it fail on the send.  Is this related somehow?
I.e., when you close the write end of a socket, you send a FIN; so if
the other end ('them') has closed, it sends its FIN, immediately gets
the ACK (courtesy of the kernel), and goes into FIN_WAIT_2, whereas
the receiving end ('us') goes into CLOSE_WAIT as it waits for the app
to notice and close the socket.  So then we just write the request
(there's no way for the opposite end to signal that it's not going to
read any more; on Unix, when the opposite end has closed, I generally
find I can write once and then get an EPIPE on the second write), and
then we proceed to try to read, which is where we get the invalid
handle error.  The other alternative is that this socket somehow got
closed while we were still using it... but how would that happen
between us writing and reading?
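
For what it's worth, the write-once-then-EPIPE behaviour I mean is
easy to reproduce on Unix; something like this contrived sketch (not
our actual code) shows it:

    require 'socket'

    # Contrived sketch: the peer closes, then we write twice.
    server = TCPServer.new('127.0.0.1', 0)
    client = TCPSocket.new('127.0.0.1', server.addr[1])
    peer   = server.accept
    peer.close                   # 'them' closes; 'us' is in CLOSE_WAIT

    client.write("first\n")      # usually succeeds; peer answers RST
    sleep 0.1                    # give the RST time to come back
    begin
      client.write("second\n")   # now the kernel knows the peer is gone
    rescue Errno::EPIPE, Errno::ECONNRESET => e
      puts "second write died: #{e.class}"
    end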

The only issue with this is that the liveness checking should already
have detected that the socket was closed, ya?  It does a quick
select() poll on the socket to see if it's readable, which I would
think should notice a closed socket...
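
The poll itself is roughly this shape (paraphrased from memory; the
names are mine, not the real code):

    require 'socket'

    # An idle pooled socket should never be readable, so "readable"
    # means either stray data or EOF (i.e. the peer closed).
    def alive?(sock)
      readable, = IO.select([sock], nil, nil, 0)  # zero-timeout poll
      return true unless readable            # quiet socket: looks alive
      data = sock.recv(1, Socket::MSG_PEEK) rescue nil  # peek only
      !(data.nil? || data.empty?)            # "" or nil means it's dead
    end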

> 
> I would suggest adding a 'ping()' method to your DRb server, and then
> having clients call it periodically (say, every 5-10 seconds) in a
> background thread or process, as well as optionally before any call
> with important data to be transferred. That way, both the client and
> the server can detect connection failures before you have to worry
> about losing data.

Well, that's the problem; I'm not totally sure that will actually
help us out, since the connection seemed to fail mid-stream.  Maybe
the ping succeeds, which is all well and good, but how do I know the
real data traffic won't fail right after it?  Not to mention that the
pool may grow, so the ping could end up on a different connection
than the subsequent 'real' call...
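
For the record, the shape of what you're describing would be roughly
this (made-up names, and it still doesn't close the pooling hole):

    require 'drb'

    # Server side: a trivial ping on the front object.
    class Front
      def ping; :pong; end
      # ... the real interface ...
    end
    DRb.start_service('druby://localhost:9000', Front.new)

    # Client side: heartbeat from a background thread.
    remote = DRbObject.new_with_uri('druby://localhost:9000')
    Thread.new do
      loop do
        begin
          remote.ping
        rescue DRb::DRbConnError => e
          warn "heartbeat failed: #{e.message}"
        end
        sleep 5
      end
    end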

> DRb is cheap wire-level scaffolding, but it's not a reliable messaging
> system; that has to be handled at the application level.

Yup; I'm not expecting foolproof-ness (I'm much too ingenious), but
I'm just curious how it can fail in the rather bizarre way it seems
to be failing...

Thanks,
Chris