Charlie Savage <cfis / savagexi.com> wrote:
> Sorry, these machines are actually CentOS 5.6.  The latest patches were
> applied via yum update about a week ago, so it's pretty up-to-date.

OK, I'm closer with 2.6.18-238.9.1.el5xen but still can't reproduce it.

I don't have permission to upgrade kernels on CentOS images,
unfortunately.  It's the weekend so the folks that do have permission
aren't around...

> So what we see is this test hanging:
> 
> def test_datagrams
>   $in = $out = ""
>   EM.run {
>     EM.open_datagram_socket "127.0.0.1", @port, TestDatagramServer
>     EM.open_datagram_socket "127.0.0.1", 0, TestDatagramClient, @port
>   }
>   assert_equal( "1234567890", $in )
>   assert_equal( "abcdefghij", $out )
> end
> 
> It hangs on the first EM.open_datagram_socket call.

Can you show us "strace -f -v" output from that test?

Maybe sprinkle some  `fprintf(stderr, "%s:%d\n", __FILE__, __LINE__);'
or similar inside EventMachine_t::OpenDatagramSocket and see where it
gets to?  It shouldn't hit gethostbyname()...

> Here is another one, this time from test_pure_ruby.rb (which in fact
> seems misnamed, since it is using the C code):
> 
> def test_connrefused
>   assert_nothing_raised do
>     EM.run {
>       setup_timeout(2)
>       EM.connect "127.0.0.1", @port, TestConnrefused
>     }
>   end
> end
> 
> In this one, it's the EM.connect call that hangs.

I can't reproduce this, either...

Also, can you extract these tests and run with a hand-picked port?
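
Before that, it might be worth confirming the hand-picked port can be
bound at all outside EventMachine.  A rough sketch using only Ruby's
stdlib; the helper name udp_port_bindable? and port 12345 are just
placeholders of mine, not anything from the EM test suite:

```ruby
require 'socket'

# Hypothetical helper (not part of EventMachine): returns true if a UDP
# socket can be bound to host:port right now, false if the bind fails.
def udp_port_bindable?(host, port)
  sock = UDPSocket.new
  sock.bind(host, port)
  true
rescue SystemCallError
  # EADDRINUSE, EACCES, etc. all mean the port is not usable for the test
  false
ensure
  sock.close if sock
end

# 12345 is an arbitrary placeholder; substitute the port you picked
puts udp_port_bindable?("127.0.0.1", 12345)
```

If that prints false, the problem is with the port rather than with EM.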

> Let me know if there is anything we can do to help debug this.  It
> happens across 8 servers (all of which are at the same CentOS release,
> albeit they did start as the same VM image a while back).

I assume you tried a clean build/install of Ruby to make sure all
objects got rebuilt and reinstalled?

Can you also try running `pmap $PID' on the hung processes to make sure
it's loading the correct libs + versions?
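
If pmap isn't handy, roughly the same information is in
/proc/$PID/maps on Linux.  A small sketch that lists the shared objects
a process has mapped; mapped_libs is a name I made up for illustration
and the ".so" filter is only a heuristic:

```ruby
# Hypothetical helper: list the unique shared objects mapped by a process,
# read from /proc/PID/maps (Linux only).
def mapped_libs(pid)
  paths = []
  File.foreach("/proc/#{pid}/maps") do |line|
    # Fields: address perms offset dev inode pathname
    # (pathname is absent or empty for anonymous mappings)
    path = line.split(' ', 6)[5]
    next if path.nil?
    path = path.strip
    paths << path if path.include?('.so')
  end
  paths.uniq.sort
end

# Usage: pass the PID of the hung process, e.g.
#   puts mapped_libs(12345)   # 12345 is a placeholder PID
```

A "(deleted)" suffix on a path would suggest the process is still
running against a library that has since been replaced on disk.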