[omniORB] Long running RMI causes TCP RST

Sat Mar 12 21:41:36 GMT 2005

I'm having trouble diagnosing an issue. I have a client app that makes 
a remote method invocation into a server, and the invoked method takes 
a long time to complete. On some client machines (all Windows XP) at 
some sites, the invocation consistently works. On client machines at 
other remote sites, the invocation sometimes causes the client to 
appear to hang. Debugging output shows that the following is happening:

---
1. Client calls longRunningMethod(), which should return a result
2. Server executes longRunningMethod() and returns data to the client
3. Client never returns from its invocation of longRunningMethod()
---

I captured the traffic with Ethereal and I can see the following occur:

---
Client calls longRunningMethod() by sending a TCP packet with (PSH,ACK)
Server responds immediately with an (ACK)

  after executing longRunningMethod()

Server sends client a TCP packet response with (PSH,ACK)
Client responds with a TCP packet (RST)
---

The reset causes the server to close the connection. Meanwhile the 
client (as far as the execution of the app is concerned) is still 
sitting in the RMI. No exception is thrown inside the client 
application. I can only think of a few possibilities:

1. The client-side thread that opened the socket has been killed. The 
app itself hasn't, because I have a GUI thread that still responds to 
user input. So when the TCP/IP stack gets the (PSH,ACK) from the server, 
it responds with a RST. This is something I can test. A side effect of 
this incident is that when the user chooses "Exit" via the GUI thread, 
which invokes CWinApp::ExitInstance(), the process just hangs and has to 
be terminated via Task Manager.

2. These clients sit behind NAT firewalls, and it is possible that the
firewall has determined that the connection is no longer in use and is 
responding to the server with the RST.

I'm doubtful about #1, since the client app consistently works on some 
remote sites. I'm leaning toward #2, except that:

1. Users have experienced this behavior within a matter of a few minutes.

2. Users do not experience this behavior after a consistent interval (at 
precisely 30 minutes, for example), as a fixed idle timeout would suggest.

3. Some users who experience the problem sit behind firewalls configured 
identically to those at other remote sites where users don't experience it.

4. I *believe* the default timeout interval for Linux NAT, which is 
firewalling a few of these remote sites, is 1 day.
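
If an idle-connection timeout in a NAT device does turn out to be the
cause, one way to test the theory would be to enable TCP keepalive on
the client socket, so that the connection never looks idle during the
long call. Below is a minimal sketch of that. It assumes direct access
to the socket descriptor (the standard CORBA API does not hand this out,
so in practice it would need a hook of some kind) and it assumes a Linux
host for brevity; the Windows XP clients would need the Winsock
equivalent. The helper names enableKeepalive and keepaliveEnabled and
the probe values are hypothetical.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: enable TCP keepalive probes on an existing socket.
// idleSecs     - idle time before the first probe is sent
// intervalSecs - gap between successive probes
// probeCount   - unanswered probes before the connection is dropped
// Returns true if every option was applied.
bool enableKeepalive(int fd, int idleSecs, int intervalSecs, int probeCount) {
    assert(idleSecs > 0 && intervalSecs > 0 && probeCount > 0);

    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
        return false;

    // The three options below are Linux-specific (assumption: Linux host).
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idleSecs, sizeof(idleSecs)) != 0)
        return false;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &intervalSecs, sizeof(intervalSecs)) != 0)
        return false;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probeCount, sizeof(probeCount)) != 0)
        return false;

    return true;
}

// Hypothetical helper: report whether SO_KEEPALIVE is set on the socket.
bool keepaliveEnabled(int fd) {
    int value = 0;
    socklen_t len = sizeof(value);
    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &value, &len) != 0)
        return false;
    return value != 0;
}
```

For example, enableKeepalive(fd, 60, 10, 5) would ask for the first
probe after 60 idle seconds, then one every 10 seconds, giving up after
5 unanswered probes. The idle time has to be shorter than whatever
timeout the NAT device applies for this to make any difference.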

Any thoughts?

Mike Mascari