[omniORB] Long running RMI causes TCP RST
Mike Mascari
mascarm at mascari.com
Sat Mar 12 21:41:36 GMT 2005
I'm having a bit of a problem in diagnosing a particular issue. I have a
client app that makes a remote method invocation into a server which
takes a long time to complete. On some client machines (all Windows XP)
at some sites, the remote method invocation consistently works. On other
client machines at remote sites, the remote method invocation sometimes
causes the client to appear to hang. Debugging output shows that the
following is happening:
---
1. Client calls longRunningMethod(), which should return a result
2. Server executes longRunningMethod() and returns data to the client
3. Client never returns from its invocation of longRunningMethod()
---
I hooked up Ethereal to this process and I can see the following occur:
---
Client calls longRunningMethod() by sending a TCP packet with (PSH,ACK)
Server responds immediately with an (ACK)
After executing longRunningMethod(), the server sends the client a TCP
packet response with (PSH,ACK)
Client responds with a TCP packet (RST)
---
The reset causes the server to close the connection. Meanwhile the
client (as far as the execution of the app is concerned) is still
sitting in the RMI. No exception is thrown inside the client
application. I can only think of a few possibilities:
1. The client-side thread that opened the socket has been killed. The
app itself hasn't, because I have a GUI thread that still responds to
user input. So when the TCP/IP stack gets the (PSH,ACK) from the server,
it responds with a RST. This is something I can test. A side effect of
this incident is that when the user chooses "Exit" via the GUI thread,
which invokes CWinApp::ExitInstance(), the process just hangs and has to
be terminated via Task Manager.
2. These clients sit behind NAT firewalls, and it is possible that the
firewall has determined that the connection is no longer in use and is
responding to the server with the RST.
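Whichever possibility applies, the symptom is a call that never returns and never throws. A minimal sketch of a watchdog that lets another thread (the GUI thread, for instance) notice this is given below. It is an illustration only: watchCall and CallStatus are hypothetical names, the callable is a stand-in for the real longRunningMethod() invocation, and it uses standard C++ threading rather than anything from omniORB or MFC.

```cpp
#include <cassert>
#include <chrono>
#include <exception>
#include <future>
#include <memory>
#include <thread>

// Result of watching one blocking invocation (hypothetical helper type).
enum class CallStatus { Finished, StillBlocked };

// Runs `call` on its own thread and waits up to `deadline` for it to finish.
// The worker is detached on purpose: a call that never returns must not be
// able to block the watcher (a std::async future would block in its
// destructor until the call completed).
template <typename Fn>
CallStatus watchCall(Fn call, std::chrono::milliseconds deadline) {
    assert(deadline.count() >= 0);
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();
    std::thread([call, done]() mutable {
        try {
            call();  // stand-in for the remote invocation
            done->set_value();
        } catch (...) {
            // An exception also counts as "the call came back".
            done->set_exception(std::current_exception());
        }
    }).detach();
    return finished.wait_for(deadline) == std::future_status::ready
               ? CallStatus::Finished
               : CallStatus::StillBlocked;
}
```

A StillBlocked result only says that the call has not come back within the deadline. It does not by itself distinguish a killed thread from one that is still waiting on the socket, so it is a way to surface the hang rather than to diagnose it.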
I'm doubtful about #1, since the client app consistently works on some
remote sites. I'm leaning toward #2, except that:
1. Users have experienced this behavior within a matter of a few minutes.
2. The behavior does not occur at a consistent interval (at precisely
30 minutes, for example).
3. Some affected users sit behind firewalls configured identically to
those of users at other remote sites who don't experience the problem.
4. I *believe* the default timeout interval for Linux NAT, which is
firewalling a few of these remote sites, is 1 day.
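The objections above amount to a simple check: a fixed NAT idle timeout can only explain a reset if the connection had been idle for at least that long. A minimal sketch of that check follows. The function name and the sample durations used with it are made up for illustration, and the 1-day value is only the figure believed above, not a verified default.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

using Minutes = std::chrono::minutes;

// Returns true when no observation contradicts a fixed idle timeout, i.e.
// every reset happened after the connection had been idle for at least
// `idleTimeout`. A single earlier reset rules the explanation out.
bool consistentWithIdleTimeout(const std::vector<Minutes>& idleBeforeReset,
                               Minutes idleTimeout) {
    assert(idleTimeout.count() > 0);
    return std::all_of(idleBeforeReset.begin(), idleBeforeReset.end(),
                       [idleTimeout](Minutes idle) { return idle >= idleTimeout; });
}
```

With hypothetical observations of 3 and 12 minutes and a timeout of 24 * 60 minutes, the function returns false, which is the reasoning behind objections 1 and 4. It says nothing about firewalls whose timeout is unknown or much shorter than assumed.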
Any thoughts?
Mike Mascari