<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">
<META NAME="GENERATOR" CONTENT="GtkHTML/4.4.4">
</HEAD>
<BODY>
<PING><BR>
<BR>
Ideas anyone?<BR>
<BR>
Thx,<BR>
Erik Cumps<BR>
<BR>
-----Original Message-----<BR>
<B>From</B>: Erik Cumps <<A HREF="mailto:Erik%20Cumps%20%3cerik.cumps@esaturnus.com%3e">erik.cumps@esaturnus.com</A>><BR>
<B>Reply-to</B>: erik.cumps@esaturnus.com<BR>
<B>To</B>: <A HREF="mailto:omniorb-list@omniorb-support.com">omniorb-list@omniorb-support.com</A><BR>
<B>Cc</B>: <A HREF="mailto:erik.cumps@esaturnus.com">erik.cumps@esaturnus.com</A><BR>
<B>Subject</B>: Infinite polling loop with HUP socket causing 100% CPU usage<BR>
<B>Date</B>: Tue, 21 Apr 2015 11:27:11 +0200<BR>
<BR>
<PRE>
Hi,
we just noticed a strange phenomenon with omniORB and we think this may be caused by a bug
in the socket handling implementation of omniORB.
Context: services implemented in python using omniORB and the omniORB python bindings, using
the following debian packages:
        ii libomniorb4-1 4.1.6-2 i386 omniORB core libraries
        ii python-omniorb 3.6-1 i386 Python bindings for omniORB
The service has a TCP socket open and when the remote client disconnects or when the connection
is broken the service has a thread that ends up in an infinite polling loop, using 100% CPU.
We have two cases, one service which has a thread repeatedly calling poll() and accept() on the
socket and the other service has a thread which is repeatedly calling poll() on the socket.
With strace the first case looks like this:
...
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, 1) = 1 ([{fd=5, revents=POLLIN|POLLERR|POLLHUP}])
accept(5, 0, NULL) = -1 EINVAL (Invalid argument)
gettimeofday({1429606184, 378281}, NULL) = 0
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, 0) = 1 ([{fd=5, revents=POLLIN|POLLERR|POLLHUP}])
accept(5, 0, NULL) = -1 EINVAL (Invalid argument)
gettimeofday({1429606184, 378671}, NULL) = 0
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, 0) = 1 ([{fd=5, revents=POLLIN|POLLERR|POLLHUP}])
accept(5, 0, NULL) = -1 EINVAL (Invalid argument)
gettimeofday({1429606184, 379108}, NULL) = 0
gettimeofday({1429606184, 379209}, NULL) = 0
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, 50) = 1 ([{fd=5, revents=POLLIN|POLLERR|POLLHUP}])
accept(5, 0, NULL) = -1 EINVAL (Invalid argument)
gettimeofday({1429606184, 379729}, NULL) = 0
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, 49) = 1 ([{fd=5, revents=POLLIN|POLLERR|POLLHUP}])
accept(5, 0, NULL) = -1 EINVAL (Invalid argument)
gettimeofday({1429606184, 380145}, NULL) = 0
...
As you can see each poll() returns with the POLLERR and POLLHUP events for the socket, and each
accept() returns with the EINVAL errno.
Also notice how the timeout for each poll() starts at 50 and becomes smaller until it reaches 0,
after which it starts again at 50.
The second case looks like this:
...
poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN}], 2, 50) = 1 ([{fd=4, revents=POLLHUP}])
gettimeofday({1429607075, 414623}, NULL) = 0
poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN}], 2, 50) = 1 ([{fd=4, revents=POLLHUP}])
gettimeofday({1429607075, 414870}, NULL) = 0
...
Again, notice how each poll() returns with the POLLHUP event for the socket.
So it looks like the code is not noticing the POLLHUP event and is also not
handling the accept() EINVAL error. This is presumably leading to the caller
continuously retrying and hence the polling loop.
We looked a bit at the code and we noticed that here is no POLLHUP handling.
In fact, grepping the code for POLLHUP returns no results at all.
And with regards to the EINVAL result of the accept() calls, we believe that
this error is reported (if tracing is on) but otherwise not acted upon. From
the CORBA::Boolean tcpEndpoint::notifyReadable(SocketHolder* sh) method in
src/lib/omniORB/orbcore/tcp/tcpEndpoint.cc:
if (sh == (SocketHolder*)this) {
// New connection
SocketHandle_t sock;
again:
sock = ::accept(pd_socket,0,0);
===> when the accept() returns, sock == RC_SOCKET_ERROR and ERRNO = EINVAL
if (sock == RC_SOCKET_ERROR) {
if (ERRNO == RC_EBADF) {
===> the EINVAL case does not end up here
omniORB::logs(20, "accept() returned EBADF, unable to continue");
return 0;
}
else if (ERRNO == RC_EINTR) {
===> the EINVAL case does not end up here either
omniORB::logs(20, "accept() returned EINTR, trying again");
goto again;
}
#ifdef UnixArchitecture
else if (ERRNO == RC_EAGAIN) {
===> the EINVAL case does not end up here either
omniORB::logs(20, "accept() returned EAGAIN, will try later");
}
#endif
if (omniORB::trace(20)) {
===> the EINVAL case ends up here only if omniORB::trace(20) returns true
===> even so it doesn't do anything useful with it
omniORB::logger log;
log << "accept() failed with unknown error " << ERRNO << "\n";
}
===> the if block is exited here
}
else {
#if defined(__vxWorks__)
// vxWorks "forgets" socket options
static const int valtrue = 1;
if(setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
(char*)&valtrue, sizeof(valtrue)) == ERROR) {
return 0;
}
#endif
// On some platforms, the new socket inherits the non-blocking
// setting from the listening socket, so we set it blocking here
// just to be sure.
SocketSetblocking(sock);
pd_new_conn_socket = sock;
}
===> finally the EINVAL case ends up here:
===> the socket holder is indicated to be selectable
===> and the functione returns true.
setSelectable(1,0,1);
return 1;
}
Kind regards,
Erik Cumps
</PRE>
</BODY>
</HTML>