[omniORB] Infinite polling loop with HUP socket causing 100% CPU usage

Mon May 11 11:14:59 BST 2015

On ma, 2015-05-11 at 10:52 +0100, Duncan Grisby wrote: 
> On Fri, 2015-05-08 at 16:43 +0200, Erik Cumps wrote:
> 
> [...]
> > We are using an older version of omniORB because we are tracking debian
> > wheezy, we have no immediate plans to change this.
> > 
> > Also I've noticed that 4.1.6 is still your latest stable release and
> > that there wasn't much activity in omniORB/src/lib/omniORB/orbcore/tcp/
> > so I'm hoping we can track this down in 4.1.6?
> 
> What makes you think that 4.1.6 is the latest stable release?  The
> latest stable release is 4.2.0. I can imagine that Debian would want to
> stick with the 4.1.x series for binary compatibility, but the last
> stable release in that series was 4.1.7.  Whichever way you look at it,
> 4.1.6 is not the latest release.

Sorry, my bad, I based this on http://omniorb.sourceforge.net/development.html

        ...
        trunk contains the current unstable development version. It is
        likely to become omniORB 4.2.0.

        branches contains stable branches of omniORB. omniORB and
        omniORBpy releases are synchronised with each other, so the
        branches are named after the omniORB release.

        branches/4_1 is the current stable branch, containing omniORB
        4.1.x and omniORBpy 3.x.

        tags contains tagged releases. omniORB and omniORBpy releases
        are synchronised with each other, so the tags are named after
        the omniORB release.

        tags/4_1_6 is the latest stable release, containing omniORB
        4.1.6 and omniORBpy 3.6.
        ...

> [...]
> > > If the socket omniORB is using to listen for new connections has become
> > > invalid, I'm not sure what it should do. It can't usefully carry on.
> > 
> > I agree. But it should do something different than performing an
> > infinite loop because it doesn't see the POLLERR on the fd or the
> > accept() call failing with EINVAL.
> > 
> > Would it make sense to handle the EINVAL as an EBADF here?
> 
> Probably, yes, but your server will still be broken, just in a slightly
> different way. It will no longer spin, but it will stop listening for
> new incoming connections. Really, we need to understand what has
> happened to cause the listening socket to fail. That should never occur.
> 
> Perhaps your code is accidentally closing a file descriptor that it
> shouldn't?

Yes. As a matter of fact, we had a condition where initially a logging
object on one thread created a /dev/log file descriptor.

After this some initialization runs which closes all open file
descriptors, and unfortunately the first thread is not aware of this and
keeps hanging on to the now closed file descriptor.

After this omniORB starts to run and for one of the omniorb threads the
internal pipe or listening socket reuses the previously closed /dev/log
file descriptor.

When at some point after that the logger object wants to log something
it will use a file descriptor which it thinks points to /dev/log but
which in reality points to a pipe or a network sockt. The logger object
notices the error when it tries to use the file descriptor, so it will
close it and create a new /dev/log file descriptor.

At this point the victim omniorb thread has its pipe or network socket
unexpectedly closed and it ends up in an infinite loop.

Now, for sure, the file descriptor activity is a bug and we fixed our
code, but I feel that omniORB should try to be either more robust or
more explicit in the handling of this exceptional case.

> [...]
> > However, in the else block of the if() statement modified by the patch,
> > we find:
> > 
> >                  else if (pd_pollfds[index].fd == pd_pipe_read) {
> >         #ifdef UnixArchitecture
> >                     char data;
> >                     read(pd_pipe_read, &data, 1);
> >                     pd_pipe_full = 0;
> >         #endif
> >                   }
> >         
> > So if the POLLHUP happens for the pd_pipe_read file descriptor, the code
> > will attempt to read a single byte and although it will actually read
> > zero bytes it will not detect this, so it cannot handle it. The
> > SocketCollection::Select() call will return 1, and it will keep on doing
> > this each time it is called again later.
> 
> That pipe is used totally internally within the omniORB process, as a
> way to wake up the thread blocked in poll(). It is never closed, so it
> can't ever get POLLHUP. (Except I suppose at server shutdown time, but
> in that case the thread doing the poll will know it's shutting down, so
> it doesn't matter what status it gets from the pipe.)

The same logic applies here.

Now I realise that in this case omniORB is a victim of circumstances
beyond its control, but as I said earlier I think omniORB should try to
do something more robust or explicit in this exceptional case, instead
of causing 100% cpu usage polling infinitely and waiting for a sys admin
to kill it. After all, the case is detectable with the unexpected
POLLERR or POLLHUP events.

As far as I'm concerned an abort() with coredump would be acceptable.

Kind regards,
Erik