[omniORB] Omniorb threads dying without reason

Mon Sep 18 22:18:47 BST 2006

Hi everyone,

We are using Omniorb for high load application. The load on server side
is balanced on two HW nodes holding several application instances, and the
clients connect distinctivelly to one node based on some dispatching rule.

Recently we ran into following problem: sometimes, on server side, ORB
threads
are dying without being respawned again. This affects the client side,
as the whole deployment throughput drops dramatically.

This happens while application is running in thread pool mode, and an
Umarshalling exception occurs:

omniORB: (4) sendChunk: to giop:tcp:10.230.169.177:56894 204 bytes<AA>
omniORB: (4) inputMessage: from giop:tcp:10.230.169.177:56894 460 bytes<AA>
omniORB: (4) Dispatching remote call 'AccountBalanceUpdate5' to: root/my
poa<amServer> (active)<AA>
omniORB: (4) sendChunk: to giop:tcp:10.230.169.177:56894 204 bytes<AA>
omniORB: (4) inputMessage: from giop:tcp:10.230.169.177:56894 460 bytes<AA>
omniORB: (4) Dispatching remote call 'AccountBalanceUpdate5' to: root/my
poa<amServer> (active)<AA>
omniORB: (4) sendChunk: to giop:tcp:10.230.169.177:56894 204 bytes<AA>
omniORB: (4) throw giopStream::CommFailure from
giopStream.cc:834(0,NO,COMM_FAILURE_UnMarshalArguments)<AA>
omniORB: (4) Server connection refcount = 1<AA>
omniORB: (4) AsyncInvoker: thread id = 4 has exited. Total threads = 15<AA>

While checking the Omni code, I discovered that "giopStream.cc:834" line
should never be reached under
any conditions, at least the comments say so.
giopStream.cc:

    784
////////////////////////////////////////////////////////////////////////
    785 giopStream_Buffer*
    786 giopStream::inputMessage() {
    ............................................
    823   while ((buf->last - buf->start) < 12) {
    824
    825     int rsz = pd_strand->connection->Recv((void*)
    826             ((omni::ptr_arith_t)buf+buf->last),
    827             (size_t) (buf->end - buf->last),
    828             pd_deadline_secs,
    829             pd_deadline_nanosecs);
    830     if (rsz > 0) {
    831       buf->last += rsz;
    832     }
    833     else {
    834       errorOnReceive(rsz,__FILE__,__LINE__,buf,0);
    835       // never reaches here.
    836     }
    837   }

Checking up the call stack, I noticed in invoker.cc code, that any
unhandled exception
is being catched, and a warning is printed.
Moreover, the task wrapper catches any possible exception, and we never
got such warning in logs.
So the possibility of uncaught exception does not exists.

invoker.cc:
    136   void real_run() {
    137
    138     if (omniORB::trace(10)) {
    .........................................................
    180       pd_pool->pd_lock->unlock();
    181       try {
    182   pd_task->execute();
    183       }
    184       catch(...) {
    185   omniORB::logs(1, "AsyncInvoker: Warning: unexpected exception "
    186           "caught while executing a task.");
    187       }

On other hand, in other cases, we get UnmarshallingException, but the
threads continue to work normally.

omniORB: (18) Dispatching remote call 'SubscriptionQuery5' to: root/my
poa<smServer> (active)<AA>
omniORB: (18) sendChunk: to giop:tcp:10.230.169.177:56903 453 bytes<AA>
omniORB: (18) giopWorker task execute.<AA>
omniORB: (18) throw giopStream::CommFailure from
giopStream.cc:834(0,NO,COMM_FAILURE_UnMarshalArguments)<AA>
omniORB: (18) Server connection refcount = 1<AA>
omniORB: (18) Server connection refcount = 0<AA>
omniORB: (18) Server close connection from giop:tcp:10.230.169.177:56930<AA>
omniORB: (18) giopWorker task execute.<AA>

The difference that I could notice is that for second case, worker thread
is being used
for another job. It might be that due to low load, the thread is being
destroyed in the first case.
Also, another difference, is that the reference count towards the server
side object. In second case,
this reaches zero, while in the first case it remains 1.

Bellow is the server side configuration.
ORBthreadPerConnectionPolicy=1;         export ORBthreadPerConnectionPolicy
ORBthreadPerConnectionUpperLimit=16;    export
ORBthreadPerConnectionUpperLimit
ORBthreadPerConnectionLowerLimit=8;     export
ORBthreadPerConnectionLowerLimit
ORBthreadPoolWatchConnection=0;         export ORBthreadPoolWatchConnection

Any idea what could cause the thread problem is welcome (including
the explanation that there is no problem :) ).

Thanks in advance,
Nic