[omniORB] omniNames crashes on OpenVMS Alpha with multiple kernel threads enabled

Wed, 07 Jun 2000 18:06:34 -0400

Sai-Lai,

Thank you for your responses.

First, I tried my reproducer with the latest CVS snapshot (without any patches). 
I was unable to get omniNames to crash.  This was encouraging.

Next, I tried a "real" application startup.  It failed in the same
place: omniORB_Ripper::run_undetached invokes decrRefCount, which calls
lock with a null Rope*.

So I applied the patches and tried again, with the same result.  See below.

> Perhaps you could try this patch:
> 
> Index: strand.cc
> ===================================================================
> RCS file: /project/omni/cvsroot/omni/src/lib/omniORB2/orbcore/strand.cc,v
> retrieving revision 1.10.2.3
> diff -r1.10.2.3 strand.cc
> 95a96
> > #define CHECK_FOR_RACE_CONDITION 0x5246494E
> 166a168,179
> >
> > #ifdef CHECK_FOR_RACE_CONDITION
> >   if (pd_refcount == 0 && pd_ripper_next != this &&
> >       (omni::ptr_arith_t) pd_ripper_next != CHECK_FOR_RACE_CONDITION) {
> >
> >     // This strand has been handed to the ripper thread. The ref count
> >     // should not go to 0 before the ripper thread has processed it.
> >     LOGMESSAGE(0,"Strand::decrRefCount: unexpected race condition. Abort program");
> >     abort();
> >   }
> > #endif
> >
> 705a719,721
> > #ifdef CHECK_FOR_RACE_CONDITION
> >       p->pd_ripper_next = (Strand*) (omni::ptr_arith_t) CHECK_FOR_RACE_CONDITION;
> > #endif
> 
> What the patch does is to trap the condition when a strand has been handed
> to the ripper thread but before the ripper gets to it, another thread
> calls decrRefCount and causes the ref count to go to 0. This seems to be
> what you are getting with your most recent crash. This shouldn't
> happen but if it does at least we have a core dump to look at who
> is doing what at the time.
> 

Here's a patch to your patch:

175c175
<     LOGMESSAGE(0,"Strand::decrRefCount","unexpected race condition. Abort program");
---
>     LOGMESSAGE(0,"Strand::decrRefCount: unexpected race condition. Abort program");

And, no, I didn't get a race condition message.  I did get the
OpenVMS equivalent of a core dump, but due to bugs in the debugger I
cannot analyze it at the moment. [One item on my list of OpenVMS gripes
is that you apparently cannot analyze a dump from a process that dies
in a thread.  The dump was produced by a DECthreads bugcheck resulting
from the null pointer dereference.]
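
For readers following along, the trap in the quoted strand.cc patch can be
sketched in isolation.  This is only an illustration, not the omniORB code:
Node, queue_state, hand_to_ripper and mark_processed are hypothetical names,
and the reading that "next pointer == this" means "not queued" is inferred
from the patch's condition rather than confirmed by the source.

```cpp
#include <cassert>
#include <cstdint>

// Same magic value as CHECK_FOR_RACE_CONDITION in the quoted patch.
static const std::uintptr_t kRipperDone = 0x5246494E;

// Hypothetical stand-in for a strand: a ref count plus a link that
// doubles as a state marker.
struct Node {
  int   refcount;
  Node* ripper_next;  // == this: not queued (assumption); sentinel: processed
  Node() : refcount(1), ripper_next(this) {}
};

enum class QueueState { NotQueued, Pending, Processed };

QueueState queue_state(const Node& n) {
  if (n.ripper_next == &n) return QueueState::NotQueued;
  if (reinterpret_cast<std::uintptr_t>(n.ripper_next) == kRipperDone)
    return QueueState::Processed;
  return QueueState::Pending;  // queued, ripper has not reached it yet
}

// The condition the patch aborts on: the ref count reached 0 while the
// node was queued for the ripper but not yet processed by it.
bool is_unexpected_race(const Node& n) {
  return n.refcount == 0 && queue_state(n) == QueueState::Pending;
}

// Queue the node in front of the current list head (may be null).
void hand_to_ripper(Node& n, Node* head) { n.ripper_next = head; }

// What the ripper does once it has dealt with the node.
void mark_processed(Node& n) {
  n.ripper_next = reinterpret_cast<Node*>(kRipperDone);
}
```

Overwriting the link with a sentinel costs no extra storage, which is
presumably why the patch reuses pd_ripper_next instead of adding a flag.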

> Also, Duncan points out that there is a potential problem with
> tcpSocketRendezvouser::run_detached() under a very rare exception condition.
> I'm not sure if such an exception condition exists but then it's better to
> plug that hole:
> 
> Index: tcpSocketMTfactory.cc
> ===================================================================
> RCS file: /project/omni/cvsroot/omni/src/lib/omniORB2/orbcore/tcpSocketMTfactory.cc,v
> retrieving revision 1.22.2.7
> diff -r1.22.2.7 tcpSocketMTfactory.cc
> 1145c1145
> <   tcpSocketStrand *newSt = 0;
> ---
> >   tcpSocketStrand *newSt;
> 1149a1150,1151
> >
> >     newSt = 0;
> 
> 
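
A rough sketch of the kind of hole the quoted tcpSocketMTfactory.cc change
appears to plug.  This is an assumption about the intent, not the actual
run_detached code: Conn, count_stale_cleanups and the failure list are
hypothetical.  The idea is that a pointer initialised once outside an accept
loop can still point at the previous iteration's object when a later
iteration throws before assigning it.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical connection object; handed_off means another owner
// (a worker thread, say) is now responsible for it.
struct Conn {
  bool handed_off;
  Conn() : handed_off(false) {}
};

// Runs one loop iteration per entry in fail_before_create and returns how
// often the catch handler tried to clean up an object it no longer owned.
int count_stale_cleanups(bool reset_each_iteration,
                         const std::vector<bool>& fail_before_create) {
  std::vector<Conn> pool(fail_before_create.size());
  int stale = 0;
  Conn* current = nullptr;  // initialised once, as before the patch
  for (std::size_t i = 0; i < pool.size(); ++i) {
    try {
      if (reset_each_iteration) current = nullptr;  // the patched behaviour
      if (fail_before_create[i]) throw std::runtime_error("accept failed");
      current = &pool[i];
      current->handed_off = true;  // ownership passes to someone else
    } catch (const std::runtime_error&) {
      // Cleanup path: only a half-built object should be released here.
      if (current != nullptr && current->handed_off) ++stale;
    }
  }
  return stale;
}
```

With a failure in the second of three iterations, the version without the
per-iteration reset touches the first connection again in its handler; the
version with the reset does not.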

In other posts, Sai-Lai wrote:

> We know the strand assertion failure caused by thread create failure inside
> the ctor of tcpSocketWorker has been fixed. (By the way, if thread create
> fails, one should look into how to increase the operating system imposed
> limit on the no. of threads allowed per process.) So please use the latest
> 2.8 source from our cvs repository or ftp snapshots.

I believe that DECthreads on OpenVMS does not intentionally impose any
draconian limits regarding the number of threads that a process is allowed
to create (which is an amazing statement since OpenVMS is famous for
imposing such limits!).

However, I did some testing and it turns out that with my set of quotas, I
can "only" create somewhere around 6000 threads in a process (using the
omnithread stacksize VMS default of 32K).  The pthread_create function
returns ENOMEM which indicates not enough memory for stack, etc., rather
than EAGAIN which would indicate an OS imposed restriction.  If I change
the stacksize to 16K, the number of threads created goes up to something like
8000.  I'm guessing that I'd run into some other problem before I got this
high anyway, but these numbers don't look right to me, so I may submit an
inquiry about this to Compaq.
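
A minimal sketch of this kind of test, written against plain POSIX threads
rather than omnithread.  It is not the program used for the numbers above;
ProbeResult, park and probe_thread_limit are hypothetical names.  It creates
threads with a given stack size until pthread_create fails or a cap is
reached, and reports the error code so that ENOMEM can be told apart from
EAGAIN.  Note that POSIX itself only lists EAGAIN for resource exhaustion,
so other platforms may not make the distinction seen here.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Outcome of one probe: threads created, plus the error code that stopped
// the loop (0 if the cap was reached without any failure).
struct ProbeResult {
  std::size_t created;
  int         error;
};

static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gate_cond = PTHREAD_COND_INITIALIZER;
static bool            gate_open = false;

extern "C" {
// Every probe thread parks here, so its stack stays allocated until the
// gate is opened.
static void* park(void*) {
  pthread_mutex_lock(&gate_lock);
  while (!gate_open) pthread_cond_wait(&gate_cond, &gate_lock);
  pthread_mutex_unlock(&gate_lock);
  return nullptr;
}
}

ProbeResult probe_thread_limit(std::size_t stack_bytes, std::size_t cap) {
  ProbeResult result = {0, 0};
  pthread_attr_t attr;
  int rc = pthread_attr_init(&attr);
  if (rc != 0) { result.error = rc; return result; }
  // Fails with EINVAL if stack_bytes is below the platform minimum.
  rc = pthread_attr_setstacksize(&attr, stack_bytes);
  if (rc != 0) {
    pthread_attr_destroy(&attr);
    result.error = rc;
    return result;
  }

  pthread_mutex_lock(&gate_lock);
  gate_open = false;
  pthread_mutex_unlock(&gate_lock);

  std::vector<pthread_t> threads;
  threads.reserve(cap);
  while (threads.size() < cap) {
    pthread_t t;
    rc = pthread_create(&t, &attr, park, nullptr);
    if (rc != 0) { result.error = rc; break; }  // e.g. EAGAIN or ENOMEM
    threads.push_back(t);
  }
  result.created = threads.size();

  // Release and reap every thread that was created.
  pthread_mutex_lock(&gate_lock);
  gate_open = true;
  pthread_cond_broadcast(&gate_cond);
  pthread_mutex_unlock(&gate_lock);
  for (std::size_t i = 0; i < threads.size(); ++i)
    pthread_join(threads[i], nullptr);
  pthread_attr_destroy(&attr);

  assert(result.created <= cap);
  return result;
}
```

Calling probe_thread_limit(32 * 1024, 10000) and then
probe_thread_limit(16 * 1024, 10000) would mirror the two runs described
above; the counts and the error code will depend on the platform and on
the process quotas.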

> This is what has been entered in the rlog entry of tcpSocketMTfactory.cc:
> 
>   Revision 1.22.2.5  1999/10/27 18:17:36  sll
>   Fixed the ctor of tcpSocketWorker so that if thread create fails, the
>   exception raised by omnithread does not cause assertion failure in the
>   ORB.
>
> But Bruce seems to be seeing something else so this may not fix his problem.

Apparently not.

At least I seem to have a consistent reproducer.  So, it's time to do
some digging...

> Keep me posted if you have any new evidence.

Will do.

Bruce
-- 
All generalities are false - including this one.

Bruce Visscher                                        visschb@rjrt.com