[omniORB] omniNames crashes on OpenVMS Alpha with multiple
kernel threads enabled
Bruce Visscher
visschb@rjrt.com
Wed, 07 Jun 2000 18:06:34 -0400
Sai-Lai,
Thank you for your responses.
First, I tried my reproducer with the latest CVS snapshot (without any patches).
I was unable to get omniNames to crash. This was encouraging.
Next, I tried a "real" application startup. It failed in the same
place (omniORB_Ripper::run_undetached invokes decrRefCount, which calls
lock with a null Rope*).
So, I applied the patches and tried again. Same thing. See below.
> Perhaps you could try this patch:
>
> Index: strand.cc
> ===================================================================
> RCS file: /project/omni/cvsroot/omni/src/lib/omniORB2/orbcore/strand.cc,v
> retrieving revision 1.10.2.3
> diff -r1.10.2.3 strand.cc
> 95a96
> > #define CHECK_FOR_RACE_CONDITION 0x5246494E
> 166a168,179
> >
> > #ifdef CHECK_FOR_RACE_CONDITION
> > if (pd_refcount == 0 && pd_ripper_next != this &&
> > (omni::ptr_arith_t) pd_ripper_next != CHECK_FOR_RACE_CONDITION) {
> >
> > // This strand has been handed to the ripper thread. The ref count
> > // should not go to 0 before the ripper thread has processed it.
> > LOGMESSAGE(0,"Strand::decrRefCount: unexpected race condition. Abort program");
> > abort();
> > }
> > #endif
> >
> 705a719,721
> > #ifdef CHECK_FOR_RACE_CONDITION
> > p->pd_ripper_next = (Strand*) (omni::ptr_arith_t) CHECK_FOR_RACE_CONDITION;
> > #endif
>
> What the patch does is to trap the condition when a strand has been handed
> to the ripper thread but before the ripper gets to it, another thread
> calls decrRefCount and causes the ref count to go to 0. This seems to be
> what you are getting with your most recent crash. This shouldn't
> happen but if it does at least we have a core dump to look at who
> is doing what at the time.
>
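For anyone following along, the check the patch adds can be modelled in isolation. This is only a minimal, single-threaded sketch and not the omniORB code: Item, hand_to_ripper, ripper_process and decr_ref_detects_race are hypothetical names, and it assumes (from the patch's condition) that a self-pointing ripper link means "not queued for the ripper". Only the marker constant is taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a strand; NOT the omniORB Strand class.
struct Item {
    int refcount = 1;
    Item* ripper_next = this;  // assumption: self-pointer means "not queued"
};

// Same marker value as CHECK_FOR_RACE_CONDITION in the patch.
const std::uintptr_t kProcessedMark = 0x5246494E;

inline Item* processed_mark() {
    return reinterpret_cast<Item*>(kProcessedMark);
}

// Queue the item for the ripper (hypothetical helper).
void hand_to_ripper(Item& item, Item* next_in_queue) {
    assert(item.ripper_next == &item);  // must not already be queued
    item.ripper_next = next_in_queue;
}

// The ripper has reached the item: stamp it with the marker.
void ripper_process(Item& item) {
    item.ripper_next = processed_mark();
}

// Drop one reference. Returns true if the count reached zero while the
// item was queued for the ripper but not yet processed by it.
bool decr_ref_detects_race(Item& item) {
    --item.refcount;
    return item.refcount == 0 &&
           item.ripper_next != &item &&
           item.ripper_next != processed_mark();
}
```

In the patch the equivalent of a true return logs a message and calls abort(), so that a dump is left behind for inspection.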
Here's a patch to your patch:
175c175
< LOGMESSAGE(0,"Strand::decrRefCount","unexpected race condition. Abort program");
---
> LOGMESSAGE(0,"Strand::decrRefCount: unexpected race condition. Abort program");
And, no, I didn't get a race condition message. I did get the
OpenVMS equivalent of a core dump but due to bugs in the debugger I
cannot analyze it at the moment. [One item on my list of OpenVMS gripes
is that you apparently cannot analyze a process dump that dies in a
thread. The dump was produced due to a DECthreads bugcheck resulting
from the dereferenced null pointer.]
> Also, Duncan points out that there is a potential problem with
> tcpSocketRendezvouser::run_detached() under a very rare exception condition.
> I'm not sure if such an exception condition exists but then it's better to
> plug that hole:
>
> Index: tcpSocketMTfactory.cc
> ===================================================================
> RCS file: /project/omni/cvsroot/omni/src/lib/omniORB2/orbcore/tcpSocketMTfactory.cc,v
> retrieving revision 1.22.2.7
> diff -r1.22.2.7 tcpSocketMTfactory.cc
> 1145c1145
> < tcpSocketStrand *newSt = 0;
> ---
> > tcpSocketStrand *newSt;
> 1149a1150,1151
> >
> > newSt = 0;
>
>
In other posts, Sai-Lai wrote:
> We know the strand assertion failure caused by thread create failure inside
> the ctor of tcpSocketWorker has been fixed. (By the way, if thread create
> fails, one should look into how to increase the operating system imposed
> limit on the no. of threads allowed per process.) So please use the latest
> 2.8 source from our cvs repository or ftp snapshots.
I believe that DECthreads on OpenVMS does not intentionally impose any
draconian limits regarding the number of threads that a process is allowed
to create (which is an amazing statement since OpenVMS is famous for
imposing such limits!).
However, I did some testing and it turns out that with my set of quotas, I
can "only" create somewhere around 6000 threads in a process (using the
omnithread stacksize VMS default of 32K). The pthread_create function
returns ENOMEM which indicates not enough memory for stack, etc., rather
than EAGAIN which would indicate an OS-imposed restriction. If I change
the stacksize to 16K, the number of threads created goes up to something like
8000. I'm guessing that I'd run into some other problem before I got this
high anyway, but these numbers don't look right to me, so I may submit an
inquiry about this to Compaq.
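
A test of this kind can be sketched roughly as follows. This is not the program used for the numbers above; it is a minimal sketch using plain POSIX threads rather than omnithread, and ProbeGate, ProbeResult, probe_thread_limit and classify_create_error are hypothetical names. The reading of ENOMEM versus EAGAIN in classify_create_error follows the interpretation given above and may differ between platforms.

```cpp
#include <pthread.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <vector>

// Shared gate that keeps every probe thread alive until released.
struct ProbeGate {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    bool open;
};

struct ProbeResult {
    std::size_t created;  // threads successfully created
    int error;            // 0, or the code returned by the failing call
};

extern "C" void* probe_wait(void* arg) {
    ProbeGate* gate = static_cast<ProbeGate*>(arg);
    pthread_mutex_lock(&gate->mutex);
    while (!gate->open) pthread_cond_wait(&gate->cond, &gate->mutex);
    pthread_mutex_unlock(&gate->mutex);
    return nullptr;
}

// Create up to max_threads blocked threads with the given stack size and
// report how many were created and which error (if any) stopped the loop.
ProbeResult probe_thread_limit(std::size_t stack_bytes, std::size_t max_threads) {
    assert(max_threads > 0);
    ProbeGate gate;
    pthread_mutex_init(&gate.mutex, nullptr);
    pthread_cond_init(&gate.cond, nullptr);
    gate.open = false;

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    ProbeResult result = {0, 0};
    result.error = pthread_attr_setstacksize(&attr, stack_bytes);

    std::vector<pthread_t> threads;
    while (result.error == 0 && threads.size() < max_threads) {
        pthread_t tid;
        result.error = pthread_create(&tid, &attr, probe_wait, &gate);
        if (result.error == 0) threads.push_back(tid);
    }
    result.created = threads.size();

    // Open the gate so that all probe threads can exit, then join them.
    pthread_mutex_lock(&gate.mutex);
    gate.open = true;
    pthread_cond_broadcast(&gate.cond);
    pthread_mutex_unlock(&gate.mutex);
    for (std::size_t i = 0; i < threads.size(); ++i) {
        pthread_join(threads[i], nullptr);
    }

    pthread_attr_destroy(&attr);
    pthread_cond_destroy(&gate.cond);
    pthread_mutex_destroy(&gate.mutex);
    return result;
}

// Rough classification, following the interpretation in this message.
const char* classify_create_error(int error) {
    switch (error) {
        case 0:      return "no failure";
        case ENOMEM: return "out of memory (for example, for the thread stack)";
        case EAGAIN: return "system limit or resource shortage";
        case EINVAL: return "invalid attribute (for example, stack size too small)";
        default:     return "other error";
    }
}
```

Calling probe_thread_limit(32 * 1024, 10000) and then probe_thread_limit(16 * 1024, 10000) would approximate the two runs described above. On platforms whose minimum stack size is larger than the requested one, pthread_attr_setstacksize fails first and the result carries EINVAL with zero threads created.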
> This is what has been entered in the rlog entry of tcpSocketMTfactory.cc:
>
> Revision 1.22.2.5 1999/10/27 18:17:36 sll
> Fixed the ctor of tcpSocketWorker so that if thread create fails, the
> exception raised by omnithread does not cause assertion failure in the
> ORB.
>
> But Bruce seems to be seeing something else so this may not fix his problem.
Apparently not.
At least I seem to have a consistent reproducer. So, it's time to do
some digging...
> Keep me posted if you have any new evidence.
Will do.
Bruce
--
All generalities are false - including this one.
Bruce Visscher visschb@rjrt.com