[omniORB] Missing Object Reference ( bug in omniORB !)
dmitry.dolinsky@tumbleweed.com
Thu, 02 Nov 2000 13:58:45 -0800
It looks like we have found the cause of that mysterious problem with invalid
object references, operations being dispatched to the wrong object, etc.
The bug is in this routine (objectRef.cc):
omniObject *
omni::locateObject(omniObjectManager*,omniObjectKey &k)
{
  omniObject::objectTableLock.lock();
  omniObject **p = &omniObject::localObjectTable[omniORB::hash(k)];
  while (*p) {
    if ((*p)->pd_objkey.native == k) {
      (*p)->setRefCount((*p)->getRefCount()+1);
      omniObject::objectTableLock.unlock();
      return *p;
    }
    p = &((*p)->pd_next);
  }
  ....
The problem is that unlock() happens before *p is dereferenced for the
return. So if another object is added to the list between unlock and return,
*p will end up referring to a different object. It becomes clearer if you
look at
  p = &((*p)->pd_next);
p is set to the address of a pd_next member (the pointer to the next element
in the list), not to the element itself. Inserting a new element may change
the value stored in pd_next and therefore the value of *p.
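
To make the hazard concrete, here is a small single-threaded sketch that
simulates the interleaving. Node and push_front are hypothetical stand-ins,
not omniORB code, and insertion at the head of the chain is an assumption
about how new entries are linked in. The second push_front plays the role of
another thread inserting after unlock().

```cpp
#include <cassert>

// Hypothetical stand-in for omniObject: just a key and a chain link.
struct Node {
    int   key;
    Node* next;
};

// Assumed insertion policy: link the new node in at the head of the chain.
void push_front(Node*& bucket, Node* n) {
    n->next = bucket;
    bucket  = n;
}

// Returns true if the slot that p addresses yields a different node
// after a later insertion, even though p itself never changed.
bool slot_changes_after_insert() {
    Node a{1, nullptr};
    Node* bucket = nullptr;
    push_front(bucket, &a);

    Node** p = &bucket;      // as in locateObject: p addresses the slot
    Node* found = *p;        // what a correct lookup should return (&a)

    Node b{2, nullptr};
    push_front(bucket, &b);  // simulates an insert by another thread after unlock()

    return found == &a && *p == &b;  // reading *p now gives the wrong node
}
```

The snapshot taken into found is exactly what reading *p before unlock()
achieves in the real routine.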
The minimum delta fix is to dereference before unlock():
  omniObject *result = *p;
  omniObject::objectTableLock.unlock();
  return result;
Also, I don't see the need to use a pointer to a pointer here. Another fix
would be to get rid of this extra level of indirection:
omniObject *
omni::locateObject(omniObjectManager*,omniObjectKey &k)
{
  omniObject::objectTableLock.lock();
  omniObject *p = omniObject::localObjectTable[omniORB::hash(k)];
  while (p) {
    if (p->pd_objkey.native == k) {
      p->setRefCount(p->getRefCount()+1);
      omniObject::objectTableLock.unlock();
      return p;
    }
    p = p->pd_next;
  }
  ....
Another thought is that it's a common technique to use automatic lock
objects that are created on the stack and unlock in the destructor. I'm not
sure why omniORB isn't using this here, but it would have eliminated this
problem, simplified the code and made it exception safe.
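
A minimal sketch of that technique follows. ScopedLock, FakeMutex, Node and
find are all hypothetical names made up for illustration, not omniORB
classes. FakeMutex only records its state so the behaviour can be checked
without threads. Because the return expression is evaluated before local
objects are destroyed, *p is read while the lock is still held.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical mutex stand-in that only records whether it is held.
struct FakeMutex {
    bool locked = false;
    void lock()   { locked = true; }
    void unlock() { locked = false; }
};

// Scoped lock: acquires in the constructor, releases in the destructor.
template <class Mutex>
class ScopedLock {
public:
    explicit ScopedLock(Mutex& m) : m_(m) { m_.lock(); }
    ~ScopedLock() { m_.unlock(); }
    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;
private:
    Mutex& m_;
};

// Hypothetical stand-in for omniObject.
struct Node {
    int   key;
    Node* next;
};

// Lookup in the style of locateObject, keeping the pointer to pointer.
Node* find(FakeMutex& mu, Node** slot, int key) {
    ScopedLock<FakeMutex> guard(mu);
    for (Node** p = slot; *p; p = &((*p)->next)) {
        if ((*p)->key == key)
            return *p;  // *p is read here, before guard's destructor unlocks
    }
    return nullptr;
}

// The lock is also released when an exception leaves the scope.
void throws_while_locked(FakeMutex& mu) {
    ScopedLock<FakeMutex> guard(mu);
    throw std::runtime_error("simulated failure");
}
```

With a guard like this there is no unlock() call to misplace on any of the
return paths.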
Dmitry Dolinsky
Tumbleweed Software Corp.
At 11:44 AM 10/31/00 +0000, Sai-Lai Lo wrote:
>Dmitry,
>
>I don't recall discovering any race condition related to object reference
>counting between 2.6.1 and the current release.
>
>One point you want to check is if you have implemented the loader hook
>correctly. When your loader hook returns an object reference to the ORB,
>the ORB consumes the object reference. In other words, it will call
>release() on the object reference when it is done with it. Therefore you
>should do a _duplicate() on the object reference before returning in your
>loader hook.
>
>Sai-Lai
>
>
> >>>>> dmitry dolinsky writes:
>
> > We are running into a problem with omniORB 2.6.1. The ORB throws an
> > invalid object reference exception on a multiprocessor machine
> > (Solaris 2.7) running for several hours under heavy load. The exception
> > was due to the fact that we try to _duplicate() or release() an object
> > reference where the reference count was at zero. After spending a fair
> > amount of time looking for missing _duplicate and race conditions in our
> > code, we did not find the cause of the problem.
> > Has anyone experienced anything similar? Could this be a bug in omniORB
> > 2.6.1 perhaps fixed in a later version? (For a number of reasons we
> > can't jump to a later version at this time, BTW). Any advice for
> > tracking this down?
> > Also, on a very rare occasion we are seeing an error that makes us
> > suspect that an operation is dispatched to the wrong object. This does
> > not happen as often as the lost object reference problem, but when it
> > does happen, it happens at about the same time. We suspect that the two
> > problems may be related.
> > Here is some more information about what we do. We backported some
> > patches for the scavenger from 2.8. Also, our objects persist in the
> > database. A limited number of them is in memory at any given time. Those
> > that are in memory are in the cache maintained by us. When a request for
> > an object arrives that is not in memory, we load the object from database,
> > register it with ORB, and add it to the cache possibly removing some
> > other object from the cache and deleting it (by calling _dispose()). This
> > happens in omniORB::loader hook. Typically the problem manifests itself
> > when we are calling _dispose(). The object that we have in our cache has
> > the reference count of zero.
>
>--
>Sai-Lai Lo S.Lo@uk.research.att.com
>AT&T Laboratories Cambridge WWW: http://www.uk.research.att.com
>24a Trumpington Street Tel: +44 1223 343000
>Cambridge CB2 1QA Fax: +44 1223 313542
>ENGLAND