[omniORB] Long discourse/proposal: omniOrb Etherealize/Incarnate races

Wed, 11 Jul 2001 17:56:40 -0400 (EDT)

All;

Summary

  Synchronization between closely-spaced etherealization and incarnate
operations on a single object is less than completely perfect.  Objects
which propagate state between incarnations can be corrupted.  Object
systems which interlock between etherealization and incarnation (most
easily, on the first etherealization/incarnation cycle) can deadlock.

Discussion of the Present State of omniORB303

  The synchronization between servant activator etherealize and incarnate
functionality seems to be somewhat less than fully exact.

  As near as I can determine, the standard process of omniORB303 is as
follows.

1.  Some outside mechanism (in my case, a Least Recently Used Evictor
    mechanism) determines that an object is of some appropriate state
    and invokes the deactivate_object member function in presentation
    by the Portable Object Adapter (POA) serving the servant.

2.  After some preliminary actions to determine the validity of the
    proposed operation and otherwise prepare for actions to come,
    deactivate_object invokes omni::deactivateObject, which takes the
    following further actions.

    a)  The omniLocalIdentity object associated with the deactivating
        servant is identified through a hash table.

    b)  The omniLocalIdentity object is examined to determine whether
        or not there are, in fact, local object references to the
        servant (presumably the result of _this() operations).  If this
        is the case, the following further actions are taken.

          i)  A new omniLocalIdentity object is obtained having the same
              key values as those of the identity object previously
              identified, but with none of the other connections to the
              existing servant.  The newly-obtained identity object is
              substituted for the original in the hash table, implicitly
              removing the original identity object from the hash table.

         ii)  A loop is executed to convert each existing local object
              reference over to a remote object reference through a
              defaultLoopBack rope facility (whatever that is).

              I'm not sure what this does, but I suppose it might circumvent
              direct object reference to servant object member function calls
              and route things back through dispatch_to_sa so that the
              incarnate mechanism will be invoked should a local method
              delivery occur.

        If there were no local object references, the following alternative
        actions are taken.

          i)  The identified omniLocalIdentity object is extracted from the
              hash table.

        Note that, regardless of which set of actions is taken, the
        original, identified omniLocalIdentity object is extracted from
        the hash table.  A new local identity giving a remote location
        for the object, but no connection to the actual local servant
        components, may or may not exist in the hash table.

    c)  A backward edge from the servant to the original identity object
        is broken and a pointer to the identity object is returned as the
        functional result of the omni::deactivateObject operation.

3.  The deactivate member function in presentation by the returned (by
    pointer) omniLocalIdentity object is invoked.  This serves merely to
    decrement a pending method (?) counter maintained by the identity
    object.

    If this counter reaches 0, there are no method deliveries pending
    for (or currently being executed by) the servant and, thus, the
    servant would be considered idle.

4.  The removeFromOAObjList member function in presentation by the returned
    omniLocalIdentity object is invoked to extract that object from a
    doubly-linked list maintained by the POA (?) of objects which it is
    serving.

    I'm not sure what this list does, but it seems to have more to do with
    rounding up servants when the POA dies than with dispatching methods
    to the servant.

5.  If the object appears to be idle (as revealed by the method count
    maintained by the identity object), the following actions are taken.

    a)  If a servant activator can be identified, a task to etherealize
        the servant is queued; otherwise, the reference count of the
        servant is simply decremented, possibly resulting in the
        destruction of the servant should this be, in fact, the last
        reference to the servant.

    b)  The identity object is discarded by invoking its own die member
        function.

    If the object is not idle, the following alternative actions are
    taken.

    a)  detached_object is invoked, which simply increments a counter
        of objects that have been detached from the POA.  I'm sure this
        is wonderful, but I don't really understand just what flows
        from this as a result.

6.  At this point, if you are not terribly lucky, a method delivery for
    the object now queued for etherealization now arrives.  Apparently,
    nothing in the delivery process checks for the existence of the object
    until it arrives at the dispatch_to_sa member function presented by
    the POA.  Various checks to assure the validity of the proposed
    method delivery, etc., are performed.

    One of these preparatory operations is the obtaining of a lock on 
    the mutual exclusion lock serializing access to the attached servant
    activator object.  Why this lock is obtained at such an early point,
    especially since it is not yet certain that the servant activator
    object will actually be accessed, is unclear.

7.  The hash table from which the original omniLocalIdentity object was
    extracted is examined to determine if such a local identity object
    presently exists.  If such an identity object exists and it actually
    has a connection to a servant, then the method is dispatched to the
    servant (through the identity object) and functional return is made;
    however, if no such identity exists, or if the identity exists, but
    does not have a connection to the servant, as in the case of the
    new identity substituted in the hash table to satisfy the needs of
    local object references, then the following actions are taken.

    a)  The incarnate member function in presentation by the servant
        activator attached to the POA is invoked to create a fresh
        incarnation of the servant.

        Note that this invocation on the servant activator is protected
        by a mutual exclusion lock upon which a lock was obtained during
        the preparation phase of the dispatch_to_sa member function.

    b)  The method is delivered to the newly-reincarnated servant and
        functional return is made.  Possible exceptions to the method
        execution are fielded, converted as appropriate, and rethrown.
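
  As a rough illustration of the sequence above, here is a single-threaded
toy model.  It is a sketch only: ToyPoa, its members, and the string keys
are invented for the example and are not omniORB types.  The hash table is
reduced to a map from key to "has a servant connection", and the thread that
drains the etherealization queue is replaced by an explicit call, so the
unlucky interleaving of steps 5 through 7 can be forced deterministically.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>
#include <vector>

// Toy model only: none of these names come from the omniORB source.
struct ToyPoa {
    std::map<std::string, bool> table;         // key -> identity has a servant
    std::deque<std::string> etherealizeQueue;  // keys with etherealize queued
    std::vector<std::string> log;              // order of activator calls

    void activate(const std::string& key) { table[key] = true; }

    // Steps 2-5: the identity leaves the table at once, but the
    // etherealize call is only queued for later.
    void deactivateObject(const std::string& key) {
        assert(table.count(key) == 1);
        table.erase(key);
        etherealizeQueue.push_back(key);
    }

    // Steps 6-7: a missing (or servant-less) identity is taken to mean
    // "already etherealized", so incarnate runs immediately.
    void dispatchToSa(const std::string& key) {
        auto it = table.find(key);
        if (it == table.end() || !it->second) {
            log.push_back("incarnate " + key);
            table[key] = true;
        }
        log.push_back("dispatch " + key);
    }

    // Stands in for the separate thread that drains the queue.
    void runOneEtherealize() {
        if (etherealizeQueue.empty()) return;
        log.push_back("etherealize " + etherealizeQueue.front());
        etherealizeQueue.pop_front();
    }
};

// Forces the unlucky ordering: the method arrives before the queue drains.
std::vector<std::string> racyInterleaving() {
    ToyPoa poa;
    poa.activate("obj");
    poa.deactivateObject("obj");
    poa.dispatchToSa("obj");
    poa.runOneEtherealize();
    return poa.log;
}
```

  In this model racyInterleaving() records "incarnate obj" and "dispatch
obj" before "etherealize obj", which is the reversal of interest.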

The Difficulty

1.  The problem between etherealization and incarnation operations arises
    because the appropriate omniLocalIdentity (with the connections to the
    local servant object) is extracted from the hash table (the active
    object map?) just prior to (or, if method deliveries are still pending,
    possibly considerably before) the queueing of the etherealization task.

2.  The absence of this identity object from the hash table is then taken
    by dispatch_to_sa as a sign that etherealization has occurred when,
    in fact, there is some finite probability that the etherealization act,
    itself, has only been queued and is still pending.

3.  A mutual exclusion lock is held on the entire servant activator object
    while the dispatch_to_sa/incarnate complex is executing.  If a queued
    etherealize action attempts to execute during that time (even if it is
    not etherealizing the object being simultaneously incarnated), it will
    be blocked until that incarnation is completed.  This can lead to the
    reversal of the sequence: etherealizing an object after its (supposedly)
    subsequent reincarnation.

4.  If the act of incarnation requires that the previous etherealization
    be complete (for example, to obtain the prior state of the object in
    its re-incarnated form, as is the case in my own application), then
    this mis-synchronization of the two acts leads to probably invalid
    operation and, possibly, to a deadlock.

    a)  If an etherealization/incarnation cycle occurred prior to the
        pending etherealization task, a mis-synchronized incarnate can
        run to completion, but will pick up a possibly out-of-date object
        state.  Presuming that the pending etherealization does run while
        the re-incarnated object fields method deliveries, the subsequent
        etherealization of the re-incarnated object will overwrite that
        state with the now corrupted state of the object.

    b)  If no previous etherealization recorded a previous object state,
        the incarnate operation may hang or fail for lack of an object
        state to restore.  Attempts to wait for the etherealization task
        to complete are futile since the dispatch_to_sa/incarnate operation
        holds a blocking lock on the servant activator object.  Should
        the etherealization task be dequeued and executed, it will block
        pending the completion of the incarnate task, which itself is
    waiting for the completion of the blocked etherealization task.
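
  Case 4a can be traced with a small, single-threaded sketch.  Everything
here is hypothetical: ToyActivator, Servant, and the integer "state" are
stand-ins for an application that saves state in etherealize and restores
it in incarnate, and the numbers are arbitrary.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model only; the names are illustrative, not omniORB's.
struct Servant { int value; };

struct ToyActivator {
    std::map<std::string, int> store;  // state saved by etherealize

    Servant incarnate(const std::string& key) {
        assert(store.count(key) == 1);  // some earlier cycle saved a state
        return Servant{store[key]};
    }
    void etherealize(const std::string& key, const Servant& s) {
        store[key] = s.value;
    }
};

struct Result { int seenByNewIncarnation; int finalStored; };

Result staleStateScenario() {
    ToyActivator sa;
    sa.store["obj"] = 1;                   // saved by an earlier cycle

    Servant first = sa.incarnate("obj");   // loads 1
    first.value = 5;                       // method calls update the state
    // deactivate_object happens here: etherealize(first) is queued only.

    Servant second = sa.incarnate("obj");  // early dispatch loads the stale 1
    int seen = second.value;
    second.value += 1;                     // one method call on the new servant

    sa.etherealize("obj", first);          // the queued task finally stores 5
    sa.etherealize("obj", second);         // a later deactivation stores 2
    return Result{seen, sa.store["obj"]};
}
```

  With the acts in the intended order the second incarnation would have
started from 5 and stored 6; in the mis-ordered trace it starts from 1 and
the stored state ends at 2, so the first incarnation's updates are lost.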

  One can, of course, scoff at the improbability of such a race condition
ever resulting in such a deadlock.  The window of opportunity is narrow,
comprising merely the time necessary for an independent thread to dequeue
the etherealization task and execute it.  Scoff, scoff, scoff.  It, quite
naturally, happened to me on my first test case.
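
  The blocking relationship in case 4b can be reproduced with standard C++
threads.  This is only a sketch under stated assumptions: activatorLock
stands in for the servant activator lock, the promise stands in for "the
saved state now exists", and the 200 ms timeout is there solely so that the
example terminates (a real incarnate that waited without a bound would
simply hang).

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

struct Outcome {
    bool stateArrivedWhileLocked;    // did incarnate ever see the saved state?
    bool etherealizeRanAfterUnlock;  // did the task run once the lock was free?
};

Outcome incarnateWaitingOnPendingEtherealize() {
    std::mutex activatorLock;       // stand-in for the servant activator lock
    std::promise<void> stateSaved;  // fulfilled when etherealize completes
    std::future<void> saved = stateSaved.get_future();
    bool etherealizeRan = false;

    // dispatch_to_sa side: the lock is taken early and held across incarnate.
    std::unique_lock<std::mutex> held(activatorLock);
    assert(held.owns_lock());

    // Queue-draining thread: etherealize needs the same lock, so it blocks.
    std::thread etherealizer([&] {
        std::lock_guard<std::mutex> guard(activatorLock);
        etherealizeRan = true;
        stateSaved.set_value();
    });

    // incarnate waits for the saved state while still holding the lock.
    bool arrived = saved.wait_for(std::chrono::milliseconds(200))
                   == std::future_status::ready;

    held.unlock();        // give up, as if incarnate had failed
    etherealizer.join();  // only now can the etherealize task proceed
    return Outcome{arrived, etherealizeRan};
}
```

  The wait always times out, because the only thread able to satisfy it is
blocked on the lock that the waiter holds; the etherealize task runs only
after the lock is released.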

Solutions

  Unfortunately, this is a little tougher.  I am hesitant to propose code
adjustments since I have only been studying omniORB303 source code for a
few days.

  Further, commentary in the source code is sparse and alludes, to the
extent that I understand it, to a place-holder in the active object map
(the hash table?).  Commentary concludes with the laconic

  "Ignore for now..."

  Finally, perhaps omniORB4 has treated this problem?

  Perhaps the simplest solution that occurs to this newbie mind is as
follows.

1.  Add a pointer in the omniLocalIdentity object to a pending
    etherealization task element and a backward pointer from that
    element to the identity object.  By default, both pointers will
    be null.

2.  Leave an identification object in the hash table corresponding to the
    etherealizing servant.

    a)  In the event that object references to the servant exist, this can
        be the new identity object substituted by the present code.

    b)  If no such object references existed, a new identity object with
        only the appropriate object keys (and no servant connections) can
        be left in the hash table.

        I have not investigated the full ramifications of this.  I hope that
        leaving an unconnected-in-any-way identification object won't break
        everything in sight and beyond.  There is, unfortunately, a lot of
        code somewhat beyond the visible horizon.

3.  Without regard to which option left an omniLocalIdentity object in the
    hash table, adjust add_object_to_etherialization_queue code to form
    connections from the identification object to the etherealization task
    element when that task is queued.

4.  Adjust code in the etherealization task element doit() code to break
    the connections between identification and task element when the task
    completes.

5.  Adjust code in omniOrbPOA::dispatch_to_sa to examine any found
    omniLocalIdentity object for the presence of an etherealization task.
    Presumably there is a lock on the task queue that will have to be held
    before this examination can be made.  A lock on the servant activator
    should further synchronize access to these decision elements since
    that will either prevent the task from running if it is not already,
    or delay the dispatch_to_sa code until that running task completes.

    If such an etherealization task is found to exist, dequeue it and
    run it to completion from dispatch_to_sa before proceeding on to
    examine the identification object.  If the identification object
    reveals no connection to a servant, incarnate may now be run with
    the assurance that the corresponding etherealize has been completed.

6.  (Optional) Adjust omniOrbPOA::dispatch_to_sa code to delay acquisition
    of the servant_activator_lock until it is clear that incarnate will
    necessarily be run.  

    I do not see an immediate reason to suspend etherealization operations
    each time a method is to be dispatched to a servant in the purview of
    the servant activator.  A method may initiate object deactivation, but
    I see no cause of conflict between that and an ongoing etherealization
    of some other object.
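
  Steps 1 through 5 of the proposal can be sketched in the same toy,
single-threaded style.  This is not a patch against omniORB: PatchedPoa,
Identity, EtherealizeTask, and their members are invented for the example,
locking is left out entirely, and the point is only to show the intended
ordering when a method arrives while an etherealization is still queued.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <string>
#include <vector>

struct Identity;

struct EtherealizeTask {
    std::string key;
    Identity* identity;  // backward pointer to the identity (step 1)
};

struct Identity {
    bool hasServant = false;
    EtherealizeTask* pending = nullptr;  // forward pointer to the task (step 1)
};

// Toy model only: none of these names come from the omniORB source.
struct PatchedPoa {
    std::map<std::string, Identity> table;  // identities stay in the table
    std::list<EtherealizeTask> queue;       // pending etherealize tasks
    std::vector<std::string> log;           // order of activator calls

    void activate(const std::string& key) { table[key].hasServant = true; }

    void deactivateObject(const std::string& key) {
        Identity& id = table.at(key);
        assert(id.hasServant && id.pending == nullptr);
        id.hasServant = false;                       // step 2: placeholder stays
        queue.push_back(EtherealizeTask{key, &id});
        id.pending = &queue.back();                  // step 3: link both ways
    }

    void runTask(std::list<EtherealizeTask>::iterator it) {
        log.push_back("etherealize " + it->key);
        it->identity->pending = nullptr;             // step 4: unlink when done
        queue.erase(it);
    }

    void runOneEtherealize() {
        if (!queue.empty()) runTask(queue.begin());
    }

    void dispatchToSa(const std::string& key) {
        Identity& id = table[key];
        if (id.pending != nullptr) {                 // step 5: finish it first
            for (auto it = queue.begin(); it != queue.end(); ++it) {
                if (&*it == id.pending) { runTask(it); break; }
            }
        }
        if (!id.hasServant) {
            log.push_back("incarnate " + key);
            id.hasServant = true;
        }
        log.push_back("dispatch " + key);
    }
};

std::vector<std::string> patchedInterleaving() {
    PatchedPoa poa;
    poa.activate("obj");
    poa.deactivateObject("obj");
    poa.dispatchToSa("obj");   // arrives while the etherealize is still queued
    poa.runOneEtherealize();   // nothing left to do
    return poa.log;
}
```

  Here the same early method delivery records "etherealize obj" before
"incarnate obj", because dispatchToSa pulls the pending task off the queue
and completes it before deciding whether to incarnate.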

Of course I never read the CORBA standard.  Perhaps this behaviour is just
as it is supposed to be.

  If anybody has any thoughts, I would be happy to hear them, especially
before I start hacking at the code.  Of course, there is always the .zip
file if things don't work out.

Thanks,

Bill

--

Dr. William H. Jones
MS 5-11
NASA Lewis Research Center
21000 Brookpark Road
Cleveland, OH 44135

216 433 5862
Preferred:  William.H.Jones@grc.nasa.gov
Hard code:  enjones@witsend.grc.nasa.gov
Personal:   whjones@apk.net

Project Integration Architecture:
  http://www.grc.nasa.gov/WWW/price000/index.html

No man is so fortunate but that, at the hour of his doom,
he will not have at least a few about him who are pleased
with the proceedings.
    -- Marcus Aurelius