[omniORB] Long discourse/proposal: omniOrb Etherealize/Incarnate races
William H Jones
enjones@lerc.nasa.gov
Wed, 11 Jul 2001 17:56:40 -0400 (EDT)
All;
Summary
Synchronization between closely-spaced etherealization and incarnate
operations on a single object is less than completely perfect. Objects
which propagate state between incarnations can be corrupted. Object
systems which interlock between etherealization and incarnation (most
easily, on the first etherealization/incarnation cycle) can deadlock.
Discussion of the Present State of omniORB303
The synchronization between servant activator etherealize and incarnate
functionality seems to be somewhat less than fully exact.
As near as I can determine, the standard process of omniORB303 is as
follows.
1. Some outside mechanism (in my case, a Least Recently Used Evictor
mechanism) determines that an object is of some appropriate state
and invokes the deactivate_object member function in presentation
by the Portable Object Adapter (POA) serving the servant.
2. After some preliminary actions to determine the validity of the
proposed operation and otherwise prepare for actions to come,
deactivate_object invokes omni::deactivateObject, which takes the
following further actions.
a) The omniLocalIdentity object associated with the deactivating
servant is identified through a hash table.
b) The omniLocalIdentity object is examined to determine whether
or not there are, in fact, local object references to the
servant (presumably the result of _this() operations). If this
is the case, the following further actions are taken.
i) A new omniLocalIdentity object is obtained having the same
key values as those of the identity object previously
identified, but with none of the other connections to the
existing servant. The newly-obtained identity object is
substituted for the original in the hash table, implicitly
removing the original identity object from the hash table.
ii) A loop is executed to convert each existing local object
reference over to a remote object reference through a
defaultLoopBack rope facility (whatever that is).
I'm not sure what this does, but I suppose it might circumvent
direct object reference to servant object member function calls
and route things back through dispatch_to_sa so that the
incarnate mechanism will be invoked should a local method
delivery occur.
If there were no local object references, the following alternative
actions are taken.
i) The identified omniLocalIdentity object is extracted from the
hash table.
Note that, regardless of which set of actions is taken, the
original, identified omniLocalIdentity object is extracted from
the hash table. A new local identity giving a remote location
for the object, but no connection to the actual local servant
components, may or may not exist in the hash table.
c) A backward edge from the servant to the original identity object
is broken and a pointer to the identity object is returned as the
functional result of the omni::deactivateObject operation.
3. The deactivate member function in presentation by the returned (by
pointer) omniLocalIdentity object is invoked. This serves merely to
decrement a pending method (?) counter maintained by the identity
object.
If this counter reaches 0, there are no method deliveries pending
for (or currently being executed by) the servant and, thus, the
servant would be considered idle.
4. The removeFromOAObjList member function in presentation by the returned
omniLocalIdentity object is invoked to extract that object from a
doubly-linked list maintained by the POA (?) of objects which it is
serving.
I'm not sure what this list does, but it seems to have more to do with
rounding up servants when the POA dies than with dispatching methods
to the servant.
5. If the object appears to be idle (as revealed by the method count
maintained by the identity object), the following actions are taken.
a) If a servant activator can be identified, a task to etherealize
the servant is queued; otherwise, the reference count of the
servant is simply decremented, possibly resulting in the
destruction of the servant should this be, in fact, the last
reference to the servant.
b) The identity object is discarded by invoking its own die member
function.
If the object is not idle, the following alternative actions are
taken.
a) detached_object is invoked, which simply increments a counter
of objects that have been detached from the POA. I'm sure this
is wonderful, but I don't really understand just what flows
from this as a result.
6. At this point, if you are not terribly lucky, a method delivery for
the object just queued for etherealization arrives. Apparently,
nothing in the delivery process checks for the existence of the object
until it arrives at the dispatch_to_sa member function presented by
the POA. Various checks to assure the validity of the proposed
method delivery, etc., are performed.
One of these preparatory operations is the obtaining of a lock on
the mutual exclusion lock serializing access to the attached servant
activator object. Why this lock is obtained at such an early point,
especially since it is not yet certain that the servant activator
object will actually be accessed, is unclear.
7. The hash table from which the original omniLocalIdentity object was
extracted is examined to determine if such a local identity object
presently exists. If such an identity object exists and it actually
has a connection to a servant, then the method is dispatched to the
servant (through the identity object) and functional return is made;
however, if no such identity exists, or if the identity exists, but
does not have a connection to the servant, as in the case of the
new identity substituted in the hash table to satisfy the needs of
local object references, then the following actions are taken.
a) The incarnate member function in presentation by the servant
activator attached to the POA is invoked to create a fresh
incarnation of the servant.
Note that this invocation on the servant activator is protected
by a mutual exclusion lock that was obtained during
the preparation phase of the dispatch_to_sa member function.
b) The method is delivered to the newly-reincarnated servant and
functional return is made. Possible exceptions to the method
execution are fielded, converted as appropriate, and rethrown.
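The sequence above can be replayed in a small single-threaded model. This is
only a sketch of my reading of the steps: ToyPoa, its members, and replayRace
are hypothetical names, not omniORB303 code. An unordered_map stands in for the
hash table, a deque for the etherealization queue, and all locking is left out.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the POA; none of these names come from omniORB.
struct ToyPoa {
    std::unordered_map<std::string, int> active;        // key -> servant number
    std::deque<std::function<void()>> etherealizeQueue; // queued etherealize tasks
    std::vector<std::string> log;                       // order in which things ran
    int nextServant = 1;

    void activate(const std::string& key) { active[key] = nextServant++; }

    // Steps 2-5: the identity leaves the table at once, while the
    // etherealization itself is only queued for later.
    void deactivate(const std::string& key) {
        auto it = active.find(key);
        assert(it != active.end());
        const int servant = it->second;
        active.erase(it);
        etherealizeQueue.push_back([this, key, servant] {
            log.push_back("etherealize " + key + "#" + std::to_string(servant));
        });
    }

    // Steps 6-7: a missing identity is read as "already etherealized".
    void dispatch(const std::string& key) {
        if (active.find(key) == active.end()) {
            const int servant = nextServant++;
            active[key] = servant;
            log.push_back("incarnate " + key + "#" + std::to_string(servant));
        }
        log.push_back("invoke " + key + "#" + std::to_string(active[key]));
    }

    // The independent thread that drains the queue, modelled as a plain call.
    void runQueuedTasks() {
        while (!etherealizeQueue.empty()) {
            auto task = std::move(etherealizeQueue.front());
            etherealizeQueue.pop_front();
            task();
        }
    }
};

// A method delivery arrives between deactivate and the queued etherealize.
std::vector<std::string> replayRace() {
    ToyPoa poa;
    poa.activate("obj");
    poa.deactivate("obj");
    poa.dispatch("obj");
    poa.runQueuedTasks();
    return poa.log;
}
```

In this model the log returned by replayRace records the incarnation of servant
2 ahead of the etherealization of servant 1, which is the ordering at issue.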
The Difficulty
1. The problem between etherealization and incarnation operations arises
because the appropriate omniLocalIdentity (with the connections to the
local servant object) is extracted from the hash table (the active
object map?) just prior to (or, if method deliveries are still pending,
possibly considerably before) the queueing of the etherealization task.
2. The absence of this identity object from the hash table is then taken
by dispatch_to_sa as a sign that etherealization has occurred when,
in fact, there is some finite probability that the etherealization act
itself has only been queued and is still pending.
3. A mutual exclusion lock is held on the entire servant activator object
while the dispatch_to_sa/incarnate complex is executing. If a queued
etherealize action attempts to execute during that time (even if it is
not etherealizing the object being simultaneously incarnated), it will
be blocked until that incarnation is completed. This can lead to the
reversal of the sequence: etherealizing an object after its (supposedly)
subsequent reincarnation.
4. If the act of incarnation requires that the previous etherealization
be complete (for example, to obtain the prior state of the object in
its re-incarnated form, as is the case in my own application), then
this mis-synchronization of the two acts leads to probably invalid
operation and, possibly, to a deadlock.
a) If an etherealization/incarnation cycle occurred prior to the
pending etherealization task, a mis-synchronized incarnate can
run to completion, but will pick up a possibly out-of-date object
state. Presuming that the pending etherealization does run while
the re-incarnated object fields method deliveries, the subsequent
etherealization of the re-incarnated object will overwrite that
state with the now corrupted state of the object.
b) If no previous etherealization recorded a previous object state,
the incarnate operation may hang or fail for lack of an object
state to restore. Attempts to wait for the etherealization task
to complete are futile since the dispatch_to_sa/incarnate operation
holds a blocking lock on the servant activator object. Should
the etherealization task be dequeued and executed, it will block
pending the completion of the incarnate task, which itself is
waiting for the completion of the blocked etherealization task.
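Case 4a can be made concrete with a deterministic replay. The sketch below
assumes an object whose whole state is one counter, saved by etherealize and
restored by incarnate; Store, Servant, Outcome, and replayStaleState are
hypothetical names for illustration and say nothing about how a real servant
activator persists state.

```cpp
#include <cassert>
#include <map>
#include <string>

struct Servant { int counter = 0; };              // hypothetical object state
struct Store { std::map<std::string, int> saved; }; // hypothetical persistent store

// incarnate: restore whatever state was last saved, if any.
Servant incarnate(const Store& store, const std::string& key) {
    Servant s;
    auto it = store.saved.find(key);
    if (it != store.saved.end()) s.counter = it->second;
    return s;
}

// etherealize: save the servant's state for the next incarnation.
void etherealize(Store& store, const std::string& key, const Servant& s) {
    store.saved[key] = s.counter;
}

struct Outcome { int methodCalls; int finalSaved; };

Outcome replayStaleState() {
    Store store;
    const std::string key = "obj";
    int calls = 0;

    // An earlier, well-ordered cycle records a state of 1.
    Servant first = incarnate(store, key);
    ++first.counter; ++calls;
    etherealize(store, key, first);

    // Second incarnation handles one call; its etherealize is only queued.
    Servant second = incarnate(store, key);
    ++second.counter; ++calls;

    // A delivery arrives first, so the third incarnation reads the old state.
    Servant third = incarnate(store, key);
    assert(third.counter == first.counter); // stale: the second call is missing

    etherealize(store, key, second);        // the queued task finally runs
    ++third.counter; ++calls;
    etherealize(store, key, third);         // overwrites with state built on stale data

    return {calls, store.saved[key]};
}
```

Under these assumptions three method calls are handled but the saved counter
ends at 2, so one update is silently lost.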
One can, of course, scoff at the improbability of such a race condition
ever resulting in such a deadlock. The window of opportunity is narrow,
comprising merely the time necessary for an independent thread to dequeue
the etherealization task and execute it. Scoff, scoff, scoff. It, quite
naturally, happened to me on my first test case.
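The deadlock of case 4b is a two-party cycle, which can be written down as a
wait-for graph and checked mechanically. This is a generic technique, not
something taken from omniORB; the graph type, the node labels, and every
function name below are assumptions made for the illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical wait-for graph: an edge A -> B means "A cannot proceed until B does".
using WaitForGraph = std::map<std::string, std::set<std::string>>;

// Depth-first search: can `from` reach `target` by following wait-for edges?
bool reaches(const WaitForGraph& g, const std::string& from,
             const std::string& target, std::set<std::string>& seen) {
    auto it = g.find(from);
    if (it == g.end()) return false;
    for (const auto& next : it->second) {
        if (next == target) return true;
        if (seen.insert(next).second && reaches(g, next, target, seen)) return true;
    }
    return false;
}

// A deadlock exists if any party ends up (transitively) waiting on itself.
bool hasDeadlock(const WaitForGraph& g) {
    for (const auto& entry : g) {
        std::set<std::string> seen;
        if (reaches(g, entry.first, entry.first, seen)) return true;
    }
    return false;
}

// Case 4b as described: each side waits on the other.
WaitForGraph caseFourB() {
    WaitForGraph g;
    g["incarnate"].insert("etherealize"); // needs the state etherealize will save
    g["etherealize"].insert("incarnate"); // needs the activator lock incarnate holds
    return g;
}

// Same wait for saved state, but with the lock not held across the wait.
WaitForGraph lockReleasedBeforeWaiting() {
    WaitForGraph g;
    g["incarnate"].insert("etherealize");
    return g;
}

// Usage: the first graph contains a cycle, the second does not.
void selfCheck() {
    assert(hasDeadlock(caseFourB()));
    assert(!hasDeadlock(lockReleasedBeforeWaiting()));
}
```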
Solutions
Unfortunately, this is a little tougher. I am hesitant to propose code
adjustments since I have only been studying omniORB303 source code for a
few days.
Further, commentary in the source code is sparse and alludes, to the
extent that I understand it, to a place-holder in the active object map
(the hash table?). Commentary concludes with the laconic
"Ignore for now..."
Finally, perhaps omniORB4 has treated this problem?
Perhaps the simplest solution that occurs to this newbie mind is as
follows.
1. Add a pointer in the omniLocalIdentity object to a pending
etherealization task element and a backward pointer from that
element to the identity object. By default, both pointers will
be null.
2. Leave an identification object in the hash table corresponding to the
etherealizing servant.
a) In the event that object references to the servant exist, this can
be the new identity object substituted by the present code.
b) If no such object references existed, a new identity object with
only the appropriate object keys (and no servant connections) can
be left in the hash table.
I have not investigated the full ramifications of this. I hope that
leaving an unconnected-in-any-way identification object won't break
everything in sight and beyond. There is, unfortunately, a lot of
code somewhat beyond the visible horizon.
3. Without regard to which option left an omniLocalIdentity object in the
hash table, adjust add_object_to_etherialization_queue code to form
connections from the identification object to the etherealization task
element when that task is queued.
4. Adjust code in the etherealization task element doit() code to break
the connections between identification and task element when the task
completes.
5. Adjust code in omniOrbPOA::dispatch_to_sa to examine any found
omniLocalIdentity object for the presence of an etherealization task.
Presumably there is a lock on the task queue that will have to be held
before this examination can be made. A lock on the servant activator
should further synchronize access to these decision elements since
that will either prevent the task from running if it is not already,
or delay the dispatch_to_sa code until that running task completes.
If such an etherealization task is found to exist, dequeue it and
run it to completion from dispatch_to_sa before proceeding on to
examine the identification object. If the identification object
reveals no connection to a servant, incarnate may now be run with
the assurance that the corresponding etherealize has been completed.
6. (Optional) Adjust omniOrbPOA::dispatch_to_sa code to delay acquisition
of the servant_activator_lock until it is clear that incarnate will
necessarily be run.
I do not see an immediate reason to suspend etherealization operations
each time a method is to be dispatched to a servant in the purview of
the servant activator. A method may initiate object deactivation, but
I see no cause of conflict between that and an ongoing etherealization
of some other object.
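Steps 1 through 5 can be sketched in the same kind of single-threaded model.
Everything here is hypothetical (Identity, PendingTask, PatchedPoa, and
replayPatched are illustrative names, not omniORB types), the task queue lock
and servant activator lock mentioned in step 5 are left out, and step 6 is not
modelled.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

struct PendingTask;

struct Identity {
    int servant = 0;                // 0 means "no servant attached"
    PendingTask* pending = nullptr; // step 1: link to a queued etherealization
};

struct PendingTask {
    Identity* owner;                // step 1: backward pointer to the identity
    std::function<void()> body;
};

struct PatchedPoa {
    std::unordered_map<std::string, Identity> table; // element addresses stay valid
    std::list<PendingTask> queue;                    // element addresses stay valid
    std::vector<std::string> log;
    int nextServant = 1;

    void activate(const std::string& key) { table[key].servant = nextServant++; }

    // Steps 2-3: leave a servant-less identity in the table, linked to the task.
    void deactivate(const std::string& key) {
        Identity& id = table.at(key);
        assert(id.pending == nullptr); // not already queued for etherealization
        const int servant = id.servant;
        id.servant = 0;
        queue.push_back(PendingTask{&id, [this, key, servant] {
            log.push_back("etherealize " + key + "#" + std::to_string(servant));
        }});
        id.pending = &queue.back();
    }

    // Step 4: run the task, break the links, and drop it from the queue.
    void runTask(PendingTask* task) {
        task->body();
        task->owner->pending = nullptr;
        queue.remove_if([task](const PendingTask& t) { return &t == task; });
    }

    // Step 5: complete any pending etherealization before incarnating.
    void dispatch(const std::string& key) {
        Identity& id = table[key];
        if (id.pending != nullptr) runTask(id.pending);
        if (id.servant == 0) {
            id.servant = nextServant++;
            log.push_back("incarnate " + key + "#" + std::to_string(id.servant));
        }
        log.push_back("invoke " + key + "#" + std::to_string(id.servant));
    }

    void runQueuedTasks() {
        while (!queue.empty()) runTask(&queue.front());
    }
};

// A delivery arrives while the etherealization is still queued.
std::vector<std::string> replayPatched() {
    PatchedPoa poa;
    poa.activate("obj");
    poa.deactivate("obj");
    poa.dispatch("obj");
    poa.runQueuedTasks();
    return poa.log;
}
```

In this model the etherealization of servant 1 is logged before the incarnation
of servant 2, and nothing is left in the queue afterwards. Whether leaving a
servant-less identity in the real hash table is safe remains the open question
raised in step 2.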
Of course I never read the CORBA standard. Perhaps this behaviour is just
as it is supposed to be.
If anybody has any thoughts, I would be happy to hear them, especially
before I start hacking at the code. Of course, there is always the .zip
file if things don't work out.
Thanks,
Bill
--
Dr. William H. Jones
MS 5-11
NASA Lewis Research Center
21000 Brookpark Road
Cleveland, OH 44135
216 433 5862
Preferred: William.H.Jones@grc.nasa.gov
Hard code: enjones@witsend.grc.nasa.gov
Personal: whjones@apk.net
Project Integration Architecture:
http://www.grc.nasa.gov/WWW/price000/index.html
No man is so fortunate but that, at the hour of his doom,
he will not have at least a few about him who are pleased
with the proceedings.
-- Marcus Aurelius