<div dir="ltr"><div>HI guys,</div><div><br></div><div>I have continued to work on deadlock issue by compiling omniORB library from sources, gathered from <br></div><div><a href="http://sourceforge.net/projects/omniorb/files/omniORB/omniORB-4.2.2/omniORB-4.2.2.tar.bz2">http://sourceforge.net/projects/omniorb/files/omniORB/omniORB-4.2.2/omniORB-4.2.2.tar.bz2</a>.</div><div><br></div><div>After compiling it with debug information & preventing optimization (./configure CXXFLAGS="-g -O0") and after I came back into a deadlock situation I was able to get some more information about the reason for the deadlock issue. I believe the problem is that the same thread is trying to lock the same mutex again. The following gdb backtrace depict the problematic thread:</div><div><span style="font-family:monospace"><br></span></div><div> <span style="font-family:monospace">(gdb) bt<br>#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135<br>#1 0x0000000004e4e025 in __GI___pthread_mutex_lock (mutex=0x66646f0) at ../nptl/pthread_mutex_lock.c:80<br>#2 0x0000000004c35fd7 in ?? () from /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so<br>#3 0x000000000085357d in omni_mutex::lock (this=0x66646f0) at /usr/include/omnithread.h:255<br>#4 0x0000000000a212e4 in omni_mutex_lock::omni_mutex_lock (this=0xcc28af0, m=...) at ../../../../include/omnithread.h:299<br>#5 0x0000000000b0c009 in omni::giopStream::errorOnReceive (this=0x67cbea8, rc=0, filename=0xbcd661 "giopStream.cc", lineno=404, buf=0x0, heldlock=true, <br> message=0xbcd718 "Timed out sleeping on read lock (waiting for receiver thread)") at giopStream.cc:718<br>#6 0x0000000000b0b3b6 in omni::giopStream::sleepOnRdLockAlways (this=0x67cbea8) at giopStream.cc:404<br>#7 0x0000000000b27e9b in omni::giopImpl12::inputReplyBegin (g=0x67cbea8, unmarshalHeader=0xb28a6e <omni::giopImpl12::unmarshalLocateReply(omni::giopStream*)>) at giopImpl12.cc:557<br>#8 0x0000000000b28194 in omni::giopImpl12::inputMessageBegin (g=0x67cbea8, unmarshalHeader=0xb28a6e <omni::giopImpl12::unmarshalLocateReply(omni::giopStream*)>) at giopImpl12.cc:626<br>#9 0x0000000000b152bb in omni::GIOP_C::IssueLocateRequest (this=0x67cbea0) at GIOP_C.cc:167<br>#10 0x0000000000af0f9e in omniRemoteIdentity::locateRequest (this=0xa350a00, call_desc=...) at remoteIdentity.cc:206<br>#11 0x0000000000af076a in omniRemoteIdentity::dispatch (this=0xa350a00, call_desc=...) at remoteIdentity.cc:82<br>#12 0x0000000000ad2465 in omniObjRef::_invoke (this=0xa350aa0, call_desc=..., do_assert=false) at omniObjRef.cc:675</span></div><div><span style="font-family:monospace">......</span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:arial,sans-serif">In <span style="font-family:monospace">omni::giopImpl12::inputReplyBegin()</span> omniTransportLock is locked (line 538) and later on in <span style="font-family:monospace">omni::giopStream::errorOnReceive()</span> function (line 718) the same mutex is trying to be locked again.<br><br></span></div><div><span style="font-family:monospace"><span style="font-family:monospace"><span style="font-family:arial,sans-serif">After this finding I have checked the code on 4.2.3 version and found the problem has already been fixed by modifying <span style="font-family:monospace">errorOnReceive()</span> function and locking mutex optionally and not unconditionally. 
I believe this thread can be closed; with the 4.2.3 version our deadlock issue should be resolved.

Thanks,
Damijan

On Mon, Jul 5, 2021 at 3:09 PM Damijan Skvarc <damjan.skvarc@gmail.com> wrote:

Hi.

With the omniORB 4.2.2 version (delivered with Ubuntu 18.04) we have started facing a deadlock issue. Once I was able to attach gdb to a process in which such a deadlock had happened and found the following situation. There were 21 threads; one of them was stuck in:

#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f691f213025 in __GI___pthread_mutex_lock (mutex=0x55c31d6f4e00) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f691fc1d6c0 in omni::giopStream::errorOnReceive(int, char const*, unsigned int, omni::giopStream_Buffer*, bool, char const*) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#3 0x00007f691fc1da4b in omni::giopStream::sleepOnRdLockAlways() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#4 0x00007f691fc369ac in omni::giopImpl12::inputReplyBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#5 0x00007f691fc37c17 in omni::giopImpl12::inputMessageBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#6 0x00007f691fc252fc in omni::GIOP_C::ReceiveReply() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#7 0x00007f691fc047e8 in omniRemoteIdentity::dispatch(omniCallDescriptor&) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2

while the other 20 threads were stuck in:

#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f691f214fb3 in __pthread_mutex_cond_lock (mutex=mutex@entry=0x55c31d6f4e00) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f691f216e7d in __pthread_cond_wait_common (abstime=0x7f68c5613eb0, mutex=0x55c31d6f4e00, cond=0x55c31d82b9d8) at pthread_cond_wait.c:645
#3 __pthread_cond_timedwait (cond=0x55c31d82b9d8, mutex=0x55c31d6f4e00, abstime=0x7f68c5613eb0) at pthread_cond_wait.c:667
#4 0x00007f691f90ca84 in omni_condition::timedwait(unsigned long, unsigned long) () from /usr/lib/x86_64-linux-gnu/libomnithread.so.4
#5 0x00007f691fc1d9f1 in omni::giopStream::sleepOnRdLockAlways() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#6 0x00007f691fc369ac in omni::giopImpl12::inputReplyBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#7 0x00007f691fc37c17 in omni::giopImpl12::inputMessageBegin(omni::giopStream*, void (*)(omni::giopStream*)) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#8 0x00007f691fc252fc in omni::GIOP_C::ReceiveReply() () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#9 0x00007f691fc047e8 in omniRemoteIdentity::dispatch(omniCallDescriptor&) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2
#10 0x00007f691fbe7b3c in omniObjRef::_invoke(omniCallDescriptor&, bool) () from /usr/lib/x86_64-linux-gnu/libomniORB4.so.2

By inspecting the pthread mutex at address 0x55c31d6f4e00 I found that the first thread already owned this mutex, yet its call stack shows it trying to lock the mutex again. It looks as if the thread for some reason "forgot" to unlock it in one of the previous steps.

Since I couldn't get any additional data, I started instrumenting the code with Helgrind. Unfortunately, due to different timing conditions, I was not able to run into the same obstacle again. I have instrumented all versions from 4.2.0 up to 4.2.4, and in all of them I found another race condition, which happens in the following case:

==9765== Possible data race during read of size 1 at 0x67C9275 by thread #1
==9765== Locks held: none
==9765== at 0xB5ACBA: omni::SocketHolder::setNonBlocking() (SocketCollection.h:108)
==9765== by 0xB5A39D: omni::tcpConnection::Send(void*, unsigned long, omni_time_t const&) (tcpConnection.cc:71)
==9765== by 0xB16D19: omni::giopStream::sendChunk(omni::giopStream_Buffer*) (giopStream.cc:1129)
==9765== by 0xB3391A: omni::giopImpl12::outputMessageEnd(omni::giopStream*) (giopImpl12.cc:1320)
==9765== by 0xB1E881: omni::GIOP_C::InitialiseRequest() (GIOP_C.cc:109)
==9765== by 0xAF9363: omniRemoteIdentity::dispatch(omniCallDescriptor&) (remoteIdentity.cc:97)
==9765== by 0xADAF14: omniObjRef::_invoke(omniCallDescriptor&, bool) (omniObjRef.cc:675)
==9765== by 0x873DCD: i_NET::_objref_RemoteNode::GetFlags(i_NET::NodeNames const&) (i_remote_node.cc:1418)
==9765== by 0x82B49C: mci::RemoteNode::OnGetFlags(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) const (remote_node.cpp:333)
==9765== by 0x822E00: mci::TreeNode::GetFlags(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) const (tree_node.cpp:94)
==9765== by 0x7A6190: mci::Node::GetFlags() const (node.cpp:312)
==9765== by 0x68C0B2: MciProxy::Connect(mci::Root, unsigned int) (mciProxy.cpp:56)
==9765==
==9765== This conflicts with a previous write of size 1 by thread #8
==9765== Locks held: none
==9765== at 0xB5ACA2: omni::SocketHolder::setBlocking() (SocketCollection.h:101)
==9765== by 0xB5A55D: omni::tcpConnection::Recv(void*, unsigned long, omni_time_t const&) (tcpConnection.cc:155)
==9765== by 0xB15CA6: omni::giopStream::inputMessage() (giopStream.cc:844)
==9765== by 0xB31149: omni::giopImpl12::inputNewServerMessage(omni::giopStream*) (giopImpl12.cc:438)
==9765== by 0xB319CB: omni::giopImpl12::inputMessageBegin(omni::giopStream*, void (*)(omni::giopStream*)) (giopImpl12.cc:646)
==9765== by 0xB209EF: omni::GIOP_S::dispatcher() (GIOP_S.cc:137)
==9765== by 0xB1DE1E: omni::giopWorker::execute() (giopWorker.cc:77)
==9765== by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)

This race condition happens between a write and a read operation; however, there are also cases where the race is between two write operations, which is more problematic.
For example:

==9765== Possible data race during write of size 1 at 0x67C9275 by thread #7
==9765== Locks held: none
==9765== at 0xB5ACD7: omni::SocketHolder::setNonBlocking() (SocketCollection.h:110)
==9765== by 0xB5A39D: omni::tcpConnection::Send(void*, unsigned long, omni_time_t const&) (tcpConnection.cc:71)
==9765== by 0xB10FC6: omni::sendCloseConnection(omni::giopStrand*) (giopStrand.cc:121)
==9765== by 0xB12AE4: omni::Scavenger::execute() (giopStrand.cc:698)
==9765== by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)
==9765== by 0xAC5885: omniAsyncPoolGeneral::workerRun(omniAsyncWorker*) (invoker.cc:297)
==9765== by 0xAC3F45: omniAsyncWorker::mid_run() (invoker.cc:511)
==9765== by 0xAC4614: omniAsyncWorkerInfo::run() (invoker.cc:662)
==9765== by 0xAC55C1: omniAsyncWorker::run(void*) (invoker.cc:126)
==9765== by 0xB67F33: omni_thread_wrapper (posix.cc:459)
==9765== by 0x4C38C26: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==9765== by 0x4E4B6DA: start_thread (pthread_create.c:463)
==9765==
==9765== This conflicts with a previous write of size 1 by thread #8
==9765== Locks held: none
==9765== at 0xB5ACA2: omni::SocketHolder::setBlocking() (SocketCollection.h:101)
==9765== by 0xB5A55D: omni::tcpConnection::Recv(void*, unsigned long, omni_time_t const&) (tcpConnection.cc:155)
==9765== by 0xB15CA6: omni::giopStream::inputMessage() (giopStream.cc:844)
==9765== by 0xB31149: omni::giopImpl12::inputNewServerMessage(omni::giopStream*) (giopImpl12.cc:438)
==9765== by 0xB319CB: omni::giopImpl12::inputMessageBegin(omni::giopStream*, void (*)(omni::giopStream*)) (giopImpl12.cc:646)
==9765== by 0xB209EF: omni::GIOP_S::dispatcher() (GIOP_S.cc:137)
==9765== by 0xB1DE1E: omni::giopWorker::execute() (giopWorker.cc:77)
==9765== by 0xAC43A1: omniAsyncWorker::real_run() (invoker.cc:578)

Obviously, the tcpConnection entity is not thread-safe; the sketch at the end of this message shows the flagged pattern in isolation.

I don't know whether this problem leads to the aforementioned deadlock issue. However, I am a bit confused, since I don't know whether the omniORB library is responsible for this kind of locking or whether I, as a user of the omniORB library, should handle it manually somehow.

Would you mind giving me some advice?

Thanks,
Damijan Skvarc
Senior Software Engineer
Instrumentation Technologies, Slovenia
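The sketch mentioned above: this is not omniORB's actual SocketHolder/tcpConnection code; the Connection class and its nonblocking member are invented purely to show the shape of the Helgrind reports. Send() flips the socket into non-blocking mode while a concurrent Recv() on the same connection flips it back to blocking mode, and the mode is a plain one-byte member written with no lock held.

#include <thread>

struct Connection {                   // stands in for tcpConnection + SocketHolder (illustrative only)
    bool nonblocking = false;         // plain one-byte flag, like the size-1 field at 0x67C9275

    void Send() { nonblocking = true;  /* ... setNonBlocking(), then write to the socket ... */ }
    void Recv() { nonblocking = false; /* ... setBlocking(), then read from the socket ... */ }
};

int main()
{
    Connection conn;

    // A client-side call path (Send) and a worker path (Recv) touch the same flag
    // on the same connection with no lock held -> Helgrind reports the
    // unsynchronised write/write (or write/read) access shown above.
    std::thread sender  ([&] { conn.Send(); });
    std::thread receiver([&] { conn.Recv(); });

    sender.join();
    receiver.join();
    return 0;
}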