[omniORB] RE: serious stability problems with omniORB4 snapshots on Solaris 8: bug located!

Mon, 4 Feb 2002 16:04:55 +0100

Hi all,

I've located a race condition in SocketCollection::Select, which causes
at least one of my problems:

The 'Unrecoverable error for this endpoint: giop:unix:/tmp/echo.bb, it
will no longer be serviced.' message is caused by a race condition in
SocketCollection::Select. This method first creates a file descriptor
set and then performs a select on it. However, between the fd_set
creation and the select call another thread may have close()d a
connection file descriptor in this set. This causes select() to return
EBADF ('invalid file descriptor'). Way up in the call chain this is
translated to an 'unrecoverable error', with known results....

I guess the easiest solution to this problem is to check for EBADF and
retry the fd_set creation and select() in that case.
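
To make the idea concrete, here is a minimal sketch of the retry, not a
patch against the real SocketCollection code. The names
select_with_rebuild and snapshot are made up for illustration; snapshot
is assumed to return the descriptors that are currently open (taking
whatever lock protects the collection), and the retry limit is an
arbitrary choice.

```cpp
#include <sys/select.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <functional>
#include <vector>

// Hypothetical helper, NOT omniORB's actual API.
// Rebuilds the fd_set from a fresh snapshot on every attempt, so a
// descriptor closed by another thread drops out of the set on retry.
int select_with_rebuild(const std::function<std::vector<int>()>& snapshot,
                        fd_set* out_ready,
                        struct timeval timeout,
                        int max_retries = 3)
{
    assert(out_ready != nullptr);
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        fd_set rfds;
        FD_ZERO(&rfds);
        int maxfd = -1;
        for (int fd : snapshot()) {
            if (fd < 0 || fd >= FD_SETSIZE) continue;  // cannot go in an fd_set
            FD_SET(fd, &rfds);
            if (fd > maxfd) maxfd = fd;
        }

        struct timeval tv = timeout;  // select() may modify its timeout
        int n = select(maxfd + 1, &rfds, nullptr, nullptr, &tv);
        if (n >= 0) {
            *out_ready = rfds;
            return n;  // number of readable descriptors (0 on timeout)
        }
        if (errno == EBADF || errno == EINTR) {
            continue;  // stale descriptor or signal: rebuild and try again
        }
        return -1;  // any other error is reported to the caller
    }
    return -1;  // still failing after max_retries; errno is left set
}
```

Note that this only covers the case where select() actually sees the
closed descriptor. If the descriptor number has already been reused by a
new open() or accept(), select() will not return EBADF, so the
collection still needs proper locking around close().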

Any suggestions?

Cheers,

Bastiaan Bakker
LifeLine Networks bv

> -----Original Message-----
> From: Duncan Grisby [mailto:dgrisby@uk.research.att.com]
> Sent: Friday, February 01, 2002 1:55 PM
> To: Bastiaan Bakker
> Cc: omniorb-list@uk.research.att.com
> Subject: Re: [omniORB] RE: serious stability problems with omniORB4
> snapshots on Solaris 8
>
>
> On Friday 1 February, "Bastiaan Bakker" wrote:
>
> > On Linux snapshot 20011013 appears stable. With 20020103 and 20020130 I
> > get deadlocks after a while, but no crashes like on Solaris. So I'm not
> > sure whether it's the same problem.
>
> Strange. Nothing to do with the transport code has changed between
> those times. Various things that affect timings have changed, though,
> so it could be a race condition that didn't happen before.
>
> Anyway, good luck in tracking it down. I'll look into it soon.
>
> Cheers,
>
> Duncan.
>
> --
>  -- Duncan Grisby  \  Research Engineer  --
>   -- AT&T Laboratories Cambridge          --
>    -- http://www.uk.research.att.com/~dpg1 --