Re: [Nagios-devel] Make sockets non-blocking

Posted: Wed May 05, 2010 9:32 pm
by Guest

On Wed, May 05, 2010 at 10:54:48PM +0200, Andreas Ericsson said:
> On 05/05/2010 03:13 PM, Stephen Gran wrote:
> > On Sat, May 01, 2010 at 05:03:59PM +0100, Stephen Gran said:
> >> Hi there,
> >>
> >> We use NDO for network communication with a custom bit of perl to pass
> >> status updates around. Recently, we've seen that a network flap can
> >> make ndo hang the entire nagios process, which is possibly imperfect :)
> >>
> >> I think I've tracked it down to the write() call in io.c when sending
> >> the actual update to the remote server. The attached patch is a
> >> relatively naive attempt at making this write nonblocking for network
> >> sockets.
> >>
> >> This is a patch against the CVS - if you prefer a git-am style patch,
> >> that's fine. I tried to clone the git off of sourceforge this morning,
> >> and got an empty repo. If there's a better place to clone from, let me
> >> know and I'll fix it up for that.
> >
> > So, it turned out my initial attempt to keep the patch small had some
> > limitations. Working patch attached.
> >
> > To recap, the main problem is that I/O operations are blocking. This is
> > less important to local file or unix sockets, but can block the main
> > nagios process when the I/O operations are tcp based.
> >
>
> It will block on unix sockets too, in case the reader goes to lunch so
> the socket buffer fills up.

Good point. Well, it will be easy enough to add the logic there as
well, if someone wants to.

> > My first attempt merely marked the socket as non-blocking, and added
> > the optional return code to the list handled in the error path. What I
> > found during testing was that this had a few problems.
> >
> > First, the error path adds the return of write() to tbytes. If write
> > returns -1, tbytes was being decremented, resulting in an infinite loop
> > because the loop termination condition became unsatisfiable. Even when
> > correcting the return to 0 before addition, there was still no loop
> > termination condition when write could not succeed. I've hackishly
> > corrected this with a hardcoded maximum number of loops.
> >
> > To make things a little nicer, we don't even want to enter the write()
> > loop if we know we can't write(). We do this with a zero second select()
> > to check if we can write before entering the loop. This is admittedly
> > racy, but I'd frankly rather return early than block the nagios main
> > process.
>
> Humm. I've solved this exact problem in Merlin with a 100 millisecond
> timeout and, failing writability on the socket, just referring to a
> binary backlog which stashes events for me until I need them.
>
> The binlog api is ridiculously simple and very easy to work with, and
> it's separated to its own source + header file. You might want to grab
> that instead of hacking around with blocking calls and possibly partial
> writes inside the module (which the reader then has to deal with).

Yes, scrapping the entire thing and redoing it did occur to me, but I
thought it might not be the best "hello world" patch on the project
mailing list :)

I wasn't too worried about dealing with message queueing, as NDO already
does that part fairly well (as I'm sure you're aware). I think that
short term, the simplest logic may be to scrap the while loop altogether
and just mark the message as failed if you get a partial write. That
way the recipient can ditch it without worrying about reassembling
partial messages, and NDO can deal with it with its usual retry logic.

That too felt like too big a change for my first patch, but that is the
way I was leaning while looking at it.

> > Back to socket creation. We could mark the socket as non-blocking
> > after connect() returns, so that we know that we have a valid fd before
> > carrying on. The problem with this is that the default connect() timeout
> > on the Redhat 5 machine I tested this on is 3 minutes. That is, in my
> > opi

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]