Re: [Nagios-devel] Make sockets non-blocking


On 05/05/2010 03:13 PM, Stephen Gran wrote:
> On Sat, May 01, 2010 at 05:03:59PM +0100, Stephen Gran said:
>> Hi there,
>>
>> We use NDO for network communication with a custom bit of perl to pass
>> status updates around. Recently, we've seen that a network flap can
>> make ndo hang the entire nagios process, which is possibly imperfect :)
>>
>> I think I've tracked it down to the write() call in io.c when sending
>> the actual update to the remote server. The attached patch is a
>> relatively naive attempt at making this write nonblocking for network
>> sockets.
>>
>> This is a patch against the CVS - if you prefer a git-am style patch,
>> that's fine. I tried to clone the git off of sourceforge this morning,
>> and got an empty repo. If there's a better place to clone from, let me
>> know and I'll fix it up for that.
>
> So, it turned out my initial attempt to keep the patch small had some
> limitations. Working patch attached.
>
> To recap, the main problem is that I/O operations are blocking. This is
> less important to local file or unix sockets, but can block the main
> nagios process when the I/O operations are tcp based.
>

It will block on unix sockets too, if the reader goes to lunch and the
socket buffer fills up.

> My first attempt merely marked the socket as non-blocking, and added
> the optional return code to the list handled in the error path. What I
> found during testing was that this had a few problems.
>
> First, the error path adds the return of write() to tbytes. If write
> returns -1, tbytes was being decremented, resulting in an infinite loop
> because the loop termination condition became unsatisfiable. Even when
> correcting the return to 0 before addition, there was still no loop
> termination condition when write could not succeed. I've hackishly
> corrected this with a hardcoded maximum number of loops.
>
> To make things a little nicer, we don't even want to enter the write()
> loop if we know we can't write(). We do this with a zero second select()
> to check if we can write before entering the loop. This is admittedly
> racy, but I'd frankly rather return early than block the nagios main
> process.
>
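
For reference, the loop described above could look roughly like the
sketch below. It is not the attached patch: the names set_nonblocking(),
writable_now(), bounded_write() and MAX_WRITE_ATTEMPTS are made up for
illustration, and the cap of 10 simply mirrors the number mentioned later
in the mail. The point is that only successful write() counts are added
to tbytes, and that the loop always has a way out.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical retry cap, standing in for the hardcoded loop limit. */
#define MAX_WRITE_ATTEMPTS 10

/* Mark fd as non-blocking; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Zero-timeout select(): 1 if writable now, 0 if not, -1 on error. */
static int writable_now(int fd)
{
    fd_set wfds;
    struct timeval tv = {0, 0};

    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    return select(fd + 1, NULL, &wfds, NULL, &tv);
}

/*
 * Write up to len bytes without blocking. Returns the number of bytes
 * written (possibly fewer than len, possibly 0), or -1 on a hard error
 * before anything was written.
 */
static ssize_t bounded_write(int fd, const char *buf, size_t len)
{
    size_t tbytes = 0;
    int attempts;
    int ready;

    assert(buf != NULL);

    ready = writable_now(fd);
    if (ready <= 0)
        return ready; /* 0: would block, skip the loop; -1: select failed */

    for (attempts = 0; tbytes < len && attempts < MAX_WRITE_ATTEMPTS; attempts++) {
        ssize_t n = write(fd, buf + tbytes, len - tbytes);

        if (n > 0) {
            tbytes += (size_t)n; /* never add a -1 to the running total */
        } else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK &&
                   errno != EINTR) {
            return tbytes > 0 ? (ssize_t)tbytes : -1;
        }
        /* EAGAIN/EINTR or a zero-byte write: retry, adding nothing */
    }
    return (ssize_t)tbytes;
}
```

A caller has to treat a short return as a partial write and decide what
to do with the remaining bytes; the sketch deliberately does not hide
that.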

Humm. I've solved this exact problem in Merlin with a 100 millisecond
timeout and, failing writability on the socket, just falling back to a
binary backlog which stashes events for me until I need them.

The binlog api is ridiculously simple and very easy to work with, and
it's separated to its own source + header file. You might want to grab
that instead of hacking around with blocking calls and possibly partial
writes inside the module (which the reader then has to deal with).
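
To make the idea concrete, here is a minimal sketch of "wait up to 100
ms for writability, otherwise stash the event". It does not use Merlin's
binlog API: struct backlog, backlog_push(), backlog_flush() and
send_or_stash() are hypothetical names, and the backlog is a plain
in-memory list rather than a binary log on disk. It also assumes a
platform that provides MSG_DONTWAIT (Linux and the BSDs do).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical in-memory backlog; NOT Merlin's binlog API. */
struct backlog_entry {
    struct backlog_entry *next;
    size_t len;
    char data[];
};

struct backlog {
    struct backlog_entry *head, *tail;
    size_t count;
};

/* Append a copy of data to the backlog; 0 on success, -1 on ENOMEM. */
static int backlog_push(struct backlog *bl, const void *data, size_t len)
{
    struct backlog_entry *e = malloc(sizeof(*e) + len);

    if (!e)
        return -1;
    e->next = NULL;
    e->len = len;
    memcpy(e->data, data, len);
    if (bl->tail)
        bl->tail->next = e;
    else
        bl->head = e;
    bl->tail = e;
    bl->count++;
    return 0;
}

/* select() for writability: >0 writable, 0 timed out, -1 error. */
static int wait_writable(int fd, int timeout_ms)
{
    fd_set wfds;
    struct timeval tv;

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    return select(fd + 1, NULL, &wfds, NULL, &tv);
}

/* Returns 1 if sent in full, 0 if (partly) stashed, -1 if stashing failed. */
static int send_or_stash(int fd, struct backlog *bl, const void *data, size_t len)
{
    const char *p = data;
    ssize_t n = 0;

    assert(bl != NULL && data != NULL);

    /* Keep ordering: never send past events that are still queued. */
    if (bl->count == 0 && wait_writable(fd, 100) > 0) {
        n = send(fd, p, len, MSG_DONTWAIT);
        if (n == (ssize_t)len)
            return 1;
        if (n < 0)
            n = 0;
    }
    /* Stash whatever was not sent (all of it, or the unsent tail). */
    return backlog_push(bl, p + n, len - (size_t)n) == 0 ? 0 : -1;
}

/* Drain queued events while the socket stays writable; returns entries sent. */
static size_t backlog_flush(int fd, struct backlog *bl)
{
    size_t sent = 0;

    while (bl->head && wait_writable(fd, 0) > 0) {
        struct backlog_entry *e = bl->head;
        ssize_t n = send(fd, e->data, e->len, MSG_DONTWAIT);

        if (n <= 0)
            break;
        if ((size_t)n < e->len) { /* partial: keep the unsent tail queued */
            memmove(e->data, e->data + n, e->len - (size_t)n);
            e->len -= (size_t)n;
            break;
        }
        bl->head = e->next;
        if (!bl->head)
            bl->tail = NULL;
        bl->count--;
        free(e);
        sent++;
    }
    return sent;
}
```

In this shape the module never blocks for more than the 100 ms wait, and
a periodic event can call backlog_flush() once the reader catches up.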

> Back to socket creation. We could mark the socket as non-blocking
> after connect() returns, so that we know that we have a valid fd before
> carrying on. The problem with this is that the default connect() timeout
> on the Redhat 5 machine I tested this on is 3 minutes. That is, in my
> opinion, again too long a time to block the main nagios process for.
>

Definitely. You want to set it to non-blocking, fire off the connect()
and then check whether it's writable to see if the connection succeeded.
The polling is best done later, in a scheduled event of its own: on a
non-blocking socket connect() returns immediately, so if you poll right
away the request most likely won't even have reached the remote end yet.

> What I've done instead is to mark the socket as non-blocking before
> calling connect(). In the connect() routine, if connect() sets errno
> to EINPROGRESS, we select() on the socket for 15 seconds to see if it
> succeeds. If it does not, we enter the normal error path for connect()
> failures.
>

15 seconds sounds like a lot imo.

> Arguably my choices of 10 loops for termination in write() and 15
> seconds for connect() are not right for everyone. They could be moved
> to configuration options, or they could be taken from existing options,
> or something. At this point, it Works For Me(TM), which is good enough
> at this point. If people have any specific objections, I woul

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]