Re: [Nagios-devel] Make sockets non-blocking
Posted: Wed May 05, 2010 12:13 pm
--oyUTqETQ0mS9luUI
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
On Sat, May 01, 2010 at 05:03:59PM +0100, Stephen Gran said:
> Hi there,
>=20
> We use NDO for network communication with a custom bit of perl to pass
> status updates around. Recently, we've seen that a network flap can
> make ndo hang the entire nagios process, which is possibly imperfect
>=20
> I think I've tracked it down to the write() call in io.c when sending
> the actual update to the remote server. The attached patch is a
> relatively naive attempt at making this write nonblocking for network
> sockets.
>=20
> This is a patch against the CVS - if you prefer a git-am style patch,
> that's fine. I tried to clone the git off of sourceforge this morning,
> and got an empty repo. If there's a better place to clone from, let me
> know and I'll fix it up for that.
So, it turned out my initial attempt to keep the patch small had some
limitations. Working patch attached.
To recap, the main problem is that I/O operations are blocking. This is
less important to local file or unix sockets, but can block the main
nagios process when the I/O operations are tcp based.
My first attempt merely marked the socket as non-blocking, and added
the optional return code to the list handled in the error path. What I
found during testing was that this had a few problems.
First, the error path adds the return of write() to tbytes. If write
returns -1, tbytes was being decremented, resulting in an infinite loop
because the loop termination condition became unsatisfiable. Even when
correcting the return to 0 before addition, there was still no loop
termination condition when write could not succeed. I've hackishly
corrected this with a hardcoded maximum number of loops.
To make things a little nicer, we don't even want to enter the write()
loop if we know we can't write(). We do this with a zero second select()
to check if we can write before entering the loop. This is admittedly
racy, but I'd frankly rather return early than block the nagios main
process.
Back to socket creation. We could mark the socket as non-blocking
after connect() returns, so that we know that we have a valid fd before
carrying on. The problem with this is that the default connect() timeout
on the Redhat 5 machine I tested this on is 3 minutes. That is, in my
opinion, again too long a time to block the main nagios process for.
What I've done instead is to mark the socket as non-blocking before
calling connect(). In the connect() routine, if connect() sets errno
to EINPROGRESS, we select() on the socket for 15 seconds to see if it
succeeds. If it does not, we enter the normal error path for connect()
failures.
Arguably my choices of 10 loops for termination in write() and 15
seconds for connect() are not right for everyone. They could be moved
to configuration options, or they could be taken from existing options,
or something. At this point, it Works For Me(TM), which is good enough
at this point. If people have any specific objections, I would of course
be happy to tidy it up for them.
Cheers,
diff -Nru ndoutils-1.4b7-bak/src/io.c ndoutils-1.4b7/src/io.c
--- ndoutils-1.4b7-bak/src/io.c 2007-01-08 00:35:50.000000000 +0000
+++ ndoutils-1.4b7/src/io.c 2010-05-05 13:41:42.000000000 +0100
@@ -203,12 +203,34 @@
server_address_i.sin_family=3DAF_INET;=20
server_address_i.sin_port=3Dhtons(port);
=20
+ int x;
+ x=3Dfcntl(newfd,F_GETFL,0);
+ fcntl(newfd,F_SETFL,x | O_NONBLOCK);
+
/* connect to the socket */
if((connect(newfd,(struct sockaddr *)&server_address_i,size=
of(server_address_i)))){
- close(newfd);
- return NDO_ERROR;
- }
+ if(errno=3D=3DEINPROGRESS) {
+ fd_set wfds;
+ int retval =3D 0;
+ struct timeval tv;
+
+ tv.tv_sec =3D 15;
+
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]