
NSCA close/POLLNVAL/accept bug - patch attached

Posted: Tue Sep 30, 2014 3:53 am
by mib
Attachment: nsca-2.7.2-uom.patch (1.28 KiB)
I believe I've found a significant bug in the nsca daemon. Under heavy connection load it hangs nsca permanently every few days and, more frequently, freezes it until the remote end disconnects.

In summary: nsca closes client connection file descriptors on error or EOF and relies on poll returning POLLNVAL to clean up after them (in particular to remove their handlers and take them out of the poll set). If, in the same pass through the loop, accept() returns the same FD number as a recently closed FD, poll never reports POLLNVAL for it, the old handler stays registered, and the wrong handler will be run on the new connection.

This can result in calling recv() on an FD with no data to read and O_NONBLOCK not set.

Awiddershiem's commit to nsca-aw here works around this bug by setting O_NONBLOCK earlier, but that code is quite divergent from the nsca core now.

Attached (I hope) is a minimal-change patch that fixes the behavior. It's not the best patch, which would be to redesign nsca to manage the handlers and remove the one-shot behavior where the loop clears them each time (which results in amusing comments such as "DO NOT REMOVE! 01/29/2007 single process daemon will fail if this is removed").

This patch is against 2.7.2 but it should apply cleanly to the current r2763. It also includes a fix for another problem that I don't think can actually be triggered: the poll loop could theoretically run a handler for a just-added FD, because npfds++ runs inside the loop whose condition tests against npfds.

I'm happy to go through the bug code path in more detail if anyone wants; it was quite tricky to track down.

Re: NSCA close/POLLNVAL/accept bug - patch attached

Posted: Tue Sep 30, 2014 10:24 am
by tmcdonald
Thanks for the patch! While the support forums get seen a lot by the support staff, our developers are busy high up on a snow-covered mountain peak in a monastery working on their code and they don't often visit the forums. A great place to submit bug reports (or better yet, bug reports with patches included) would be http://tracker.nagios.com for our Nagios XI product, and http://tracker.nagios.org for our Nagios Core product.

Re: NSCA close/POLLNVAL/accept bug - patch attached

Posted: Tue Sep 30, 2014 6:48 pm
by mib
Is NSCA part of Nagios Core? When I downloaded the Nagios Core tarball it wasn't included.

Re: NSCA close/POLLNVAL/accept bug - patch attached

Posted: Tue Sep 30, 2014 7:00 pm
by mib
I have logged this on the tracker as:

http://tracker.nagios.org/view.php?id=644

Re: NSCA close/POLLNVAL/accept bug - patch attached

Posted: Wed Oct 01, 2014 3:28 pm
by sreinhardt
Thanks for the tracker bug! No, Core and NSCA are separate products, and neither includes the other. They do work together, but one is not needed for the other to operate in any way.

Re: NSCA close/POLLNVAL/accept bug - patch attached

Posted: Wed Oct 01, 2014 8:55 pm
by mib
My question is really:

Who, if anyone, is maintaining NSCA and would fix this bug?

I ask because it appears that the commercial Nagios XI doesn't use NSCA, NSCA is not part of nagios-core, and two-year-old bug reports in the NSCA part of the tracker are still in "NEW" state.

Re: NSCA close/POLLNVAL/accept bug - patch attached

Posted: Thu Oct 02, 2014 10:42 am
by sreinhardt
Actually, we do use nsca in several of our commercial products, and do maintain it as issues become apparent. Most likely either myself or one of the other C devs will take a look and modify or apply the patch as needed. I have some plugins work that is going to take precedence, but I will absolutely take a look at this patch and implementation of it if no one else beats me to it.

Re: NSCA close/POLLNVAL/accept bug - patch attached

Posted: Fri Oct 03, 2014 12:37 am
by mib
Fantastic, thanks for your prompt responses and assistance.

Re: NSCA close/POLLNVAL/accept bug - patch attached

Posted: Fri Oct 03, 2014 12:04 pm
by abrist
No problem. We actually maintain the github repo for nsca: https://github.com/NagiosEnterprises/nsca
It would be a great help if you could submit your patch with your explanation there.
Thanks!

Re: NSCA close/POLLNVAL/accept bug - patch attached

Posted: Mon Oct 06, 2014 2:48 am
by mib
The minimal work-around patch is in this commit:

https://github.com/mbattersby/nsca/comm ... 468d03f514

for which I also sent a pull request.