NSCA close/POLLNVAL/accept bug - patch attached
Posted: Tue Sep 30, 2014 3:53 am
I believe I've found a significant bug in the nsca daemon. Under heavy connection load we are seeing this hang nsca (forever) every few days, and more frequently freeze until the remote disconnects.
In summary: nsca closes client connection file descriptors on error or eof and relies on poll returning POLLNVAL to clean up after them (in particular to remove their handlers and take them out of the poll set). If in the same loop accept() returns the same FD as a recently closed FD the wrong handler will be run on it.
This can result in calling recv() on an FD with no data to read and O_NONBLOCK not set.
Awiddershiem's commit to nsca-aw here here works around this bug by setting O_NONBLOCK earlier, but that code is quite divergent from the nsca core now.
Attached (I hope) is a minimal-change patch that fixes the behavior. It's not the best patch, which would be to redesign nsca to manage the handlers and remove the one-shot behavior where the loop clears them each time (which results in amusing comments such as "DO NOT REMOVE! 01/29/2007 single process daemon will fail if this is removed").
This patch is against 2.7.2 but it should apply cleanly to the current r2763. It includes a fix to another problem I don't think can actually be triggered, where the poll loop could theoretically run a handler for a just-added FD because npfds++ is run inside the loop which tests against <npfds.
I'm happy to go through the bug code path in more detail if anyone wants, it was quite tricky to track down.
In summary: nsca closes client connection file descriptors on error or eof and relies on poll returning POLLNVAL to clean up after them (in particular to remove their handlers and take them out of the poll set). If in the same loop accept() returns the same FD as a recently closed FD the wrong handler will be run on it.
This can result in calling recv() on an FD with no data to read and O_NONBLOCK not set.
Awiddershiem's commit to nsca-aw here here works around this bug by setting O_NONBLOCK earlier, but that code is quite divergent from the nsca core now.
Attached (I hope) is a minimal-change patch that fixes the behavior. It's not the best patch, which would be to redesign nsca to manage the handlers and remove the one-shot behavior where the loop clears them each time (which results in amusing comments such as "DO NOT REMOVE! 01/29/2007 single process daemon will fail if this is removed").
This patch is against 2.7.2 but it should apply cleanly to the current r2763. It includes a fix to another problem I don't think can actually be triggered, where the poll loop could theoretically run a handler for a just-added FD because npfds++ is run inside the loop which tests against <npfds.
I'm happy to go through the bug code path in more detail if anyone wants, it was quite tricky to track down.