Re: NSCA stops working sometimes
Posted: Tue Jul 01, 2014 10:28 am
I see two realistic solutions to this issue, off the top of my head.
1) Add freshness checks to some or all of your passive services, with a freshness threshold well past their normal reporting interval. When a result goes stale, Nagios runs the service's check command, so set that to something like check_dummy!2!"No passive results returned in 1 hour! Check xinetd."!!! That way a stale service always goes critical, and the message tells you what is likely failing.
2) A second option is similar but might handle it a bit better. Have a check script submit a passive result through NSCA (presuming that is your choice for passive results), sleep for 10-30 seconds, and then look up the Nagios service via the web UI, JSON API, etc. The script returns OK only if the passive result was received. If it was not, you can use that failure to kick off a local event handler that restarts the xinetd service and resolves the issue immediately.
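For option 2, here is a rough sketch of what such a round-trip check could look like in Python. It is only an illustration under some assumptions: send_nsca is on the PATH, its config lives at /etc/nagios/send_nsca.cfg, and fetch_output is a hypothetical callable you would write yourself to pull the current plugin output for the canary service from the web UI or JSON API. All function names here are made up for the example.

```python
#!/usr/bin/env python3
"""Round-trip check for NSCA: submit a passive result, wait, confirm Nagios saw it."""
import subprocess
import time

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def format_nsca_line(host, service, code, output):
    """Build the tab-delimited line that send_nsca reads on stdin."""
    return f"{host}\t{service}\t{code}\t{output}\n"


def submit_passive(line, nsca_host, config="/etc/nagios/send_nsca.cfg"):
    """Pipe one result to send_nsca (binary location and config path are assumptions)."""
    subprocess.run(
        ["send_nsca", "-H", nsca_host, "-c", config],
        input=line, text=True, check=True,
    )


def evaluate(token, fetch_output):
    """Return (exit_code, message) depending on whether Nagios shows our token.

    fetch_output is a hypothetical callable returning the plugin output that
    Nagios currently holds for the canary service.
    """
    if token in fetch_output():
        return OK, f"OK - passive result {token} was received"
    return CRITICAL, f"CRITICAL - passive result {token} not received; check xinetd"


def run(nsca_host, host, service, fetch_output, wait=20):
    """Submit a uniquely tagged result, sleep, then verify it arrived."""
    token = f"canary-{int(time.time())}"
    submit_passive(format_nsca_line(host, service, OK, token), nsca_host)
    time.sleep(wait)  # give NSCA and Nagios time to process the result
    return evaluate(token, fetch_output)
```

A wrapper would call run(), print the message, and exit with the returned code. The CRITICAL exit is what an event handler on this service could react to in order to restart xinetd.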