NCPA Stops Working - No errors found
Posted: Fri Dec 11, 2015 1:22 pm
First off, my apologies if this is not the right place to post this. I haven't been on the forums in years; I usually find answers by searching the web. I didn't know if this would be the appropriate place for my question, but I searched for, and found, other NCPA posts that had been posted here before.
I've been monitoring a decently sized network for a couple of years now, with no problems. We turned up a new platform of about 15-20 hosts that is not directly accessible by our Nagios Core host. I installed and set up NCPA to do passive monitoring of these hosts. After figuring out passive monitoring, I loaded up each of the compute nodes with a simple NCPA config that monitored the root disk space and checked the status of 4 or 5 critical processes. The latter was done via plugins and a check_services script. Everything was working and I was beaming with pride in this new insight into the health of this platform.
Less than 12 hours later I was surprised to receive an alert from one of the hosts. Initially I was worried about the problem, but then realized that *all* of the services were being reported as down. That meant either the box died or NCPA malfunctioned. I logged in and checked the status of everything; all was good. I restarted the ncpa service and, boom, everything was restored as good within a minute. I shrugged it off as a hiccup. Then a few hours later another box did the exact same thing. I logged in, verified the services were good, verified that the NCPA service was running, checked the log (INFO level) and saw nothing. I restarted ncpa_passive and everything was good yet again. I changed all boxes to debug level logging, and when it happened twice more I checked the log and again saw nothing. By all appearances, ncpa_passive just... stops. No errors, the process is still running, it's just not doing anything. The last log message is from the prior service checks.
Of the 16 or so boxes I am monitoring, I am averaging 1-2 of these events per day. I can understand the occasional hiccup, but this is happening *way* too often. I'm running the latest NCPA release, installed on virtually identical CentOS 7 installations. The boxes that go down are different each time, so it's not like I have a single 'problem child'. I didn't post my config, because I didn't believe it would be relevant to an issue such as this. The service works just fine, sometimes for days, then just... stops.
I'd hate to have to install another solution, because NCPA is the future, is cross-platform, and seems to be working well. I just need it to be reliable. Without any errors in the logs, I'm not sure what else I can say or do to fix the problem.
Has anyone else had a similar issue? Have an idea for further troubleshooting?
FWIW, I've been running NCPA for quite a long time now on a few Windows servers in our environment, active checks though, and they've been working just fine.
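In case it helps anyone suggest something: since the only symptom is that the log stops growing while the process stays up, one way to catch a stall as it happens might be to compare the log file's modification time against the check interval. Below is a minimal sketch of that idea. The function names are my own, and the log path and threshold you pass in are placeholders, not NCPA defaults I have verified. It only detects a stale log; it does not restart anything or explain why the stall happens.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(path, max_age_seconds, now=None):
    """Return True if the log at `path` has not been written to
    within the last `max_age_seconds` seconds.

    `path` is whatever log file the passive service writes to on your
    system (an assumption you must fill in); `max_age_seconds` should be
    comfortably larger than the passive check interval.
    """
    return seconds_since_modified(path, now) > max_age_seconds
```

Something like `is_stalled(log_path, 600)` could be called from a cron job every few minutes, with the threshold set to roughly twice the passive check interval. This assumes debug logging writes at least one line per check cycle, which matches what I see in my logs.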