First off, my apologies if this is not the right place to post this. I haven't been on the forums in years; I usually find answers by searching the web. I wasn't sure this was the appropriate place for my question, but I searched and found other NCPA posts here.
I've been monitoring a decently sized network for a couple of years now, with no problem. We turned up a new platform of about 15-20 hosts that is not directly accessible by our Nagios Core host. I installed and set up NCPA to do passive monitoring of these hosts. After figuring out passive monitoring, I loaded up each of the compute nodes with a simple NCPA config that monitored the root disk space and checked the status of 4 or 5 critical processes. The latter was done via plugins and a check_services script. Everything was working and I was beaming with pride in this new insight into the health of this platform.
Less than 12 hours later I was surprised to receive an alert from one of the hosts. Initially I was worried about the problem, but then I realized that *all* of the services were being reported as down. That meant either the box died or NCPA malfunctioned. I logged in and checked the status of everything; all was good. I restarted the ncpa service and, boom, everything was restored within a minute. I shrugged it off as a hiccup. Then a few hours later another box did the exact same thing. I logged in, verified the services were good, verified that the NCPA service was running, checked the log (INFO level) and saw nothing. I restarted ncpa_passive and everything was good yet again. I changed all boxes to debug level logging, and when it happened twice more I checked the log and again saw nothing. By all appearances, ncpa_passive just... stops. No errors, the process is still running, it's just not doing anything. The last log message is from the prior service checks.
Of the 16 or so boxes I am monitoring, I am averaging 1-2 of these events per day. I can understand the occasional hiccup, but this is happening *way* too often. I'm running the latest NCPA release, installed on virtually identical CentOS 7 systems. The boxes that go down are different each time, so it's not like I have a single 'problem child'. I didn't post my config because I didn't believe it would be relevant to an issue such as this. The service works just fine, sometimes for days, then just... stops.
I'd hate to try and install another solution, because NCPA is the future, is cross-platform, and seems to be working well; I just need it to be reliable. Without any errors in the logs, I'm not sure what else I can say or do to fix the problem.
Has anyone else had a similar issue? Have an idea for further troubleshooting?
FWIW, I've been running NCPA for quite a long time now on a few Windows servers in our environment, active checks though, and they've been working just fine.
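For the symptom described above (the process still running but the log going quiet), one troubleshooting aid is to watch how long it has been since the log file was last written. The sketch below is only an illustration under assumptions: the helper names are made up, the log path is whatever your installation uses, and it presumes the agent writes something to the log on every check cycle (as debug-level logging appeared to do here).

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return how many seconds have passed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(path, max_age_seconds, now=None):
    """True when the log has not been written for longer than max_age_seconds.

    Assumption: the agent logs on every check cycle, so a quiet log
    means the passive sender has stopped doing work.
    """
    return seconds_since_modified(path, now) > max_age_seconds
```

A cron job could call `is_stalled()` on the passive log with a threshold of a few check intervals and restart the service or send a notice when it returns True. That only works around the stall; it does not explain it.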
NCPA Stop Working - No errors found
-
pottedFern
- Posts: 3
- Joined: Fri Dec 11, 2015 12:50 pm
Re: NCPA Stop Working - No errors found
ncpa_passive.log
There's nothing in the main system log.
-
pottedFern
- Posts: 3
- Joined: Fri Dec 11, 2015 12:50 pm
Re: NCPA Stop Working - No errors found
So, I had another system stop, and I looked a little deeper this time. Until now I had been focusing on the agent itself, not anything else. What looks like is happening is that the service check script hangs. NCPA just sits there waiting for the check to finish; there's no timeout or anything. So they both sit there waiting for something to happen, meanwhile my passive checks aren't coming in, and so the system alarms. I'm using a third-party service checker because NCPA's service analysis isn't working. <cough cough>
I'm pretty confident this is the cause of the problem. I am not a programmer, but it seems like NCPA should have a timeout on plugin execution, I'll try and file a bug/request.
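Until the agent enforces its own limit, one way to keep a hung check from blocking everything is to put a small wrapper with a timeout around the plugin. This is a minimal sketch, not anything NCPA ships: `run_with_timeout` is a made-up helper name, and reporting a timeout as UNKNOWN (exit code 3) is just the usual Nagios plugin convention.

```python
import subprocess

# Nagios plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN
UNKNOWN = 3


def run_with_timeout(command, timeout_seconds=15):
    """Run a plugin command and return (exit_code, output).

    If the plugin does not finish within timeout_seconds it is killed
    and reported as UNKNOWN instead of blocking the caller forever.
    Note: only the direct child is killed; processes the plugin itself
    started may need separate handling.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        message = "UNKNOWN - plugin timed out after {}s".format(timeout_seconds)
        return UNKNOWN, message
    except OSError as exc:
        # The plugin could not be started at all (missing file, bad permissions).
        return UNKNOWN, "UNKNOWN - could not run plugin: {}".format(exc)
    return completed.returncode, completed.stdout.strip()
```

A wrapper script placed in the plugins directory could call `run_with_timeout()` on the real check, print the returned output, and exit with the returned code, so a hang turns into an UNKNOWN result rather than silence.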
Re: NCPA Stop Working - No errors found
You can use the "-T" flag to specify a timeout...
[root@localhost libexec]# ./check_ncpa.py -h
Usage: check_ncpa.py [options]

Options:
  -h, --help            show this help message and exit
  -H HOSTNAME, --hostname=HOSTNAME
                        The hostname to be connected to.
  -M METRIC, --metric=METRIC
                        The metric to check, this is defined on client system.
                        This would also be the plugin name in the plugins
                        directory. Do not attach arguments to it, use the -a
                        directive for that. DO NOT INCLUDE the api/
                        instruction.
  -P PORT, --port=PORT  Port to use to connect to the client.
  -w WARNING, --warning=WARNING
                        Warning value to be passed for the check.
  -c CRITICAL, --critical=CRITICAL
                        Critical value to be passed for the check.
  -u UNIT, --unit=UNIT  The unit prefix (M, G, T)
  -n UNITS, --units=UNITS
                        What should be used in place of the default unit. As
                        in, instead of 'b' as a unit, it will use this.
  -a ARGUMENTS, --arguments=ARGUMENTS
                        Arguments for the plugin to be run. Not necessary
                        unless you're running a custom plugin. Given in the
                        same as you would call from the command line. Example:
                        -a '-w 10 -c 20 -f /usr/local'
  -t TOKEN, --token=TOKEN
                        The token for connecting.
  -T TIMEOUT, --timeout=TIMEOUT
                        Enforced timeout, will terminate plugins after this
                        amount of seconds. [15]
  -d, --delta           Signals that this check is a delta check and a local
                        state will kept.
  -l, --list            List all values under a given node. Do not perform a
                        check.
  -v, --verbose         Print more verbose error messages.
  -s, --super-verbose   Print LOTS of error messages.
  -V, --version         Print version number of plugin.
  -q QUERYARGS, --queryargs=QUERYARGS
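As a rough illustration of the flag in use, the sketch below assembles a check_ncpa.py command line with an explicit "-T" value, using only options listed in the help output above. The host address, token, and metric path are placeholders invented for the example, and `build_check_command` is a hypothetical helper. It also assumes the check is issued through check_ncpa.py, which connects to the agent from the monitoring side.

```python
import shlex


def build_check_command(hostname, token, metric, timeout_seconds=15,
                        script="./check_ncpa.py"):
    """Build the argument list for check_ncpa.py with an enforced timeout."""
    return [
        script,
        "-H", hostname,              # host to connect to
        "-t", token,                 # token for connecting
        "-M", metric,                # metric or plugin name (placeholder here)
        "-T", str(timeout_seconds),  # terminate the plugin after this many seconds
    ]


# Placeholder values, for illustration only.
command = build_check_command(
    "192.0.2.10", "mytoken", "agent/plugin/check_services.sh", timeout_seconds=30
)
print(" ".join(shlex.quote(part) for part in command))
```

Building the arguments as a list keeps each value a separate argument when it is later handed to something like `subprocess.run`, so a token or path with unusual characters does not need manual shell quoting.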