NPCD System Status Issue

Posted: Tue Apr 21, 2015 11:35 pm
by rseiwert
While this might seem a lot like some of my other recent topics, it is something completely different. Last night I noticed my graphs had stopped updating. I checked, and npcd was reported as running. Further investigation showed it was not. This is yet another issue where XI system status is not reporting the true state.

Checking from the command line
[root@nagios nagios]# /etc/init.d/npcd status
NPCD running (pid 1519).
[root@nagios nagios]# ps -ef | grep 1519 | grep -v grep
root 1519 1 0 Apr21 ? 00:00:02 crond
root 64932 1519 0 00:23 ? 00:00:00 CROND
root 64933 1519 0 00:23 ? 00:00:00 CROND
root 64934 1519 0 00:23 ? 00:00:00 CROND
root 64935 1519 0 00:23 ? 00:00:00 CROND
root 64936 1519 0 00:23 ? 00:00:00 CROND

Of course, after clicking the gear and restarting npcd, you can guess what happened next: crond was killed, so cron jobs stopped running, including all of the nagios cron processes.
[root@nagios nagios]# ps -ef | grep crond | grep -v grep
[root@nagios nagios]#

Yet another time where sysstat.php (which drives those green checks) reported bogus info, and where the XI interface killed off a critical system component because it looked at a PID in a file without bothering to check whether that PID really belonged to that process. These system health indicators need to do more than trust the init script. Improving the init.d script is the first step, but if there is stale performance data queuing up and not being processed, then maybe the performance grapher is not running and doesn't deserve a green check mark.
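
To illustrate the kind of check I mean, here is a minimal sketch of a status test that confirms the PID in the pidfile really belongs to the expected daemon. The function names (proc_name, pid_matches_name) are made up for this example and are not part of the XI init scripts or sysstat.php. It assumes a Linux /proc filesystem and falls back to ps elsewhere.

```shell
# Sketch only: proc_name and pid_matches_name are hypothetical helpers,
# not taken from the npcd init script or from Nagios XI.

# Print the command name of the process with the given PID.
proc_name() {
    if [ -r "/proc/$1/status" ]; then
        # Linux: the "Name:" line holds the command name (max 15 chars).
        sed -n 's/^Name:[[:space:]]*//p' "/proc/$1/status"
    else
        # Elsewhere: POSIX ps, command name only, no header line.
        ps -p "$1" -o comm= 2>/dev/null
    fi
}

# Usage: pid_matches_name <pidfile> <expected-command-name>
# Returns 0 only if the pidfile is readable, holds a numeric PID, and
# that PID is currently a process with the expected command name.
pid_matches_name() {
    pidfile=$1
    expected=$2
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric content
    esac
    actual=$(proc_name "$pid")
    [ -n "$actual" ] || return 1   # no such process: stale pidfile
    # basename strips a leading path in case ps reports a full path.
    [ "$(basename "$actual")" = "$expected" ]
}
```

With something like this, a pidfile containing 1519 would fail the test for npcd in my case, because PID 1519 was crond. The pidfile path depends on your install, so pass in whatever your init script uses.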

Re: NPCD System Status Issue

Posted: Wed Apr 22, 2015 1:51 pm
by jdalrymple
You're right. The method we're using (calling the init script) is obviously inadequate for getting valid data on the service. We'll either have to update the init scripts ourselves or have the poller work around it.

I'll share your situation with the devs and file a bug.

Re: NPCD System Status Issue

Posted: Wed Apr 22, 2015 8:18 pm
by Box293
I made some feature requests about providing some more "localhost" service checks to detect problems like spooled check results building up. Feel free to try them out, and if you think they could be useful, comment on them in the tracker.

http://tracker.nagios.com/view.php?id=635
http://tracker.nagios.com/view.php?id=636
http://tracker.nagios.com/view.php?id=641
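
For anyone who wants a rough idea of what such a check could look like, here is a minimal sketch that flags files sitting in a spool directory for longer than a given number of minutes. It is not the code from the tracker items above, which may take a different approach. The function name check_stale_spool is made up for this example, the spool directory is whatever your install uses, and it assumes a find that supports -mmin (GNU and BSD find do) and filenames without embedded newlines.

```shell
# Sketch only: check_stale_spool is a hypothetical helper, not an
# existing Nagios plugin. Return codes follow the usual plugin
# convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

# Usage: check_stale_spool <spool-dir> <max-age-minutes>
check_stale_spool() {
    dir=$1
    max_age=$2
    if [ ! -d "$dir" ]; then
        echo "UNKNOWN: $dir is not a directory"
        return 3
    fi
    # Count regular files last modified more than max_age minutes ago.
    # tr strips the padding some wc implementations add.
    stale=$(find "$dir" -type f -mmin +"$max_age" | wc -l | tr -d ' ')
    if [ "$stale" -gt 0 ]; then
        echo "CRITICAL: $stale file(s) in $dir older than $max_age min"
        return 2
    fi
    echo "OK: no files in $dir older than $max_age min"
    return 0
}
```

Pointed at the performance data spool, a CRITICAL result would suggest the files are not being processed even when the init script says the daemon is running.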