Suggestion for System Status

Posted: Tue Jul 02, 2013 10:22 pm
by Box293
Occasionally I have come across a problem where performance graphs were no longer being updated. My fix each time was to restart the Nagios XI host, and everything came good again. Recently I figured out what is going on, and I have a suggestion, assuming my findings are correct.

In a nutshell, when the NPCD daemon is terminated for one reason or another, the Nagios XI system status does not detect that this has occurred. To the admin looking at Nagios XI, the system appears OK (six ticks along the top bar).

I discovered this on a test XI VM I deployed recently. I added a Windows server to test some new checks, and when I came back a day later no performance graphs were being generated.

I found that the "/usr/local/nagios/var/spool/perfdata" folder contained about 80,000 files.

When I had a look at "/usr/local/nagios/var/npcd.log" I found the following lines:

Code:

[06-07-2013 15:28:16] NPCD: WARN: MAX load reached: load 12.210000/10.000000 at i=0
[06-23-2013 08:59:08] NPCD: Caught Termination Signal - Hasta la vista... baby

Ah hah. I was not exactly sure how to start NPCD again, so I simply rebooted the Nagios XI VM. Once it was back up and running, the number of files in spool/perfdata dropped over time and eventually everything was working as expected. The server now had performance graphs which included all of the spooled perfdata that had backed up.

I understand why the NPCD daemon was terminated; this is not a discussion about max loads etc.

From what I can determine, when the NPCD Daemon is terminated, and the spool/perfdata files are building up, there is no Monitoring Engine Status / System status check / dashlet that identifies there is a problem with NPCD.

So my suggestion is that the XI System Component Status, or the System OK status (or somewhere else), should include a check that alerts the admin when the NPCD daemon is terminated or spool/perfdata files are building up.
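
To make the backlog half of that suggestion concrete, here is a rough sketch of what such a check could look like. It is not how Nagios XI does anything today; the function name check_spool_backlog and the warn/crit thresholds are made up for illustration, and the return codes just follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import os


def check_spool_backlog(spool_dir, warn=1000, crit=10000):
    """Count regular files waiting in spool_dir.

    Returns (exit_code, message) using Nagios plugin exit codes.
    The thresholds are illustrative assumptions, not Nagios XI defaults.
    """
    try:
        with os.scandir(spool_dir) as entries:
            count = sum(1 for entry in entries if entry.is_file())
    except OSError as exc:
        # Directory missing or unreadable: we cannot say anything useful.
        return 3, f"UNKNOWN - cannot read {spool_dir}: {exc}"

    if count >= crit:
        return 2, f"CRITICAL - {count} perfdata files waiting in {spool_dir}"
    if count >= warn:
        return 1, f"WARNING - {count} perfdata files waiting in {spool_dir}"
    return 0, f"OK - {count} perfdata files waiting in {spool_dir}"
```

Pointed at the spool/perfdata folder mentioned above, a pile-up like the 80,000 files I found would come back as CRITICAL with these example thresholds, whether or not the pid file still claims NPCD is alive.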

I could be wrong, but this is behaviour I've observed and have finally been able to pinpoint.

Re: Suggestion for System Status

Posted: Wed Jul 03, 2013 9:32 am
by sreinhardt
Actually, both the system status page and the check marks in the upper right should show the status of npcd. It should be the second from the left, under performance grapher. Also, while npcd is stopped temporarily if max load is reached, it should not stay stopped indefinitely. It sounds more like it is closing improperly and the pid file that we check for status is not getting removed. I am not sure whether the logic is already there, but along the same lines it would probably be a good idea for us to take that pid file and verify the process is actually running instead of blindly accepting it.
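
As a minimal sketch of that idea (not the actual Nagios XI status code, which may work differently): read the pid file, then ask the kernel whether that pid exists by sending signal 0. The function names and the "running" / "stale" / "stopped" labels are hypothetical, the pid file path is passed in by the caller, and this assumes a POSIX system.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        # kill(0, ...) and negative pids address process groups, not one pid.
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def daemon_status(pid_file):
    """Classify a daemon as 'running', 'stale' or 'stopped' from its pid file."""
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return "stopped"
    except ValueError:
        # Empty or garbled pid file.
        return "stale"
    return "running" if pid_is_running(pid) else "stale"
```

A "stale" result is the case described above: the pid file is still there but nothing is running under that pid. One caveat is that pids get reused, so a stricter check would also confirm the process name matches the daemon.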

Re: Suggestion for System Status

Posted: Thu Jul 04, 2013 5:49 pm
by Box293
I think I've seen both behaviours in relation to the system status. Because I've only come across the problem about every 4-6 months, my memory isn't the best lol. But I do remember the performance grapher icon being the exclamation mark in the past, and in this most recent case it was a tick.

As you say, it's probably related to the logic around the pid file and improving on what already exists.

Re: Suggestion for System Status

Posted: Mon Jul 08, 2013 10:44 am
by sreinhardt
Ha, good old memory! I'll take a look into the logic of how we are checking to verify. Maybe we can make some changes to it if the logic isn't there!