Suggestion for System Status
Posted: Tue Jul 02, 2013 10:22 pm
Occasionally I have come across a problem where performance graphs were no longer being updated. My fix each time was to restart the Nagios XI host, and everything came good again. Recently I figured out what is going on, and I have a suggestion, assuming my findings are correct.
In a nutshell, for one reason or another, when the NPCD daemon is terminated, the Nagios XI system status does not detect that this has occurred. To the admin looking at Nagios XI, the system is OK: "six ticks along the top bar".
I discovered this on a test XI VM I deployed recently. I added a Windows server to test some new checks, and when I came back a day later no performance graphs were being generated.
I found that the "/usr/local/nagios/var/spool/perfdata" folder contained about 80,000 files.
When I had a look at "/usr/local/nagios/var/npcd.log" I found the following lines:
Code:
[06-07-2013 15:28:16] NPCD: WARN: MAX load reached: load 12.210000/10.000000 at i=0
[06-23-2013 08:59:08] NPCD: Caught Termination Signal - Hasta la vista... baby
Ah hah. I was not exactly sure how to start NPCD again, so I simply rebooted the Nagios XI VM. Once it was back up and running, the number of files in spool/perfdata fell over time and eventually everything was working as expected. The server now had performance graphs, which included all of the spooled perfdata that had backed up.
I understand why the NPCD daemon was terminated; this is not a discussion about max loads etc.
From what I can determine, when the NPCD daemon is terminated and the spool/perfdata files are building up, there is no Monitoring Engine Status / system status check / dashlet that identifies a problem with NPCD.
So my suggestion is that the XI System Component Status, or the System OK status (or somewhere else), should include a check to alert the admin when the NPCD daemon is terminated or spool/perfdata files are building up.
I could be wrong, but this is just some observed behaviour that I've finally been able to pinpoint.
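To make the suggestion a bit more concrete, here is a rough sketch of the kind of check I have in mind, written as a Nagios-style plugin in Python. It is only an illustration and not something that ships with Nagios XI. The spool path comes from the post above; the thresholds, the function names, and the assumption that the daemon's process is called "npcd" and can be found by scanning /proc on a Linux host are all my own guesses and would need adjusting.

```python
#!/usr/bin/env python3
"""Hypothetical check for a stalled NPCD; a sketch, not part of Nagios XI."""
import os
import sys

# Spool path as quoted in the post; the thresholds are arbitrary assumptions.
SPOOL_DIR = "/usr/local/nagios/var/spool/perfdata"
WARN_FILES = 1000
CRIT_FILES = 10000

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_spool_files(spool_dir):
    """Count regular files waiting in the perfdata spool directory."""
    with os.scandir(spool_dir) as entries:
        return sum(1 for entry in entries if entry.is_file())


def npcd_running(proc_dir="/proc"):
    """Return True if some process under proc_dir was started as 'npcd'.

    Assumes a Linux-style /proc where <pid>/cmdline holds NUL-separated args.
    """
    for name in os.listdir(proc_dir):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_dir, name, "cmdline"), "rb") as fh:
                argv0 = fh.read().split(b"\0")[0]
        except OSError:
            continue  # process exited meanwhile, or is not readable
        if os.path.basename(argv0) == b"npcd":
            return True
    return False


def evaluate(file_count, running, warn=WARN_FILES, crit=CRIT_FILES):
    """Map the two observations to a Nagios status code and message."""
    perf = f"spool_files={file_count};{warn};{crit}"
    if not running:
        text = f"NPCD CRITICAL - daemon not running, {file_count} files spooled"
        return CRITICAL, f"{text} | {perf}"
    if file_count >= crit:
        return CRITICAL, f"NPCD CRITICAL - {file_count} files spooled | {perf}"
    if file_count >= warn:
        return WARNING, f"NPCD WARNING - {file_count} files spooled | {perf}"
    return OK, f"NPCD OK - {file_count} files spooled | {perf}"


def main():
    """Run both checks, print one status line, and return the exit code."""
    try:
        code, message = evaluate(count_spool_files(SPOOL_DIR), npcd_running())
    except OSError as exc:
        code, message = UNKNOWN, f"NPCD UNKNOWN - {exc}"
    print(message)
    return code

# To use this as a plugin, end the script with: sys.exit(main())
```

With something like this scheduled as an ordinary service check on the XI host itself, a dead daemon or a growing spool would show up as a CRITICAL or WARNING service instead of going unnoticed. Checking both conditions seemed sensible to me, because the file count alone only catches the problem after a backlog has already built up.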