The permission reset script did not resolve the issue, even after stopping and starting both nagios and ndo2db.
Running on RHEL7 with systemd, so no /etc/init.d/nagios. I checked a working machine and none there either. I also grepped for "nagiosretentionfile" across both systems and there was none.
I have sent the profile to you by PM.
It is starting to get a bit annoying, as we are testing on a few hundred systems to set thresholds, and we currently have 13K checks to work through.
I think I might have found the issue. I logged into Core to check, and got:
Error: Could not open CGI config file '/usr/local/nagios/etc/cgi.cfg' for reading!
Broken Machine:
drwxrwxr-x 7 apache nagios 4096 Nov 15 16:08 .
drwxr-xr-x 9 apache nagios 94 Oct 30 11:13 ..
-rw-rw-r-- 1 apache nagios 35 Nov 13 17:05 cgi.cfg
Working machine:
drwxrwxr-x 7 apache nagios 4096 Nov 14 11:21 .
drwxr-xr-x 9 apache nagios 94 Sep 24 11:54 ..
-rw-rw-r-- 1 apache nagios 2150 Nov 13 17:06 cgi.cfg
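The comparison above (a 35-byte cgi.cfg on the broken machine versus 2150 bytes on the working one) can be turned into a quick sanity check. This is only a minimal sketch: the function name `check_config` and the `min_size` cutoff of 100 bytes are assumptions chosen for illustration, not anything Nagios provides.

```python
import os


def check_config(path, min_size=100):
    """Return a list of problems found with the config file at `path`.

    `min_size` is an arbitrary illustrative cutoff in bytes; a healthy
    cgi.cfg in this thread was about 2 KB, the broken one 35 bytes.
    """
    if not os.path.isfile(path):
        return [f"{path}: missing or not a regular file"]

    problems = []
    # Readability is checked for the user running this script, which may
    # differ from the apache user that serves the CGIs.
    if not os.access(path, os.R_OK):
        problems.append(f"{path}: not readable by the current user")

    size = os.path.getsize(path)
    if size < min_size:
        problems.append(f"{path}: only {size} bytes, possibly truncated")
    return problems
```

Calling `check_config("/usr/local/nagios/etc/cgi.cfg")` on each machine and comparing the returned lists would flag the truncated file without having to eyeball the directory listing.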
@bomahony, glad you were able to locate the problem. I'm not sure why the cgi.cfg file got corrupted, but that can definitely cause this problem.
I looked through your profile and everything looked good except for some performance data warnings. For example, your system load reaches up to 110 from time to time, while perfdata stops being processed once the load goes above 10. The timeout is also set to only 5 seconds. Together, these can cause some perfdata to get lost.
You could follow this tutorial to increase the load_threshold and the timeout setting for NPCD: https://support.nagios.com/kb/article.php?id=9
However, keep in mind that the more resources you dedicate to NPCD, the fewer are left for Nagios.
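To see how often the load on a machine would exceed a given threshold, something like the following could be run periodically. It is a rough sketch only: `perfdata_would_run` is a hypothetical helper that mirrors the comparison described above (load against load_threshold), not NPCD's actual implementation.

```python
import os


def perfdata_would_run(load, load_threshold):
    """Return True if `load` is at or below `load_threshold`.

    Hypothetical helper for illustration; it only models the simple
    comparison discussed in this thread.
    """
    return load <= load_threshold


def current_load():
    """Return the 1-minute load average (Unix only)."""
    return os.getloadavg()[0]
```

With the figures from the profile, `perfdata_would_run(110, 10)` is False, so a load spike of that size would stall processing until the load drops or the threshold is raised. Passing `current_load()` as the first argument checks the machine's load at that moment.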
Thanks for the info. TBH we are still in the build/test phase of the DC. The machine currently has 6 vCPU and 16GB RAM. I can increase these if it will help. I will also look at increasing the NPCD settings.
I will probably have to take an in-depth look at the load issue first, when I get time.
Where were you getting these values from historically, so I can have a look?