We have two NAGIOS servers (with the same configuration parameters) that monitor many servers. However, on the primary NAGIOS server we see "Check results for service ... are stale by xxx" for only some of the monitored servers, while on the secondary there is no issue at all.
Example of the debug log...
[Tue Apr 3 09:59:59 2018.820447] [016.1] [pid=15267] Check results for service 'Check_cpu_host' on host 'xxx' are stale by 0d 0h 0m 58s (threshold=0d 0h 12m 0s). Forcing an immediate check of the service...
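To see which hosts and services trigger this message most often, the debug log can be summarized with a short script. The following is only a minimal sketch: the regular expression is derived from the single sample line above, and the helper names `parse_stale` and `count_stale_by_host` are hypothetical, not part of Nagios.

```python
import re
from collections import Counter

# Pattern inferred from the one sample debug line in this thread (assumption:
# other stale-check lines in the debug log follow the same wording).
STALE_RE = re.compile(
    r"Check results for service '(?P<service>[^']+)' on host '(?P<host>[^']+)' "
    r"are stale by (?P<d>\d+)d (?P<h>\d+)h (?P<m>\d+)m (?P<s>\d+)s"
)


def parse_stale(line):
    """Return (host, service, stale_seconds), or None if the line does not match."""
    m = STALE_RE.search(line)
    if m is None:
        return None
    seconds = (
        int(m["d"]) * 86400 + int(m["h"]) * 3600 + int(m["m"]) * 60 + int(m["s"])
    )
    return m["host"], m["service"], seconds


def count_stale_by_host(lines):
    """Count how many stale-check messages each host produced."""
    counts = Counter()
    for line in lines:
        parsed = parse_stale(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


sample = (
    "[Tue Apr 3 09:59:59 2018.820447] [016.1] [pid=15267] Check results for "
    "service 'Check_cpu_host' on host 'xxx' are stale by 0d 0h 0m 58s "
    "(threshold=0d 0h 12m 0s). Forcing an immediate check of the service..."
)
print(parse_stale(sample))  # ('xxx', 'Check_cpu_host', 58)
```

Running `count_stale_by_host` over the debug log of each server and comparing the two results might show whether the affected hosts have anything in common.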
Hello, @Nabi.
I would like to take a look at your system profile to tell what's going on.
To send us your system profile, log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" Menu
Click the "Download Profile" button
Save the profile.zip file, upload it to a cloud storage of your choice and share a download link with me via private message.
After that please post something in this thread to bring it back up in the support queue.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
@Nabi, I think you can send it right now since 2 posts are the requirement. You can also upload the file to the thread but keep in mind that other users will be able to see it as well.
A profile was received and shared with the support team.
The "passive check was not received" issue happens only on the primary NAGIOS server and only on a few devices, while the other devices are OK.
The secondary NAGIOS server does not show any issue for any device.
The issue happens on the primary NAGIOS server at random times. Moreover, it clears after a few minutes.
So I am not sure whether what you asked for would be helpful here, or whether you want me to wait until the issue happens and then send you a debug log file or similar.
Example:
Primary_NAGIOS: zgrep -i USC /usr/local/nagiosxi/cpe_logs/uebNagios.log.20180418_235903.gz
2018-04-18 18:43:21 Nagios alarm for USC Check_passive_nagios CRITICAL
2018-04-18 18:45:22 Nagios alarm for USC Check_passive_nagios OK
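To quantify how long each of these alarms lasts, the CRITICAL and OK lines can be paired up. This is a rough sketch that assumes every line in this custom log has the form `<date> <time> Nagios alarm for <host> <service> <state>`, as in the two lines above; the function name `outage_durations` is hypothetical.

```python
import re
from datetime import datetime

# Line format assumed from the two sample lines in this thread.
ALARM_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Nagios alarm for "
    r"(?P<host>\S+) (?P<service>\S+) (?P<state>\S+)$"
)


def outage_durations(lines):
    """Pair each non-OK alarm with the next OK for the same host/service.

    Returns a list of (host, service, duration_in_seconds).
    """
    open_since = {}  # (host, service) -> time of the first non-OK alarm
    outages = []
    for line in lines:
        m = ALARM_RE.match(line.strip())
        if m is None:
            continue
        key = (m["host"], m["service"])
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        if m["state"] == "OK":
            start = open_since.pop(key, None)
            if start is not None:
                outages.append((key[0], key[1], (ts - start).total_seconds()))
        else:
            # Keep the earliest non-OK timestamp if alarms repeat before recovery.
            open_since.setdefault(key, ts)
    return outages


log_lines = [
    "2018-04-18 18:43:21 Nagios alarm for USC Check_passive_nagios CRITICAL",
    "2018-04-18 18:45:22 Nagios alarm for USC Check_passive_nagios OK",
]
print(outage_durations(log_lines))  # [('USC', 'Check_passive_nagios', 121.0)]
```

For the example above the alarm lasted about two minutes, which is well under the 12-minute threshold shown in the debug log line earlier in this thread.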
I am not sure whether there is any way other than restarting NAGIOS on this server, because this is a production NAGIOS server and we cannot interfere with it too much.
However, you say there is some database error. Can this error affect just the same few devices that are sending passive checks and leave the others unaffected?
Also, do you see the same DB errors on the secondary NAGIOS server? On the secondary one there is no issue with passive checks at all.
Please let me know if you have a way to debug this issue other than restarting NAGIOS. If there is not, then I will restart it.