I'm getting a lot of false alarms from passive checks. I have raised the threshold well beyond what should be needed, but many false alarms still get through. The log points to an apparent bug: it reports a service as stale by 16829d 6h 38m 22s (threshold=0d 0h 21m 40s). That duration looks like it is measured from the Unix epoch, even though the service had been checked recently.
[1454050801] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;srv-app03-test101;Apt;0;APT OK: 0 packages available for upgrade (0 critical updates). |available_upgrades=0;;;0 critical_updates=0;;;0
[1454050801] PASSIVE SERVICE CHECK: srv-app03-test101;Apt;0;APT OK: 0 packages available for upgrade (0 critical updates).
[1454050802] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;thc-cyg-hcapp-test101;Apt;0;APT OK: 0 packages available for upgrade (0 critical updates). |available_upgrades=0;;;0 critical_updates=0;;;0
[1454050802] PASSIVE SERVICE CHECK: thc-cyg-hcapp-test101;Apt;0;APT OK: 0 packages available for upgrade (0 critical updates).
[1454050802] Warning: The results of service 'Apt' on host 'srv-app03-test101' are stale by 16829d 6h 38m 22s (threshold=0d 0h 21m 40s). I'm forcing an immediate check of the service.
[1454050802] SERVICE ALERT: srv-app03-test101;Apt;WARNING;HARD;1;WARNING: Missing report. This does not necessarily indicate an error.
[1454050802] SERVICE NOTIFICATION: kajsa;srv-app03-test101;Apt;WARNING;notify-by-email;WARNING: Missing report. This does not necessarily indicate an error.
[1454050802] SERVICE NOTIFICATION: kalle;srv-app03-test101;Apt;WARNING;notify-by-email;WARNING: Missing report. This does not necessarily indicate an error.
[1454051102] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;thc-cyg-hcapp-test101;Apt;0;APT OK: 0 packages available for upgrade (0 critical updates). |available_upgrades=0;;;0 critical_updates=0;;;0
[1454051102] PASSIVE SERVICE CHECK: thc-cyg-hcapp-test101;Apt;0;APT OK: 0 packages available for upgrade (0 critical updates).
[1454051102] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;srv-app03-test101;Apt;0;APT OK: 0 packages available for upgrade (0 critical updates). |available_upgrades=0;;;0 critical_updates=0;;;0
[1454051102] PASSIVE SERVICE CHECK: srv-app03-test101;Apt;0;APT OK: 0 packages available for upgrade (0 critical updates).
[1454051102] SERVICE ALERT: srv-app03-test101;Apt;OK;HARD;1;APT OK: 0 packages available for upgrade (0 critical updates).
[1454051102] SERVICE NOTIFICATION: kajsa;srv-app03-test101;Apt;OK;notify-by-email;APT OK: 0 packages available for upgrade (0 critical updates).
[1454051102] SERVICE NOTIFICATION: kalle;srv-app03-test101;Apt;OK;notify-by-email;APT OK: 0 packages available for upgrade (0 critical updates).
[1454051402] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;srv-app03-test101;Apt;0;APT OK: 0 packages available for upgrade (0 critical updates). |available_upgrades=0;;;0 critical_updates=0;;;0
[1454051402] PASSIVE SERVICE CHECK: srv-app03-test101;Apt;0;APT OK: 0 packages available for upgrade (0 critical updates).
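One way to sanity-check the "since the epoch" reading is to work backwards from the numbers in the warning logged at [1454050802]. The sketch below is plain arithmetic, not Nagios code, and the function names are made up for illustration. It assumes the warning reports current time minus expiration time as the stale duration, which is consistent with the log lines later in this thread that print both values.

```c
#include <stdio.h>

/* Convert a "Dd Hh Mm Ss" duration, as printed in the log, to seconds. */
static long dhms_to_seconds(long days, long hours, long minutes, long seconds)
{
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds;
}

/* Recover the expiration time implied by a "stale by" warning.
 * Assumption: stale duration = log timestamp - expiration time. */
static long implied_expiration(long log_time, long stale_seconds)
{
    return log_time - stale_seconds;
}
```

With the logged values, dhms_to_seconds(16829, 6, 38, 22) is 1454049502, so the implied expiration time is 1454050802 - 1454049502 = 1300. The threshold of 0d 0h 21m 40s is also 1300 seconds. The expiration time therefore looks like "zero plus the threshold", which would fit a base time of 0 being used in the calculation. That last step is an inference from the arithmetic, not something the log states.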
hsmith wrote: Is the time on your system off by chance?
No, I don't think so. The time is correct. However, this is on a virtual server. I'll see if I can investigate this lead in more depth. Thanks for the input!
[1454505675] Warning: The results of service 'CPU' on host 'all-xxx1' are stale by 0d 0h 4m 1s (threshold=0d 0h 6m 0s) (current_time=1454505675, expiration_time=1454505434). I'm forcing an immediate check of the service.
[1454505675] Warning: The results of service 'Disk' on host 'all-xxx1' are stale by 0d 0h 4m 1s (threshold=0d 0h 6m 0s) (current_time=1454505675, expiration_time=1454505434). I'm forcing an immediate check of the service.
[1454505675] Warning: The results of service 'Memory' on host 'all-xxx1' are stale by 0d 0h 4m 1s (threshold=0d 0h 6m 0s) (current_time=1454505675, expiration_time=1454505434). I'm forcing an immediate check of the service.
[1454505675] Warning: The results of service 'Processes' on host 'all-xxx1' are stale by 0d 0h 4m 1s (threshold=0d 0h 6m 0s) (current_time=1454505675, expiration_time=1454505434). I'm forcing an immediate check of the service.
[1454506125] Warning: The results of service 'Swap usage' on host 'hms-xxx1' are stale by 16834d 9h 9m 40s (threshold=0d 0h 16m 40s) (current_time=1454506125, expiration_time=15545). I'm forcing an immediate check of the service.
The expiration_time seems to be wrong now and then: in the last line above it is 15545, which is nowhere near current_time. I'm a little stuck here.
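To find such lines in a larger log, the two timestamps can be pulled out of each warning and compared. This is a minimal sketch that assumes the line format shown above, with the "(current_time=..., expiration_time=...)" suffix. The function name is hypothetical and not part of Nagios.

```c
#include <stdio.h>
#include <string.h>

/* Extract current_time and expiration_time from one "stale by" warning line.
 * Returns 1 on success, 0 if the line does not contain both fields. */
static int parse_stale_times(const char *line, long *current_time,
                             long *expiration_time)
{
    /* Skip ahead to the field pair; earlier parts of the line also
     * contain parentheses, so search for the exact prefix. */
    const char *p = strstr(line, "(current_time=");
    if (p == NULL)
        return 0;
    return sscanf(p, "(current_time=%ld, expiration_time=%ld",
                  current_time, expiration_time) == 2;
}
```

For the 'Swap usage' line this yields 1454506125 and 15545, a gap of 1454490580 seconds, which matches the logged 16834d 9h 9m 40s. For the 'all-xxx1' lines the gap is 241 seconds, matching 0d 0h 4m 1s.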
I did not find a solution to this, and I do not have the time to investigate it further. As an ugly workaround, I added the following to is_service_result_fresh, and the corresponding check to is_host_result_fresh:
/* Added by MR. Just check for insanely small expiration times */
if (expiration_time < 1400000000) {
	logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: The results of service '%s' on host '%s' are stale by %dd %dh %dm %ds (threshold=%dd %dh %dm %ds) (current_time=%d, expiration_time=%d) but it's too much. Letting it pass.\n", temp_service->description, temp_service->host_name, days, hours, minutes, seconds, tdays, thours, tminutes, tseconds, (int)current_time, (int)expiration_time);
	log_debug_info(DEBUGL_CHECKS, 1, "Check results for service '%s' on host '%s' are stale by %dd %dh %dm %ds (threshold=%dd %dh %dm %ds) but it's too much. Letting it pass.\n", temp_service->description, temp_service->host_name, days, hours, minutes, seconds, tdays, thours, tminutes, tseconds);
	return TRUE;
}
It is now working as I expect. I'll put finding the real source of the problem on my to-do list.
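For anyone who wants to experiment with the same idea outside the Nagios source tree, here is a standalone variation. It is not the patch above and not a Nagios fix. Instead of the hard-coded 1400000000, it uses a cutoff relative to current_time, so it does not depend on a fixed date. The enum, the function name, and the one-year cutoff are all assumptions chosen for illustration.

```c
#include <time.h>

enum freshness { RESULT_FRESH, RESULT_STALE, RESULT_SUSPECT };

/* Hypothetical cutoff: an expiration time more than a year in the past is
 * treated as bogus rather than as a genuinely stale result. */
#define MAX_PLAUSIBLE_STALENESS (365L * 24L * 60L * 60L)

/* Classify an expiration time against the current time.
 * Assumption: a result counts as stale once expiration_time < current_time. */
static enum freshness classify_expiration(time_t current_time,
                                          time_t expiration_time)
{
    if (expiration_time >= current_time)
        return RESULT_FRESH;
    if (difftime(current_time, expiration_time) > (double)MAX_PLAUSIBLE_STALENESS)
        return RESULT_SUSPECT;  /* e.g. expiration_time=15545 from the log */
    return RESULT_STALE;
}
```

A caller that wants the same behaviour as the workaround would log a warning and treat RESULT_SUSPECT like RESULT_FRESH. Like the original patch, this only hides the symptom: it does not explain where the tiny expiration_time comes from.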