Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
Hi. We have a fresh build of Nagios 4.3.4 on CentOS 7 that is receiving passive host / service checks from numerous systems via gearman.
Everything works fine except that, for a handful of systems, all the service checks go stale 30 seconds after the check is received (the host checks are fine).
Below are the log entries for one of the services on one of the systems:
[Wed Nov 22 22:34:21 2017] PASSIVE SERVICE CHECK: host;System-Partitions;0;DISK OK
[Wed Nov 22 22:34:21 2017] SERVICE ALERT: host;System-Partitions;OK;HARD;1;DISK OK
[Wed Nov 22 22:35:01 2017] Warning: The results of service 'System-Partitions' on host 'host' are stale by 0d 1h 15m 30s (threshold=0d 0h 14m 0s). I'm forcing an immediate check of the service.
[Wed Nov 22 22:35:11 2017] SERVICE ALERT: host;System-Partitions;CRITICAL;HARD;1;CRITICAL: No Recent Passive Service Checks.
[Wed Nov 22 22:39:21 2017] PASSIVE SERVICE CHECK: host;System-Partitions;0;DISK OK
[Wed Nov 22 22:39:21 2017] SERVICE ALERT: host;System-Partitions;OK;HARD;1;DISK OK
[Wed Nov 22 22:40:01 2017] Warning: The results of service 'System-Partitions' on host 'host' are stale by 0d 1h 15m 30s (threshold=0d 0h 14m 0s). I'm forcing an immediate check of the service.
[Wed Nov 22 22:40:11 2017] SERVICE ALERT: host;System-Partitions;CRITICAL;HARD;1;CRITICAL: No Recent Passive Service Checks.
[Wed Nov 22 22:44:21 2017] PASSIVE SERVICE CHECK: host;System-Partitions;0;DISK OK
[Wed Nov 22 22:44:21 2017] SERVICE ALERT: host;System-Partitions;OK;HARD;1;DISK OK
[Wed Nov 22 22:45:01 2017] Warning: The results of service 'System-Partitions' on host 'host' are stale by 0d 1h 15m 30s (threshold=0d 0h 14m 0s). I'm forcing an immediate check of the service.
[Wed Nov 22 22:45:11 2017] SERVICE ALERT: host;System-Partitions;CRITICAL;HARD;1;CRITICAL: No Recent Passive Service Checks.
[Wed Nov 22 22:49:21 2017] PASSIVE SERVICE CHECK: host;System-Partitions;0;DISK OK
[Wed Nov 22 22:49:21 2017] SERVICE ALERT: host;System-Partitions;OK;HARD;1;DISK OK
[Wed Nov 22 22:50:01 2017] Warning: The results of service 'System-Partitions' on host 'host' are stale by 0d 1h 15m 29s (threshold=0d 0h 14m 0s). I'm forcing an immediate check of the service.
[Wed Nov 22 22:50:11 2017] SERVICE ALERT: host;System-Partitions;CRITICAL;HARD;1;CRITICAL: No Recent Passive Service Checks.
[Wed Nov 22 22:54:31 2017] PASSIVE SERVICE CHECK: host;System-Partitions;0;DISK OK
[Wed Nov 22 22:54:31 2017] SERVICE ALERT: host;System-Partitions;OK;HARD;1;DISK OK
[Wed Nov 22 22:55:00 2017] Warning: The results of service 'System-Partitions' on host 'host' are stale by 0d 1h 15m 27s (threshold=0d 0h 14m 0s). I'm forcing an immediate check of the service.
[Wed Nov 22 22:55:11 2017] SERVICE ALERT: host;System-Partitions;CRITICAL;HARD;1;CRITICAL: No Recent Passive Service Checks.
As you can see, the passive service check is received and the service alert is set to OK. Then, 30 to 40 seconds later, there is a warning that the service results are stale (by well over an hour, according to the message) and the service alert is set to CRITICAL.
As I said, we have exactly the same host and service checks on numerous systems that don't exhibit this behaviour.
Does anyone know why this is happening for the handful of systems?
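For anyone reading the log above, here is a small sketch of the arithmetic behind the "stale by" figure. It assumes Nagios reports staleness as the current time minus (last check time + freshness threshold); that formula is an assumption based on how the message reads and is not confirmed anywhere in this thread. The function name `implied_last_check` is made up for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the nagios.log lines quoted above.
LOG_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def implied_last_check(warning_time: str, stale_by: timedelta,
                       threshold: timedelta) -> datetime:
    """Return the last-check time implied by a staleness warning.

    ASSUMPTION: stale_by = now - (last_check + threshold).
    """
    now = datetime.strptime(warning_time, LOG_TIME_FORMAT)
    return now - stale_by - threshold


# Values copied from the first warning in the log above.
last_check = implied_last_check(
    "Wed Nov 22 22:35:01 2017",
    stale_by=timedelta(hours=1, minutes=15, seconds=30),
    threshold=timedelta(minutes=14),
)
# Time at which the preceding passive result was logged.
received = datetime.strptime("Wed Nov 22 22:34:21 2017", LOG_TIME_FORMAT)

print(last_check)             # 2017-11-22 21:05:31
print(received - last_check)  # 1:28:50
```

If the assumption holds, the last check time Nagios is working with for this service is roughly an hour and a half older than the time the passive result was logged, which may be worth comparing against the timestamps the affected systems send.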
---------------------------------------------------------------------------------------------------------------------
The objects.cache file is quite big, so I am only showing the relevant information for one monitored site.
########################################
# NAGIOS OBJECT CACHE FILE
#
# THIS FILE IS AUTOMATICALLY GENERATED
# BY NAGIOS. DO NOT MODIFY THIS FILE!
#
# Created: Fri Nov 24 10:26:29 2017
########################################
define timeperiod {
    timeperiod_name 24x7
    alias 24 Hours A Day, 7 Days A Week
    sunday 00:00-24:00
    monday 00:00-24:00
    tuesday 00:00-24:00
    wednesday 00:00-24:00
    thursday 00:00-24:00
    friday 00:00-24:00
    saturday 00:00-24:00
    }
define timeperiod {
    timeperiod_name 24x7_sans_holidays
    alias 24x7 Sans Holidays
    december 25 00:00-00:00
    july 4 00:00-00:00
    january 1 00:00-00:00
    thursday 4 november 00:00-00:00
    monday 1 september 00:00-00:00
    monday -1 may 00:00-00:00
    sunday 00:00-24:00
    monday 00:00-24:00
    tuesday 00:00-24:00
    wednesday 00:00-24:00
    thursday 00:00-24:00
    friday 00:00-24:00
    saturday 00:00-24:00
    }
define timeperiod {
    timeperiod_name none
    alias No Time Is A Good Time
    }
define timeperiod {
    timeperiod_name us-holidays
    alias U.S. Holidays
    january 1 00:00-00:00
    july 4 00:00-00:00
    december 25 00:00-00:00
    monday -1 may 00:00-00:00
    monday 1 september 00:00-00:00
    thursday 4 november 00:00-00:00
    }
The settings for those checks look like they should work. You should check the Mod-Gearman settings; it may be sending in old check results, which would cause the freshness check to be triggered.
Be sure to check out our Knowledgebase for helpful articles and solutions!
As far as I can tell there is nothing wrong with mod_gearman.
Is the freshness checking based on when the checks were received, or some timestamp included in the check?
According to the logs, the checks are received less than a minute before the "stale" warning. If Nagios is not using the time the check was received to determine the freshness, what else could it be using?
Freshness should be checked against the time the check result was received.
Can you PM me the objects.cache file and the status.dat file from your server, as well as the host name and service name that are having the issue?
If they are large, you will have to zip them up first.
Thanks
Note: PM Received and shared with the other Techs.
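If you want to look at the status.dat entry yourself first, something along these lines might help. It is only a sketch: it assumes the usual status.dat layout of `servicestatus {` blocks containing `key=value` lines (including `host_name`, `service_description` and `last_check`), and the helper names and the sample values below are made up for illustration, not taken from the poster's server.

```python
from typing import Dict, Iterator, Optional


def iter_status_blocks(text: str,
                       block_type: str = "servicestatus") -> Iterator[Dict[str, str]]:
    """Yield each '<block_type> { ... }' block as a dict of its key=value lines."""
    fields: Optional[Dict[str, str]] = None
    for raw in text.splitlines():
        line = raw.strip()
        if fields is None:
            # Outside a block: wait for the opening line of the wanted type.
            if line == block_type + " {":
                fields = {}
        elif line == "}":
            # End of the current block.
            yield fields
            fields = None
        else:
            # Split on the first '=' only, since plugin output may contain '='.
            key, sep, value = line.partition("=")
            if sep:
                fields[key] = value


def find_service(text: str, host: str, service: str) -> Optional[Dict[str, str]]:
    """Return the servicestatus block for host/service, or None if not found."""
    for block in iter_status_blocks(text):
        if (block.get("host_name") == host
                and block.get("service_description") == service):
            return block
    return None


# Invented sample data, only to show the expected layout.
SAMPLE = """\
servicestatus {
	host_name=host
	service_description=System-Partitions
	has_been_checked=1
	last_check=1511384731
	plugin_output=DISK OK
	}
"""

block = find_service(SAMPLE, "host", "System-Partitions")
print(block["last_check"])  # 1511384731
```

On a real system you would read the file named by `status_file` in nagios.cfg instead of `SAMPLE`, then compare `last_check` (seconds since the epoch) with the time the passive result shows up in nagios.log.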
It looks like the service is not being updated with the current check status when an OK state comes in.
It could be caused by a bad entry in Nagios's status files.
To fix that, you would have to stop the nagios process, delete the retention.dat file, and then start the nagios process so the file can be rebuilt.
A couple of things happen when this is done:
Any notes added to an object and any downtime will be lost.
Also, the system will act as if it is starting for the first time and will recheck all hosts and services, so be prepared for that.
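If you would rather keep a copy than delete the file outright, a rename has the same effect for Nagios. The following is a minimal sketch, assuming the nagios process has already been stopped; the function name is made up, and the path in the usage comment is only the common source-install default (check `state_retention_file` in nagios.cfg for the real one).

```python
import shutil
import time
from pathlib import Path


def set_aside_retention_file(path: str) -> Path:
    """Rename the retention file to a timestamped backup and return the new path.

    Run this only while the nagios process is stopped; Nagios should build a
    fresh retention file the next time it saves state.
    """
    source = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    shutil.move(str(source), str(backup))
    return backup


# Example call (path is an ASSUMPTION, adjust to your installation):
# set_aside_retention_file("/usr/local/nagios/var/retention.dat")
```

Keeping the old file around means the comments and downtime entries are not gone for good if the reset turns out not to help.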
tgriep wrote: It looks like the service is not being updated with the current check status when an OK state comes in.
It could be caused by a bad entry in Nagios's status files.
To fix that, you would have to stop the nagios process, delete the retention.dat file, and then start the nagios process so the file can be rebuilt.
A couple of things happen when this is done:
Any notes added to an object and any downtime will be lost.
Also, the system will act as if it is starting for the first time and will recheck all hosts and services, so be prepared for that.
Thanks for the assistance. I have implemented the changes as requested but unfortunately the issue continues to occur.