Invalid Freshness Alarms
Posted: Mon Mar 16, 2015 10:52 am
Hello,
We are having an issue with invalid alarms on freshness checks. It happens randomly, perhaps once a day in total across several hundred services on four primary Nagios servers. Some days we get none, some days two or three, and there is no apparent pattern in when the alarms fire. In every case, the staleness alarm is for a service that is definitely not stale, and no matter how high we set freshness_threshold, the alarms still occur.
A bit about our environment :
* Nagios 4.0.8
* Check_MK/Livestatus 1.2.5i5p4
* Two primary servers in each of two datacenters - one DC with ~500 hosts / 16.5k services and the other with ~200 hosts / 6.5k services
* Custom golang process (gostatus) accepting passive checks and writing them to status.dat
We have crons that run every minute for several applications in our environment (mysql, redis, memcached, etc.). Each cron gathers statistics and status for its application, then sends the stats to statsd and submits passive service checks to Nagios via gostatus. Each cron takes a randomized delay on its command line so that we don't send in hundreds of checks every minute right on the minute.
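For context, the shape of those per-minute crons can be sketched as below. The PROCESS_SERVICE_CHECK_RESULT line is Nagios's standard external-command syntax; whether gostatus accepts exactly this wire format is an assumption, and the host, service, delay window, and output text are illustrative only:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// formatPassiveResult renders a service result in Nagios's standard
// external-command syntax: [ts] PROCESS_SERVICE_CHECK_RESULT;host;svc;status;output
// gostatus may well use its own format; this is just the canonical one.
func formatPassiveResult(ts int64, host, svc string, status int, output string) string {
	return fmt.Sprintf("[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s",
		ts, host, svc, status, output)
}

func main() {
	// Randomized start-up delay so hundreds of crons don't all submit on
	// the minute boundary (kept short here; production would use a wider window).
	time.Sleep(time.Duration(rand.Intn(3)) * time.Second)

	fmt.Println(formatPassiveResult(time.Now().Unix(), "host-02",
		"MySQL Status database2", 0, "OK - mysql responding"))
}
```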
Here is an example service definition :
define service {
    use                    check_mk_perf,default-service,passive-service-cron  ; one use line; templates are comma-separated
    host_name              host-02
    service_description    MySQL Status database2
    check_command          check_passive
    service_groups         staging_mysql,gostatus
    notifications_enabled  1
}
And the passive-service-cron template :
define service {
    register               0
    name                   passive-service-cron
    max_check_attempts     1
    use                    default-service
    active_checks_enabled  0
    check_freshness        1
    freshness_threshold    330   ; 5m 30s: one 60s cron interval plus slack
}
And here is an example log of a bogus alarm : https://gist.githubusercontent.com/enic ... tfile1.txt
I have interleaved several log files by time to make this more readable: the normal nagios.log, Nagios debug output at debug_level 128 in nagios.debug, logs from our passive-check writer in gostatus.log, and cron logs from the host in cron.log.
As you can see, we are submitting checks every minute, and the gostatus.log entries are exactly what it writes to status.dat. Everything looks completely normal until the alarm. In this case, Nagios appears to write an entry with a 0 timestamp at 1426375694, which makes it believe it hasn't received a check since 1970, so it alarms.
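That would explain the alarm: Nagios's freshness test is essentially "now > last_check + freshness_threshold" (simplified; the real check also folds in latency and check interval), so a zeroed last_check makes the service look stale since the epoch. A minimal sketch of that arithmetic:

```go
package main

import "fmt"

// isStale is a simplified version of Nagios's service freshness test:
// a service is considered stale once "now" exceeds last_check plus the
// freshness threshold. Real Nagios also accounts for check latency and
// scheduling interval, omitted here.
func isStale(lastCheck, threshold, now int64) bool {
	return now > lastCheck+threshold
}

func main() {
	now := int64(1426375694)               // time of the bogus alarm
	fmt.Println(isStale(now-60, 330, now)) // checked 60s ago -> false
	fmt.Println(isStale(0, 330, now))      // zeroed timestamp -> true ("stale since 1970")
}
```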
Here is another example : https://gist.githubusercontent.com/enic ... tfile1.txt
This one is slightly different. In this case, some random check results show up in nagios.debug that are not being sent by the crons or gostatus; these results have ** in front of them. One of them appears to be a replay of the command sent at 1426353130: that timestamp is exactly the alarm time minus the freshness threshold (0d 0h 5m 30s) plus the reported staleness (0d 0h 0m 20s), i.e. 350 seconds before the alarm.
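As a sanity check on that arithmetic: if the replayed result carries the timestamp of the last check Nagios thinks it saw, the alarm should fire threshold-plus-staleness seconds later. A quick worked example (the expected alarm time here is derived from those numbers, not taken from the logs):

```go
package main

import "fmt"

// expectedAlarmTime derives when a staleness alarm should fire for a
// result last seen at lastCheck, given the configured threshold and the
// extra staleness Nagios reported in the alarm message.
func expectedAlarmTime(lastCheck, threshold, staleness int64) int64 {
	return lastCheck + threshold + staleness
}

func main() {
	replayed := int64(1426353130) // timestamp of the replayed command
	threshold := int64(330)       // freshness_threshold: 0d 0h 5m 30s
	staleness := int64(20)        // staleness reported in the alarm: 0d 0h 0m 20s
	fmt.Println(expectedAlarmTime(replayed, threshold, staleness)) // 1426353480
}
```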
I have no explanation for the 0 timestamps or for the old results being pushed in at random. I have spent weeks trying to debug this and have ruled out everything I can besides Nagios internals, so I would love any assistance in debugging further and figuring out what's going on. In general these checks work very well: we have hundreds being submitted every minute and see issues only about once a day, at random times.
Thanks,
Eron Nicholson
System Administrator
Basecamp