Invalid Freshness Alarms

vinzclortho · Post by **vinzclortho** » Mon Mar 16, 2015 10:52 am

Hello,
We are having an issue with invalid alarms on freshness checks. This happens randomly - perhaps once a day total across several hundred services on four primary nagios servers. Some days we get none, some days we get two or three, and there does not appear to be any pattern in the times that we get the alarms. In all cases, we get staleness alarms for services that are definitely not stale. It does not matter how long we set the freshness_threshold, they always alarm.

A bit about our environment :
* Nagios 4.0.8
* Check_MK/Livestatus 1.2.5i5p4
* Two primary servers in two datacenters - one DC with ~500 hosts / 16.5k services and one with ~200 hosts / 6.5k services
* Custom golang process (gostatus) accepting passive checks and writing them to status.dat

We have crons that run every minute for several applications we run in our environment like mysql, redis, memcached, etc. The crons gather statistics and status on each application then send stats to statsd and passive service checks via gostatus into nagios. The crons have a delay specified on the command line that is randomized so that we don't send in hundreds of checks every minute on the minute.

Here is an example service definition :
define service {
use check_mk_perf,default-service
host_name host-02
service_description MySQL Status database2
check_command check_passive
service_groups staging_mysql,gostatus
use passive-service-cron
notifications_enabled 1
}

And the passive-service-cron template :
define service {
register 0
name passive-service-cron
max_check_attempts 1
use default-service
active_checks_enabled 0
check_freshness 1
freshness_threshold 330
}

And here is an example log of a bogus alarm : https://gist.githubusercontent.com/enic ... tfile1.txt

I have interleaved several logfiles here based on time to make it more readable. We have the normal nagios.log, nagios debugging with debug_level 128 in nagios.debug, logs from our passive service writing process in gostatus.log and cron logs from the host in cron.log.

As you can see, we are submitting checks every minute. The gostatus.log entries are exactly what it is writing to status.dat. Everything looks totally normal until it alarms. In this case, it appears that nagios writes an entry with a 0 timestamp at 1426375694. This causes nagios to think that it hasn't gotten a check since 1970 and alarm.

Here is another example : https://gist.githubusercontent.com/enic ... tfile1.txt

This one is slightly different. In this case, we see some random check results show up in nagios.debug that are not being sent by the crons or gostatus. These results have ** in front of them. Apparently one of them is a replay of the command sent at 1426353130, as that is the timestamp that corresponds to the threshold + staleness time of 0d 0h 0m 20s + (threshold=0d 0h 5m 30s).

I do not have any explanation for the 0 timestamps or the randomly-generated old results being pushed in. I have spent weeks trying to debug this and have ruled out everything I can besides nagios internals. I would love any assistance you can provide to further debug this and figure out what's going on. These checks in general work very well - again we have hundreds of them being submitted every minute and only see issues perhaps once a day at random times.

Thanks,

Eron Nicholson
System Administrator
Basecamp

Post by **Box293** » Mon Mar 16, 2015 5:37 pm

Very strange.

One thing I have noticed is your gostatus.log entries have timestamps with varying decimal places:

Code: Select all

23:17:14.224983269
23:18:14.366517562
23:19:14.459773712
23:20:14.567259155
23:21:14.675603368
23:22:14.814155656
23:23:14.883750066
23:24:14.9982242
23:25:14.098640712
23:26:14.244302458
23:27:14.33311269
23:28:14.419685631
23:29:14.510542165

Code: Select all

17:10:09.662572232
17:11:09.839836388
17:12:10.01137049
17:13:09.207804922
17:14:09.363063987
17:15:09.538403893
17:16:09.710912875
17:17:09.893075621
17:18:10.069849664

I'm not sure if those timestamps are being used to then calculate the number used in the [square brackets] fields. If they are, perhaps set a fixed number of decimal points before being calculated. I know it's not likely the cause, but in odd issues like this you need to make sure everything is consistent.

Are you using a ramdisk at all? Perhaps the nagios.cmd should reside on a ramdisk?

vinzclortho · Post by **vinzclortho** » Mon Mar 16, 2015 6:21 pm

Thanks for the ideas.

Those timestamps with variable decimal length are just generated by gostatus as it logs - they are not used the calculate the command timestamps. The command timestamps within []s are generated by the cron scripts when they submit the commands.

As far as the ramdisk recommendation - it's possible that we could try that, but why do you think it would be effective? The servers are all SSD-backed and I tend to doubt that cmd file IO latency is an issue. We generally see near-zero IOwait on these machines and do not see any spikes when this happens.

Thanks,

Eron

cmerchant · Post by **cmerchant** » Tue Mar 17, 2015 4:08 pm

The ramdisk would IMHO help, not just for I/O speed, reduced path length , but for I/O avoidance - especially for extending the service life of SSD's.

How big is your status.dat currently?

Just wondering why some random delays are 8 seconds and some are 13. Have you calculated out the # of service checks updates/minute? Is this dedicated hardware or running in VM?

vinzclortho · Post by **vinzclortho** » Wed Mar 18, 2015 12:09 pm

I have added a ramdisk on one host in each datacenter. Unfortunately, we probably won't know whether it has any effect for a few days since these alarms are so infrequent.

I moved status.dat, nagios.cmd, tmp_files, objects.cache and objects.precache to the ramdisk.

status.dat is 25m on the larger install, 10m on the smaller.

The cron delays are set by chef when we add crons to hosts - the value is just output from rand(20). The random delays aren't really relevant to the freshness alarms - we still get check results every minute for all those checks since the crons aren't changed. I just included the note about them to explain why the results in logs come in later than the crons fire in the logs.

We see an average of 3k passive and 1k active check results per minute in the larger cluster and 0.024 sec average check latency.

Check load does not appear to be a factor - we see there alarms just as frequently on the smaller cluster.

This is dedicated hardware - 8 cores and 16g of ram. The host load is < 1.

cmerchant · Post by **cmerchant** » Wed Mar 18, 2015 2:49 pm

Thanks for the update. We'll wait and see whether the changes you made have cleared the freshness alarms. Keep us posted. Thanks.

vinzclortho · Post by **vinzclortho** » Thu Mar 19, 2015 11:15 am

FYI, we had another bogus alarm last night, so the ramdisk did not help. Unfortunately, that server did not have a nagios debug_level set, so I don't have full debugging info, just cron, nagios.log and gostatus logs :

https://gist.githubusercontent.com/enic ... tfile1.txt

I have rolled out the ramdisk settings and debug_level to all four hosts, so we should get nagios.debug info on the next one.

cmerchant · Post by **cmerchant** » Thu Mar 19, 2015 1:31 pm

Warning: The results of service 'MySQL Free Connections database3' on host 'host-03' are stale by 0d 0h 0m 1s (threshold=0d 0h 5m 30s). I'm forcing an immediate check of the service.

Yeow, a stale check by 1 second! That is random! Let us know when you get one more stale alert. Thanks.

vinzclortho · Post by **vinzclortho** » Fri Mar 20, 2015 3:20 pm

Ok, got another alarm today and got a bunch of very interesting info.

https://gist.githubusercontent.com/enic ... tfile1.txt

As you can see from the nagios.debug log, somehow gostatus pushed a command 12 hours ago and it just showed up then. Very odd.

I wrote a quick script to parse the nagios.debug log to compare the logged timestamp with the command timestamp in lines like :
[1426847123.061241] [128.1] [pid=9556] Command Entry Time: 1426800078

I discovered that our less-loaded datacenter has lots of stuff that's pushed in late. Example output from the script. The Max value is the one that alarmed in that log :

61793 commands parsed
221 more than 150 seconds late
Mean - 5.319308012234395
Min - -13
Max - 47045

Our more-loaded datacenter does not have much pushed in late, but it does have a handful of things pushed in with a timestamp of 0. I am ignoring TS 0 commands to calculate the mean :

TS 0 :
[1426769052.605709] [128.1] [pid=23992] Command Entry Time: 0
TS 0 :
[1426797562.762583] [128.1] [pid=23992] Command Entry Time: 0
TS 0 :
[1426802320.514468] [128.1] [pid=10106] Command Entry Time: 0
TS 0 :
[1426812930.629103] [128.1] [pid=10106] Command Entry Time: 0
173776 commands parsed
0 more than 150 seconds late
Mean - 1.916910275296934
Min - -2
Max - 1426812930

Here's more info on one of those timestamp 0 entries...

gostatus pushes it in with a valid timestamp :
2015-03-19 20:39:22.762422221 +0000 UTC: COMMAND [1426797562] PROCESS_SERVICE_CHECK_RESULT;host-05;Redis Uptime Check redis05;0;OK - Redis instance at host-05:6379 uptime is 4408097 seconds. | uptime=4408097

nagios.debug logs it nearly instantly, with a timestamp of 0 :
[1426797562.762583] [128.1] [pid=23992] Command Entry Time: 0
[1426797562.762586] [128.1] [pid=23992] Command Arguments: host-05;Redis Uptime Check redis05;0;OK - Redis instance at host-05:6379 uptime is 4408097 seconds. | uptime=4408097

So no matter how long we set the freshness threshold to be, it won't matter if commands are being processed with timestamps that are 12 hours old, or 0.

vinzclortho · Post by **vinzclortho** » Fri Mar 20, 2015 3:44 pm

One more note on the 12-hour old command - it does not show up in nagios.debug at the time that it was sent in by gostatus. So it's not as if it is an old result that got accepted once and then replayed somehow. It appears to have been submitted, then sat on the cmd file for 12+ hours, then picked up by nagios.

Thanks again for the help tracking this down!

Nagios Support Forum

Invalid Freshness Alarms

Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms