Invalid Freshness Alarms

vinzclortho · Post by **vinzclortho** » Fri Mar 20, 2015 4:23 pm

And one more note - we have the default max_check_result_file_age value (3600). Shouldn't this prevent these really old check results from being processed?

Post by **Box293** » Sun Mar 22, 2015 7:15 pm

If you watch the check_result_path are there any files sitting in there for a long time? 12 hours?

vinzclortho · Post by **vinzclortho** » Mon Mar 23, 2015 10:37 am

I have been keeping an eye on stuff in check_result_path. There is generally nothing in there that's more than a minute or so old. A few of the servers did have some very old files - weeks or months old. I suspect these were from crashes - we previously had issues with livestatus causing nagios to crash. I have moved the old results out.

I am now running "while true; do find . -amin 2; sleep 30; done" in the check_result_path on all four hosts and have not seen anything yet.

vinzclortho · Post by **vinzclortho** » Mon Mar 23, 2015 12:45 pm

I have modified my script to tail the nagios.debug log and watch for old ( > 150 second ) commands being pushed in. It has turned up lots of stuff :

Code: Select all

[1427128588.816659] [128.1] [pid=9556] Command Entry Time: 1427128385
large diff : 203 - 1427128588 1427128385
[1427131034.299819] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 133 - 1427131034 1427130901
[1427131049.299997] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 148 - 1427131049 1427130901
[1427131064.300130] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 163 - 1427131064 1427130901
[1427131079.300334] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 178 - 1427131079 1427130901
[1427131274.355924] [128.1] [pid=9556] Command Entry Time: 1427131142
large diff : 132 - 1427131274 1427131142
[1427131289.356070] [128.1] [pid=9556] Command Entry Time: 1427131142
large diff : 147 - 1427131289 1427131142
[1427131319.390478] [128.1] [pid=9556] Command Entry Time: 1427131142
large diff : 177 - 1427131319 1427131142
[1427132309.562493] [128.1] [pid=9556] Command Entry Time: 1427132186
large diff : 123 - 1427132309 1427132186
[1427132324.562721] [128.1] [pid=9556] Command Entry Time: 1427132186
large diff : 138 - 1427132324 1427132186

During this time, I haven't seen anything > 2 minutes old in the checkresults directory. Furthermore, grepping through the checkresults dir never turns up any of these passive checks that we are pushing in. Everything in the checkresults dir seems to be coming from our check_mk agent checks.

Notice that several of these are the same commands being pushed in multiple times. Are passive results sent in via the command file queued or cached somewhere else within nagios? These repeated commands are only being pushed in once via gostatus.

tmcdonald · Post by **tmcdonald** » Tue Mar 24, 2015 5:13 pm

Do you by chance use NTP? I suppose in theory we could be seeing some time drift correction that would account for a few seconds' drift in either direction, and then ntpd corrects this drift causing either a past or a future discrepancy. Things typically work well when server time is lagging, but not so well when it is in the future.

vinzclortho · Post by **vinzclortho** » Wed Mar 25, 2015 9:54 am

We do have NTP configured and monitored with the built-in check_mk agent checks. None of the hosts are off more than a few milliseconds. Even if they were, that would not explain checks that have a 0 timestamp or the repeated checks.

Post by **tgriep** » Wed Mar 25, 2015 4:40 pm

Can you post or PM the scripts/checks that you are using and how they are ran so we can review them?

vinzclortho · Post by **vinzclortho** » Thu Mar 26, 2015 11:23 am

After some spelunking in the nagios source, I think I have discovered how this was happening, if not necessarily why.

I set debug_verbosity=2, which adds "Raw command entry:" logging. From the source, nagios logs Raw Commands before it processes any of the command entries. I noticed that there is a bunch of cruft that builds up in the 'cmd' variable that is logged. For instance :

Code: Select all

[1427382625.039946] [128.1] [pid=26200] External Command Type: 30
[1427382625.039955] [128.1] [pid=26200] Command Entry Time: 1427382625
[1427382625.039959] [128.1] [pid=26200] Command Arguments: host-103;Memcached Check cache-02;0;OK - memcached on host-103:11213 alive: elapsed 0.315 ms
[1427382625.129476] [128.2] [pid=26200] Raw command entry: COMMAND [1427382625] PROCESS_SERVICE_CHECK_RESULT;host-105;Memcached Check cache;0;OK - memcached on host-105:11212 alive: elapsed 0.261 ms
sed 0.315 ms
=4731533
9448
ime=4778691
k cache-02;0;OK - memcached on host-104:11213 alive: elapsed 0.288 ms
COMMAND [1427382618] PROCESS_SERVICE_CHECK_RESULT;virtual-01;Site LB 502 Error Rate;0;OK:  0
COMMAND [1427382618] PROCESS_SERVICE_CHECK_RESULT;virtual-01;Site LB 504 Error Rate;0;OK:  0
219 alive: elapsed 0.189 ms

Everything after the Raw command entry: line is bogus old check data. Notice that the timestamps are 7 seconds old at this point - this was just after a nagios restart. The cruft continues to build up in this variable over time. A longer log showing the cruft building up and changing is available at https://gist.githubusercontent.com/enic ... tfile1.txt

I don't know why this is - gostatus writes newline-delimited text to the command file and it is processed normally almost all of the time. I tried disabling livestatus to see if it was somehow stomping on gostatus in the command file, but I still saw the corruption.

In the meantime, we have rewritten gostatus to write files to the check_result_path rather than submitting these checks to the command pipe. It has not been running in production for long, but we have not seen any issues with late timestamps so far.

I will call this fixed in a few days if the check_results files approach works, but it would still be nice to know why the command file was getting corrupted like that.

cmerchant · Post by **cmerchant** » Thu Mar 26, 2015 2:49 pm

It seems that the command pipe should work, although I don't know if there is a lock mechanism that would prevent the clobbered write. Keep us posted, thanks.

Post by **Box293** » Thu Mar 26, 2015 4:47 pm

I was playing around with the command pipe the other day and I did observe similar behaviour.

What I found was, if you didn't submit he command with a "\n" at the end then it wouldn't work but was "appended/kept in the buffer" and then later when a correct command was issued the previous kept information was also submitted.

You can see the location of "\n" in the following example:

Code: Select all

now_epoch=$(eval date +%s); printf "[$now_epoch] ADD_HOST_COMMENT;centos03;1;nagiosadmin;This host does funny stuff\n" > /usr/local/nagios/var/rw/nagios.cmd

Nagios Support Forum

Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms

Re: Invalid Freshness Alarms