Invalid Freshness Alarms

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
vinzclortho
Posts: 13
Joined: Mon Mar 16, 2015 9:14 am

Re: Invalid Freshness Alarms

Post by vinzclortho »

And one more note - we have the default max_check_result_file_age value (3600). Shouldn't this prevent these really old check results from being processed?
User avatar
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia
Contact:

Re: Invalid Freshness Alarms

Post by Box293 »

If you watch the check_result_path are there any files sitting in there for a long time? 12 hours?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
vinzclortho
Posts: 13
Joined: Mon Mar 16, 2015 9:14 am

Re: Invalid Freshness Alarms

Post by vinzclortho »

I have been keeping an eye on stuff in check_result_path. There is generally nothing in there that's more than a minute or so old. A few of the servers did have some very old files - weeks or months old. I suspect these were from crashes - we previously had issues with livestatus causing nagios to crash. I have moved the old results out.

I am now running "while true; do find . -amin 2; sleep 30; done" in the check_result_path on all four hosts and have not seen anything yet.
vinzclortho
Posts: 13
Joined: Mon Mar 16, 2015 9:14 am

Re: Invalid Freshness Alarms

Post by vinzclortho »

I have modified my script to tail the nagios.debug log and watch for old ( > 150 second ) commands being pushed in. It has turned up lots of stuff :

Code: Select all

[1427128588.816659] [128.1] [pid=9556] Command Entry Time: 1427128385
large diff : 203 - 1427128588 1427128385
[1427131034.299819] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 133 - 1427131034 1427130901
[1427131049.299997] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 148 - 1427131049 1427130901
[1427131064.300130] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 163 - 1427131064 1427130901
[1427131079.300334] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 178 - 1427131079 1427130901
[1427131274.355924] [128.1] [pid=9556] Command Entry Time: 1427131142
large diff : 132 - 1427131274 1427131142
[1427131289.356070] [128.1] [pid=9556] Command Entry Time: 1427131142
large diff : 147 - 1427131289 1427131142
[1427131319.390478] [128.1] [pid=9556] Command Entry Time: 1427131142
large diff : 177 - 1427131319 1427131142
[1427132309.562493] [128.1] [pid=9556] Command Entry Time: 1427132186
large diff : 123 - 1427132309 1427132186
[1427132324.562721] [128.1] [pid=9556] Command Entry Time: 1427132186
large diff : 138 - 1427132324 1427132186
During this time, I haven't seen anything > 2 minutes old in the checkresults directory. Furthermore, grepping through the checkresults dir never turns up any of these passive checks that we are pushing in. Everything in the checkresults dir seems to be coming from our check_mk agent checks.

Notice that several of these are the same commands being pushed in multiple times. Are passive results sent in via the command file queued or cached somewhere else within nagios? These repeated commands are only being pushed in once via gostatus.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Invalid Freshness Alarms

Post by tmcdonald »

Do you by chance use NTP? I suppose in theory we could be seeing some time drift correction that would account for a few seconds' drift in either direction, and then ntpd corrects this drift causing either a past or a future discrepancy. Things typically work well when server time is lagging, but not so well when it is in the future.
Former Nagios employee
vinzclortho
Posts: 13
Joined: Mon Mar 16, 2015 9:14 am

Re: Invalid Freshness Alarms

Post by vinzclortho »

We do have NTP configured and monitored with the built-in check_mk agent checks. None of the hosts are off more than a few milliseconds. Even if they were, that would not explain checks that have a 0 timestamp or the repeated checks.
User avatar
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Invalid Freshness Alarms

Post by tgriep »

Can you post or PM the scripts/checks that you are using and how they are ran so we can review them?
Be sure to check out our Knowledgebase for helpful articles and solutions!
vinzclortho
Posts: 13
Joined: Mon Mar 16, 2015 9:14 am

Re: Invalid Freshness Alarms

Post by vinzclortho »

After some spelunking in the nagios source, I think I have discovered how this was happening, if not necessarily why.

I set debug_verbosity=2, which adds "Raw command entry:" logging. From the source, nagios logs Raw Commands before it processes any of the command entries. I noticed that there is a bunch of cruft that builds up in the 'cmd' variable that is logged. For instance :

Code: Select all

[1427382625.039946] [128.1] [pid=26200] External Command Type: 30
[1427382625.039955] [128.1] [pid=26200] Command Entry Time: 1427382625
[1427382625.039959] [128.1] [pid=26200] Command Arguments: host-103;Memcached Check cache-02;0;OK - memcached on host-103:11213 alive: elapsed 0.315 ms
[1427382625.129476] [128.2] [pid=26200] Raw command entry: COMMAND [1427382625] PROCESS_SERVICE_CHECK_RESULT;host-105;Memcached Check cache;0;OK - memcached on host-105:11212 alive: elapsed 0.261 ms
sed 0.315 ms
=4731533
9448
ime=4778691
k cache-02;0;OK - memcached on host-104:11213 alive: elapsed 0.288 ms
COMMAND [1427382618] PROCESS_SERVICE_CHECK_RESULT;virtual-01;Site LB 502 Error Rate;0;OK:  0
COMMAND [1427382618] PROCESS_SERVICE_CHECK_RESULT;virtual-01;Site LB 504 Error Rate;0;OK:  0
219 alive: elapsed 0.189 ms
Everything after the Raw command entry: line is bogus old check data. Notice that the timestamps are 7 seconds old at this point - this was just after a nagios restart. The cruft continues to build up in this variable over time. A longer log showing the cruft building up and changing is available at https://gist.githubusercontent.com/enic ... tfile1.txt

I don't know why this is - gostatus writes newline-delimited text to the command file and it is processed normally almost all of the time. I tried disabling livestatus to see if it was somehow stomping on gostatus in the command file, but I still saw the corruption.

In the meantime, we have rewritten gostatus to write files to the check_result_path rather than submitting these checks to the command pipe. It has not been running in production for long, but we have not seen any issues with late timestamps so far.

I will call this fixed in a few days if the check_results files approach works, but it would still be nice to know why the command file was getting corrupted like that.
cmerchant
Posts: 546
Joined: Wed Sep 24, 2014 11:19 am

Re: Invalid Freshness Alarms

Post by cmerchant »

It seems that the command pipe should work, although I don't know if there is a lock mechanism that would prevent the clobbered write. Keep us posted, thanks.
User avatar
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia
Contact:

Re: Invalid Freshness Alarms

Post by Box293 »

I was playing around with the command pipe the other day and I did observe similar behaviour.

What I found was, if you didn't submit he command with a "\n" at the end then it wouldn't work but was "appended/kept in the buffer" and then later when a correct command was issued the previous kept information was also submitted.

You can see the location of "\n" in the following example:

Code: Select all

now_epoch=$(eval date +%s); printf "[$now_epoch] ADD_HOST_COMMENT;centos03;1;nagiosadmin;This host does funny stuff\n" > /usr/local/nagios/var/rw/nagios.cmd
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Locked