Invalid Freshness Alarms
-
- Posts: 13
- Joined: Mon Mar 16, 2015 9:14 am
Re: Invalid Freshness Alarms
And one more note - we have the default max_check_result_file_age value (3600). Shouldn't this prevent these really old check results from being processed?
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
Re: Invalid Freshness Alarms
If you watch the check_result_path are there any files sitting in there for a long time? 12 hours?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
-
- Posts: 13
- Joined: Mon Mar 16, 2015 9:14 am
Re: Invalid Freshness Alarms
I have been keeping an eye on stuff in check_result_path. There is generally nothing in there that's more than a minute or so old. A few of the servers did have some very old files - weeks or months old. I suspect these were from crashes - we previously had issues with livestatus causing nagios to crash. I have moved the old results out.
I am now running "while true; do find . -amin 2; sleep 30; done" in the check_result_path on all four hosts and have not seen anything yet.
I am now running "while true; do find . -amin 2; sleep 30; done" in the check_result_path on all four hosts and have not seen anything yet.
-
- Posts: 13
- Joined: Mon Mar 16, 2015 9:14 am
Re: Invalid Freshness Alarms
I have modified my script to tail the nagios.debug log and watch for old ( > 150 second ) commands being pushed in. It has turned up lots of stuff :
During this time, I haven't seen anything > 2 minutes old in the checkresults directory. Furthermore, grepping through the checkresults dir never turns up any of these passive checks that we are pushing in. Everything in the checkresults dir seems to be coming from our check_mk agent checks.
Notice that several of these are the same commands being pushed in multiple times. Are passive results sent in via the command file queued or cached somewhere else within nagios? These repeated commands are only being pushed in once via gostatus.
Code: Select all
[1427128588.816659] [128.1] [pid=9556] Command Entry Time: 1427128385
large diff : 203 - 1427128588 1427128385
[1427131034.299819] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 133 - 1427131034 1427130901
[1427131049.299997] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 148 - 1427131049 1427130901
[1427131064.300130] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 163 - 1427131064 1427130901
[1427131079.300334] [128.1] [pid=9556] Command Entry Time: 1427130901
large diff : 178 - 1427131079 1427130901
[1427131274.355924] [128.1] [pid=9556] Command Entry Time: 1427131142
large diff : 132 - 1427131274 1427131142
[1427131289.356070] [128.1] [pid=9556] Command Entry Time: 1427131142
large diff : 147 - 1427131289 1427131142
[1427131319.390478] [128.1] [pid=9556] Command Entry Time: 1427131142
large diff : 177 - 1427131319 1427131142
[1427132309.562493] [128.1] [pid=9556] Command Entry Time: 1427132186
large diff : 123 - 1427132309 1427132186
[1427132324.562721] [128.1] [pid=9556] Command Entry Time: 1427132186
large diff : 138 - 1427132324 1427132186
Notice that several of these are the same commands being pushed in multiple times. Are passive results sent in via the command file queued or cached somewhere else within nagios? These repeated commands are only being pushed in once via gostatus.
Re: Invalid Freshness Alarms
Do you by chance use NTP? I suppose in theory we could be seeing some time drift correction that would account for a few seconds' drift in either direction, and then ntpd corrects this drift causing either a past or a future discrepancy. Things typically work well when server time is lagging, but not so well when it is in the future.
Former Nagios employee
-
- Posts: 13
- Joined: Mon Mar 16, 2015 9:14 am
Re: Invalid Freshness Alarms
We do have NTP configured and monitored with the built-in check_mk agent checks. None of the hosts are off more than a few milliseconds. Even if they were, that would not explain checks that have a 0 timestamp or the repeated checks.
Re: Invalid Freshness Alarms
Can you post or PM the scripts/checks that you are using and how they are ran so we can review them?
Be sure to check out our Knowledgebase for helpful articles and solutions!
-
- Posts: 13
- Joined: Mon Mar 16, 2015 9:14 am
Re: Invalid Freshness Alarms
After some spelunking in the nagios source, I think I have discovered how this was happening, if not necessarily why.
I set debug_verbosity=2, which adds "Raw command entry:" logging. From the source, nagios logs Raw Commands before it processes any of the command entries. I noticed that there is a bunch of cruft that builds up in the 'cmd' variable that is logged. For instance :
Everything after the Raw command entry: line is bogus old check data. Notice that the timestamps are 7 seconds old at this point - this was just after a nagios restart. The cruft continues to build up in this variable over time. A longer log showing the cruft building up and changing is available at https://gist.githubusercontent.com/enic ... tfile1.txt
I don't know why this is - gostatus writes newline-delimited text to the command file and it is processed normally almost all of the time. I tried disabling livestatus to see if it was somehow stomping on gostatus in the command file, but I still saw the corruption.
In the meantime, we have rewritten gostatus to write files to the check_result_path rather than submitting these checks to the command pipe. It has not been running in production for long, but we have not seen any issues with late timestamps so far.
I will call this fixed in a few days if the check_results files approach works, but it would still be nice to know why the command file was getting corrupted like that.
I set debug_verbosity=2, which adds "Raw command entry:" logging. From the source, nagios logs Raw Commands before it processes any of the command entries. I noticed that there is a bunch of cruft that builds up in the 'cmd' variable that is logged. For instance :
Code: Select all
[1427382625.039946] [128.1] [pid=26200] External Command Type: 30
[1427382625.039955] [128.1] [pid=26200] Command Entry Time: 1427382625
[1427382625.039959] [128.1] [pid=26200] Command Arguments: host-103;Memcached Check cache-02;0;OK - memcached on host-103:11213 alive: elapsed 0.315 ms
[1427382625.129476] [128.2] [pid=26200] Raw command entry: COMMAND [1427382625] PROCESS_SERVICE_CHECK_RESULT;host-105;Memcached Check cache;0;OK - memcached on host-105:11212 alive: elapsed 0.261 ms
sed 0.315 ms
=4731533
9448
ime=4778691
k cache-02;0;OK - memcached on host-104:11213 alive: elapsed 0.288 ms
COMMAND [1427382618] PROCESS_SERVICE_CHECK_RESULT;virtual-01;Site LB 502 Error Rate;0;OK: 0
COMMAND [1427382618] PROCESS_SERVICE_CHECK_RESULT;virtual-01;Site LB 504 Error Rate;0;OK: 0
219 alive: elapsed 0.189 ms
I don't know why this is - gostatus writes newline-delimited text to the command file and it is processed normally almost all of the time. I tried disabling livestatus to see if it was somehow stomping on gostatus in the command file, but I still saw the corruption.
In the meantime, we have rewritten gostatus to write files to the check_result_path rather than submitting these checks to the command pipe. It has not been running in production for long, but we have not seen any issues with late timestamps so far.
I will call this fixed in a few days if the check_results files approach works, but it would still be nice to know why the command file was getting corrupted like that.
Re: Invalid Freshness Alarms
It seems that the command pipe should work, although I don't know if there is a lock mechanism that would prevent the clobbered write. Keep us posted, thanks.
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
Re: Invalid Freshness Alarms
I was playing around with the command pipe the other day and I did observe similar behaviour.
What I found was, if you didn't submit he command with a "\n" at the end then it wouldn't work but was "appended/kept in the buffer" and then later when a correct command was issued the previous kept information was also submitted.
You can see the location of "\n" in the following example:
What I found was, if you didn't submit he command with a "\n" at the end then it wouldn't work but was "appended/kept in the buffer" and then later when a correct command was issued the previous kept information was also submitted.
You can see the location of "\n" in the following example:
Code: Select all
now_epoch=$(eval date +%s); printf "[$now_epoch] ADD_HOST_COMMENT;centos03;1;nagiosadmin;This host does funny stuff\n" > /usr/local/nagios/var/rw/nagios.cmd
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.