host rechecks too fast
Posted: Wed Nov 27, 2013 8:30 pm
I'm trying to figure out something which seems pretty straightforward: the relationship between recheck_interval and max_check_attempts. I thought I understood what they're supposed to do, but it isn't working out that way. From reading the docs, recheck_interval should be the amount of time between checks after the first SOFT DOWN state is determined.
But instead the recheck interval seems to be working out to roughly the time between SOFT DOWN and HARD DOWN and the attempts are being squeezed into that interval.
Here are some of the tests and configurations I tried...
check_interval = 1, recheck_interval = 1, max_check_attempts = 6, result...
[2013-11-27 15:37:30] HOST ALERT: ash;DOWN;HARD;6;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:37:14] HOST ALERT: ash;DOWN;SOFT;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:37:09] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:36:32] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:36:26] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:36:10] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
I would think that the host would be rechecked every minute for 6 minutes until a HARD DOWN state is determined. But that's not what's happening. Instead, what happens is Nagios checks the host 6 times within about 1.5 minutes.
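To put numbers on that, here is a minimal Python sketch that parses the six log lines above and prints the gap in seconds before each attempt. The function name attempt_gaps is just something made up for this post, and the parsing assumes the semicolon-separated HOST ALERT layout shown in these lines.

```python
from datetime import datetime

# Log lines copied from the first test above (newest first, as pasted).
LOG = """\
[2013-11-27 15:37:30] HOST ALERT: ash;DOWN;HARD;6;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:37:14] HOST ALERT: ash;DOWN;SOFT;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:37:09] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:36:32] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:36:26] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:36:10] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
"""


def attempt_gaps(log_text):
    """Return (attempt number, seconds since previous attempt) pairs, oldest first."""
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("["):
            continue  # skip anything that is not a timestamped log line
        stamp, _, rest = line[1:].partition("] ")
        # Assumed layout: "HOST ALERT: host;state;state type;attempt;plugin output"
        fields = rest.split(";", 4)
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        entries.append((when, int(fields[3])))
    entries.sort()  # the log is pasted newest first, so put it in time order
    return [
        (cur[1], int((cur[0] - prev[0]).total_seconds()))
        for prev, cur in zip(entries, entries[1:])
    ]


gaps = attempt_gaps(LOG)
for attempt, seconds in gaps:
    print(f"attempt {attempt}: {seconds}s after the previous one")
print("first SOFT to HARD:", sum(seconds for _, seconds in gaps), "seconds")
```

For this first test the gaps come out to 16, 6, 37, 5 and 16 seconds, 80 seconds in total, nowhere near one recheck per minute.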
OK, so I tried this too just to see what would happen: check_interval = 1, recheck_interval = 2, max_check_attempts = 6
[2013-11-27 16:01:05] HOST ALERT: ash;DOWN;HARD;6;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:01:01] HOST ALERT: ash;DOWN;SOFT;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:00:33] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:00:01] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:59:01] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:58:29] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: rta nan, lost 100%
As you can see, now the max attempts are being spread out within a 2.5 minute period.
OK, so another test: check_interval = 1, recheck_interval = 6, max_check_attempts = 6. I would expect one check approximately every minute for a total of 6 minutes. But no...
[2013-11-27 16:20:10] HOST ALERT: ash;DOWN;HARD;6;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:19:10] HOST ALERT: ash;DOWN;SOFT;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:18:10] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:17:10] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:16:31] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:16:12] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
Results show an interval of about 4 minutes, wherever that came from, with the 2nd and 3rd check happening too soon, but subsequent checks occurring at exactly 60 second intervals.
Now, this might seem backwards to some: why recheck less often after a state change? To avoid having to explain that, and to show that it still doesn't work as expected, I did another test that does the opposite: check less frequently when the host is OK, and more frequently when it is not.
check_interval = 5, recheck_interval = 1, max_check_attempts = 5. What should happen is that after the first SOFT DOWN, Nagios rechecks every minute for 5 minutes. But nope, I get 5 checks in less than 2 minutes...
[2013-11-27 17:17:15] HOST ALERT: ash;DOWN;HARD;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 17:17:11] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 17:16:40] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 17:16:11] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 17:15:36] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
So, there's really something I'm missing here.
Here are the interval settings from nagios.cfg
command_check_interval=-1
auto_rescheduling_interval=30
retention_update_interval=240
interval_length=60
service_freshness_check_interval=60
host_freshness_check_interval=60
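With interval_length=60, and taking the reading of the docs from the top of this post literally (recheck_interval time units between rechecks once the first SOFT DOWN is logged), a rough sketch of the time I expected between the first SOFT DOWN and the HARD DOWN looks like this. The function name is made up for illustration, the formula is only my assumption about how the settings combine, and the observed values are the timestamp differences from the four logs above.

```python
INTERVAL_LENGTH = 60  # seconds per time unit, from interval_length in nagios.cfg


def expected_soft_to_hard_seconds(recheck_interval, max_check_attempts,
                                  interval_length=INTERVAL_LENGTH):
    """Assumed model: attempt 1 is the first SOFT DOWN, then one recheck
    every recheck_interval time units until max_check_attempts is reached."""
    return (max_check_attempts - 1) * recheck_interval * interval_length


# (recheck_interval, max_check_attempts, observed seconds from first SOFT to HARD)
TESTS = [(1, 6, 80), (2, 6, 156), (6, 6, 238), (1, 5, 99)]

for recheck, attempts, observed in TESTS:
    expected = expected_soft_to_hard_seconds(recheck, attempts)
    print(f"recheck_interval={recheck} max_check_attempts={attempts}: "
          f"expected {expected}s, observed {observed}s")
```

Under that assumption the four tests should have taken 300, 600, 1800 and 240 seconds, against the 80, 156, 238 and 99 seconds actually logged.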
So, is this normal? I just don't get it. Seems bonehead simple, but for some reason it is eluding me.
Any words from the wise would be much appreciated!