
Re: host rechecks too fast

Posted: Mon Jan 13, 2014 9:58 am
by slansing
You can try switching your max_host_check_spread to 30 minutes instead of 1 minute in nagios.cfg and then restart Nagios. This is what that value does:
This option determines the maximum number of minutes from when Nagios starts that all hosts (that are scheduled to be regularly checked) are checked. This option will automatically adjust the host inter-check delay method (if necessary) to ensure that the initial checks of all hosts occur within the timeframe you specify. In general, this option will not have an effect on host check scheduling if scheduling information is being retained using the use_retained_scheduling_info option. Default value is 30 (minutes).
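As the quoted description says, this setting only governs when each host's initial check is placed after startup. The sketch below is a simplified model, not Nagios source. It assumes the "smart" inter-check delay is the average check interval divided by the number of hosts, capped so that all initial checks fit inside max_host_check_spread, and it assumes interval_length is left at 60 seconds. The function name is hypothetical. Note that retry_interval does not appear in it at all.

```python
from typing import List


def initial_host_check_offsets(num_hosts: int,
                               avg_check_interval_min: float,
                               max_host_check_spread_min: float) -> List[float]:
    """Seconds after startup at which each host's first check would run.

    Simplified model (assumption, not Nagios source): 'smart' delay is the
    average check interval divided by the host count, capped so every
    initial check lands inside max_host_check_spread. Assumes
    interval_length=60, so intervals are in minutes.
    """
    if num_hosts <= 0:
        return []
    # 'Smart' delay: spread hosts evenly over one average check interval.
    delay = avg_check_interval_min * 60 / num_hosts
    # Cap the delay so the initial checks fit within the configured spread.
    spread_cap = max_host_check_spread_min * 60 / num_hosts
    delay = min(delay, spread_cap)
    # Only first checks are placed here; retry_interval plays no part.
    return [i * delay for i in range(num_hosts)]
```

In this model, if every host used check_interval 1, the smart delay already fits inside a 1-minute spread, so raising max_host_check_spread to 30 would not change even the initial schedule, let alone the spacing of retries.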

Re: host rechecks too fast

Posted: Thu Jan 16, 2014 7:55 pm
by pfarthing6
Giving that a shot. Will post results soon. Thanks!

Re: host rechecks too fast

Posted: Fri Jan 17, 2014 11:47 am
by slansing
Great! Keep us up to date.

Re: host rechecks too fast

Posted: Fri Jan 17, 2014 3:13 pm
by pfarthing6
No joy I'm afraid. The change to that setting had no effect whatsoever.

The expectation with the settings below is that after SOFT DOWN 1, Nagios should retry every 2 minutes. Instead, it went through all 8 attempts within about 3 minutes.
Test host: Ash. Test settings: check_interval 1, retry_interval 2, max_check_attempts 8

[2014-01-17 12:03:37] HOST ALERT: ash;DOWN;HARD;8;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2014-01-17 12:02:50] HOST ALERT: ash;DOWN;SOFT;7;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2014-01-17 12:02:33] HOST ALERT: ash;DOWN;SOFT;6;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2014-01-17 12:02:22] HOST ALERT: ash;DOWN;SOFT;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2014-01-17 12:01:38] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2014-01-17 12:00:46] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2014-01-17 12:00:30] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: rta nan, lost 100%
[2014-01-17 12:00:18] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: rta nan, lost 100%
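To put numbers on the gap between expectation and reality, the sketch below takes the alert timestamps from the log above and compares the spacing with what retry_interval 2 would give. It assumes interval_length is left at its default of 60 seconds, so each retry should come about 120 s after the previous one and SOFT 1 to HARD 8 (7 retries) should take about 14 minutes.

```python
from datetime import datetime

# Alert timestamps for host "ash", copied from the log above (oldest first).
ALERT_TIMES = [
    "2014-01-17 12:00:18",  # SOFT 1
    "2014-01-17 12:00:30",  # SOFT 2
    "2014-01-17 12:00:46",  # SOFT 3
    "2014-01-17 12:01:38",  # SOFT 4
    "2014-01-17 12:02:22",  # SOFT 5
    "2014-01-17 12:02:33",  # SOFT 6
    "2014-01-17 12:02:50",  # SOFT 7
    "2014-01-17 12:03:37",  # HARD 8
]

RETRY_INTERVAL = 2       # retry_interval from the test settings
INTERVAL_LENGTH_S = 60   # assumption: interval_length left at its default
MAX_CHECK_ATTEMPTS = 8   # max_check_attempts from the test settings


def gaps_seconds(stamps):
    """Seconds elapsed between consecutive alert timestamps."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


observed = gaps_seconds(ALERT_TIMES)
expected_gap = RETRY_INTERVAL * INTERVAL_LENGTH_S
expected_total = (MAX_CHECK_ATTEMPTS - 1) * expected_gap

print("observed gaps (s):", observed)
print("observed SOFT 1 -> HARD 8:", sum(observed), "s")
print("expected gap:", expected_gap, "s; expected SOFT 1 -> HARD 8:", expected_total, "s")
```

The observed gaps range from 11 to 52 seconds and total 199 s (3 min 19 s), against an expected 120 s per retry and 840 s overall.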

I think the solution to this is a fresh deployment of Nagios 4.x with a vanilla nagios.cfg. But until I get to that, which will be a couple of weeks at least, I'm willing to try other tests if anyone can think of something.

Cheers!

Re: host rechecks too fast

Posted: Mon Jan 20, 2014 10:32 am
by slansing
We are going to try to draw up plans for some more tests on your core system. At this point, since what is causing this is still a mystery, I agree that you should roll out a new system and copy the configuration over so that it mirrors the current one.

Re: host rechecks too fast

Posted: Wed Feb 12, 2014 7:45 pm
by pfarthing6
Still haven't rolled out the newer version, but yes, that's the plan right now. But if in the meantime you'd like me to try anything else, that would be OK.

Thanks for all the effort =)

-t

Re: host rechecks too fast

Posted: Thu Feb 13, 2014 10:23 am
by slansing
Hmm, what do you have in the "broker_module=" section of your nagios.cfg?
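One quick way to gather what is being asked for is to pull every active broker_module= line out of nagios.cfg, since the directive can appear more than once. The sketch below assumes the usual nagios.cfg layout of one key=value pair per line with '#' comments; the function name is hypothetical, and the config path you pass in depends on your install.

```python
import sys


def broker_module_lines(cfg_text):
    """Return the value of every active broker_module= directive.

    Assumes one key=value pair per line and '#' comment lines;
    broker_module may legitimately appear several times.
    """
    modules = []
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and commented-out directives.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only; module arguments may contain '='.
        key, sep, value = line.partition("=")
        if sep and key.strip() == "broker_module":
            modules.append(value.strip())
    return modules


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 brokers.py /path/to/nagios.cfg  (path depends on your install)
    with open(sys.argv[1]) as f:
        for module in broker_module_lines(f.read()):
            print(module)
```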