Page 1 of 2

host rechecks too fast

Posted: Wed Nov 27, 2013 8:30 pm
by pfarthing6
I'm trying to figure out something which seems pretty straight forward, which is the relationship between recheck_interval and max_check_attempts. I thought I understand what they're supposed to do, but it isn't working out that way. From reading the docs, recheck_interval should be the amount of time between checks after the first SOFT DOWN state is determined.

But instead the recheck interval seems to be working out to roughly the time between SOFT DOWN and HARD DOWN and the attempts are being squeezed into that interval.

Here's some of the tests and configurations I did...

check_interval = 1, recheck_interval = 1, max_check_attempts = 6, result...

[2013-11-27 15:37:30] HOST ALERT: ash;DOWN;HARD;6;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:37:14] HOST ALERT: ash;DOWN;SOFT;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:37:09] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:36:32] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:36:26] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:36:10] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%

I would think that the host would be rechecked every minute for 6 minutes until a HARD DOWN state is determined. But that's not what's happening. Instead, what happens is Nagios checks the host 6 times within about 1.5 minutes.

OK, so I tried this too just to see what would happen: check interval = 1, recheck interval = 2, max attempts interval = 6

[2013-11-27 16:01:05] HOST ALERT: ash;DOWN;HARD;6;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:01:01] HOST ALERT: ash;DOWN;SOFT;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:00:33] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:00:01] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:59:01] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 15:58:29] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: rta nan, lost 100%

As you can see, now the max attempts are being spread out within a 2.5 minute period.

OK, so another test: check_interval = 1, recheck_interval = 6, max_check_attempts = 6. I would expect one check approximately every minute for a total of 6 minutes. But no...

[2013-11-27 16:20:10] HOST ALERT: ash;DOWN;HARD;6;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:19:10] HOST ALERT: ash;DOWN;SOFT;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:18:10] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:17:10] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:16:31] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 16:16:12] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%

Results show an interval of about 4 minutes, wherever that came from, with the 2nd and 3rd check happening too soon, but subsequent checks occurring at exactly 60 second intervals.

Now, this might seem backwards to some, why recheck less often after a state change. To avoid having to explain that and to show that it still doesn't work as expected/understood, I did another test that did the opposite, check less frequently if it's OK, and more frequently if not.

check_interval = 5, recheck_interval = 1, max_check_attempts = 5, what should happen is that after the first SOFT DOWN, Nagios should recheck every minute for 5 minutes, but nope, I get 5 checks in less than 2 minutes...

2013-11-27 17:17:15] HOST ALERT: ash;DOWN;HARD;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 17:17:11] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 17:16:40] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 17:16:11] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 17:15:36] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%

So, there's really something I'm missing here.

Here are the interval settings from nagios.cfg

command_check_interval=-1
auto_rescheduling_interval=30
retention_update_interval=240
interval_length=60
service_freshness_check_interval=60
host_freshness_check_interval=60

So, is this normal? I just don't get it. Seems bonehead simple, but for some reason it is eluding me.

Any words from the wise would be much appreciated!

Re: host rechecks too fast

Posted: Sun Dec 01, 2013 6:05 pm
by jsmurphy
That's rather interesting... it certainly doesn't behave in that manner for me and I've never seen it exhibit that quirk before. What version of Nagios are you using?

Re: host rechecks too fast

Posted: Mon Dec 02, 2013 1:35 pm
by tmcdonald
In addition to your version, what is your server load? Could be some funky scheduling going on if the load is high, though that wouldn't explain why things are going faster.

Re: host rechecks too fast

Posted: Tue Dec 03, 2013 4:22 pm
by pfarthing6
Running version 3.4.1, compiled from source.

I get load spikes about every 10 minutes to about 2.0, but the average is about 0.9.

Also, this isn't happening with services, only with host checks.

Any other configuration settings besides the ones related to intervals in the nagios.cfg that could cause this odd behavior?

Re: host rechecks too fast

Posted: Tue Dec 03, 2013 5:58 pm
by slansing
Can you show us one of these host's configuration files? And any templates associated with it? Preferably this host, 'ash'.

Re: host rechecks too fast

Posted: Wed Dec 04, 2013 3:46 pm
by pfarthing6
Ash was a test host and is no longer configured as such. But for each host down test, the parameters are included with the results in my original post. However the host template used is below.

I am including a configuration that is current for another host, Malheur, and the results of an unexpected host down and recovery for that host just this morning (yes same host template).

Code: Select all

define host{
    name                    MediumUrgencyHostTemplate
    check_period            AllTheTimeTimeperiod
    check_interval          1                      ; Actively check the host every 1 minutes
    retry_interval          1                      ; Schedule host check retries at 1 minute intervals
    max_check_attempts      3                      ; Check each host 3 times (max)
    check_command           CheckHostAliveIcmpCommand 
    notification_period     AllTheTimeExceptPatchRebootWindowTimeperiod
    notification_interval   120
    notification_options    d,r,f
    contact_groups          MediumUrgencyContactGroup
    register                0 
    }

define host{
    use                     MediumUrgencyHostTemplate
    host_name               malheur
    address                 10.102.122.22
    check_interval          1
    retry_interval          6
    max_check_attempts      5
    }

Code: Select all

[2013-12-04 02:36:21] HOST ALERT: malheur;UP;HARD;1;OK - 10.102.122.22: rta 20.562ms, lost 0%
[2013-12-04 02:36:11] HOST ALERT: malheur;DOWN;HARD;1;CRITICAL - 10.102.122.22: rta nan, lost 100%
[2013-12-04 02:34:00] HOST ALERT: malheur;UNREACHABLE;HARD;5;CRITICAL - 10.102.122.22: rta nan, lost 100%
[2013-12-04 02:32:58] HOST ALERT: malheur;UNREACHABLE;SOFT;4;CRITICAL - 10.102.122.22: rta nan, lost 100%
[2013-12-04 02:32:40] HOST ALERT: malheur;UNREACHABLE;SOFT;3;CRITICAL - 10.102.122.22: rta nan, lost 100%
[2013-12-04 02:32:27] HOST ALERT: malheur;UNREACHABLE;SOFT;2;CRITICAL - 10.102.122.22: rta nan, lost 100%
[2013-12-04 02:32:16] HOST ALERT: malheur;DOWN;SOFT;1;CRITICAL - 10.102.122.22: rta nan, lost 100%
I know this may look weird but as I said, but I'm trying different things. Here is another test with Ash with the same odd settings and also oddly, I'm getting the results that I want, 1 minute between rechecks and 5 checks total.

But as you can see by comparing the results between Ash and Malheur above, each with the exact same settings, same host template, same nagios setup, I got significantly different results.

Code: Select all

Ash Test Settings: check_interval = 1, recheck_interval = 6, max_check_attempts = 5
Ash Test Results...
2013-11-27 19:02:46] HOST ALERT: ash;DOWN;HARD;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 19:01:46] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 19:00:46] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 18:59:46] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-11-27 18:58:47] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
And if it helps, here's the entirety of the nagios.cfg file...

Code: Select all

--- nagios.cfg -----------------------------------
log_file=/usr/local/nagios/var/nagios.log
cfg_dir=/usr/local/nagios/etc/objects
object_cache_file=/usr/local/nagios/var/objects.cache
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
status_file=/usr/local/nagios/var/status.dat
status_update_interval=10
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
external_command_buffer_slots=4096
lock_file=/usr/local/nagios/var/nagios.lock
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/tmp
event_broker_options=-1
log_rotation_method=d
log_archive_path=/usr/local/nagios/var/archives
use_syslog=1
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_external_commands=1
log_passive_checks=1
service_inter_check_delay_method=s
max_service_check_spread=1
service_interleave_factor=s
host_inter_check_delay_method=s
max_host_check_spread=1
max_concurrent_checks=0
check_result_reaper_frequency=1
max_check_result_reaper_time=30
check_result_path=/usr/local/nagios/var/spool/checkresults
max_check_result_file_age=3600
cached_host_check_horizon=15
cached_service_check_horizon=15
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
soft_state_dependencies=1
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
sleep_time=0.25
service_check_timeout=90
host_check_timeout=70
event_handler_timeout=70
notification_timeout=70
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/usr/local/nagios/var/retention.dat
retention_update_interval=240
use_retained_program_state=1
use_retained_scheduling_info=0
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
interval_length=60
check_for_updates=1
bare_update_check=0
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=1
host_perfdata_command=ProcessHostPerfdataCommand
service_perfdata_command=ProcessServicePerfdataCommand
obsess_over_services=0
obsess_over_hosts=0
translate_passive_host_checks=0
passive_host_checks_are_soft=0
check_for_orphaned_services=1
check_for_orphaned_hosts=1
check_service_freshness=1
service_freshness_check_interval=60
check_host_freshness=0
host_freshness_check_interval=60
additional_freshness_latency=15
enable_flap_detection=1
low_service_flap_threshold=25.0
high_service_flap_threshold=40.0
low_host_flap_threshold=25.0
high_host_flap_threshold=40.0
date_format=iso8601
p1_file=/usr/local/nagios/bin/p1.pl
enable_embedded_perl=1
use_embedded_perl_implicitly=1
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"
use_regexp_matching=0
use_true_regexp_matching=0
[email protected]
[email protected]
daemon_dumps_core=0
use_large_installation_tweaks=0
enable_environment_macros=1
debug_level=0
debug_verbosity=2
debug_file=/usr/local/nagios/var/nagios.debug
max_debug_file_size=50000000
--- nagios .cfg -----------------

Re: host rechecks too fast

Posted: Wed Dec 04, 2013 5:57 pm
by slansing
Can you try temporarily disabling freshness checking globally in the nagios.cfg?:

Code: Select all

check_service_freshness=0
service_freshness_check_interval=60
Then restart nagios with:

Code: Select all

service nagios stop

service nagios start

Re: host rechecks too fast

Posted: Thu Dec 19, 2013 4:48 pm
by pfarthing6
Took a while to get back to this, but no, suggestion of disabling the freshness check didn't work. Also just to say again, this isn't a problem for the services. They work as expected. It's the host checks that do not. Though I know it's only a difference in definition, it still manifests as a deviation in check frequency for the hosts. Here's an example output with the suggested configuration.

Test host config: check_interval 3, retry_interval 1, max_check_attempts 8

[2013-12-19 13:40:17] HOST ALERT: ash;DOWN;HARD;8;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-12-19 13:40:03] HOST ALERT: ash;DOWN;SOFT;7;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-12-19 13:39:57] HOST ALERT: ash;DOWN;SOFT;6;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-12-19 13:39:53] HOST ALERT: ash;DOWN;SOFT;5;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-12-19 13:39:14] HOST ALERT: ash;DOWN;SOFT;4;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-12-19 13:38:59] HOST ALERT: ash;DOWN;SOFT;3;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-12-19 13:38:53] HOST ALERT: ash;DOWN;SOFT;2;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%
[2013-12-19 13:37:55] HOST ALERT: ash;DOWN;SOFT;1;CRITICAL - 192.168.10.212: Host unreachable @ 192.168.10.161. rta nan, lost 100%

As you can see, some of the rechecks are about 1 minute apart, but on the 4th check they become random and closer together. So, basically no change in the problem =(

Re: host rechecks too fast

Posted: Thu Dec 19, 2013 6:19 pm
by slansing
Please show the output of some of these more obscure options:

Code: Select all

grep "check_host_freshness" /usr/local/nagios/etc/nagios.cfg
grep "host_freshness_check_interval" /usr/local/nagios/etc/nagios.cfg
grep "host_inter_check_delay_method" /usr/local/nagios/etc/nagios.cfg
grep "max_host_check_spread" /usr/local/nagios/etc/nagios.cfg

Re: host rechecks too fast

Posted: Sat Jan 11, 2014 2:33 pm
by pfarthing6
Sorry for the delay, holidays and all, and took a while to get back to this. But here ya go...

check_host_freshness=0
host_freshness_check_interval=60
host_inter_check_delay_method=s
max_host_check_spread=1