Service rescheduled outside a valid check period

pako · Post by **pako** » Thu May 12, 2016 9:59 am

We have recently moved from a nagios version 3 to version 4.0.8 (the version available in centos/epel 7). Some of the checks in our installation have to run once a day,
they worked flawlessly in the older nagios but have now stopped working. Digging into this I have found that one reason seem to be that the services are scheduled to
run outside their valid check period. When nagios does it pre-servicecheck-checking it seems to disqualify that run and re-schedules it. Mostly it seem to be scheduled incorrectly again, but once in a while it hits within it's check-period and executes.

check_interval is set at 31 minutes, check periods are 10 minutes long to make sure that the check is run only once

Code: Select all

define service{
        use                             generic-service
        name                            generic-preflight ; The 'name' of this service template

        notification_interval           1440
        notification_options            w,u,c
        check_interval                  31
        retry_interval                  31
        max_check_attempts              1

        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

Code: Select all

define timeperiod{
   timeperiod_name testservice-timeperiod
   monday          20:50-21:00
   tuesday         20:50-21:00
   wednesday       20:50-21:00
   thursday        20:50-21:00
   friday          20:50-21:00
}

Here is some debug lines showing the behavior. The initial attempt actually got scheduled within the check-period:

[1463100977.111773] [008.0] [pid=15327] ** Timed Event ** Type: EVENT_SERVICE_CHECK, Run Time: Thu May 12 20:56:17 2016
[1463100977.111785] [008.0] [pid=15327] ** Service Check Event ==> Host: 'testhost', Service: 'testservice', Options: 0, Latency: 0.000000 sec
[1463100977.111864] [001.0] [pid=15327] run_scheduled_service_check() start

Check is run normally and a notification sent out, then this follows:

[1463100977.132755] [2048.1] [pid=15327] **** END MACRO PROCESSING *************
[1463100977.132794] [064.1] [pid=15327] Making callbacks (type 2)...
[1463100977.132812] [001.0] [pid=15327] macros_to_kvv()
[1463100977.132829] [064.1] [pid=15327] Making callbacks (type 21)...
[1463100977.132835] [064.1] [pid=15327] Making callbacks (type 20)...
[1463100977.132843] [001.0] [pid=15327] get_next_service_notification_time()
[1463100977.132865] [032.0] [pid=15327] 1 contacts were notified. Next possible notification time: Fri May 13 20:56:17 2016
[1463100977.132876] [032.0] [pid=15327] 1 contacts were notified.
[1463100977.132882] [064.1] [pid=15327] Making callbacks (type 5)...
[1463100977.132887] [064.1] [pid=15327] Making callbacks (type 13)...
[1463100977.132894] [001.0] [pid=15327] handle_service_event()
[1463100977.132901] [064.1] [pid=15327] Making callbacks (type 23)...
[1463100977.132932] [016.1] [pid=15327] Rescheduling next check of service at Thu May 12 21:27:17 2016
[1463100977.133449] [001.0] [pid=15327] get_next_valid_time()
[1463100977.133540] [001.0] [pid=15327] _get_matching_timerange()
[1463100977.133579] [001.0] [pid=15327] _get_matching_timerange()
[1463100977.133639] [001.0] [pid=15327] _get_matching_timerange()
[1463100977.133693] [001.0] [pid=15327] schedule_service_check()
[1463100977.133769] [016.0] [pid=15327] Scheduling a non-forced, active check of service 'testservice' on host 'testhost' @ Fri May 13 21:12:06 2016
[1463100977.133793] [001.0] [pid=15327] add_event()

I am guessing that the first reschedule line concerns the 31 minute check_interval for the service, which is then "corrected" after the check_period is checked.
But as can be seen from the output above the service is scheduled to run outside a valid check-period, this will not work and the service will be
rescheduled again. It seems to be random whether it will be scheduled to a valid point in time for execution or if it will have to do another reschedule-cycle.

Is this a known bug or has the configuration logic changed in Nagios 4? The default 24x7 check period works without any issues so this is isolated to our
once-a-day checks.

Post by **tgriep** » Thu May 12, 2016 4:58 pm

What I think you are seeing that the first line is the service check but the second line "that is out of the period" is actually the next time a notification can be sent out which is 24 hours.
Did the service check run at that time period the next day?

Also, take a look at his post to see if this helps out.
https://support.nagios.com/forum/viewto ... =7&t=25216

pako · Post by **pako** » Fri May 13, 2016 9:02 am

Thanks for the reply

The problem described in my first post is real, service checks are not run, it is unrelated to notifications. All notifications that would be sent out
are properly done so. The problem is that checks are not run and as far as I have found out, they seem to be incorrectly scheduled outside
a valid check_period, hence when they are supposed to run they are not allowed to and get rescheduled. And when they are rescheduled they
most often get scheduled to an incorrect time again.

This is how the resulting check went from my first example. Note that the time when the check was scheduled to run at first was
Fri May 13 21:12:06 2016, now it has moved to two minutes later for no apparent reason, still both times are outside a valid check_period.

[1463188453.771348] [008.0] [pid=17917] ** Timed Event ** Type: EVENT_SERVICE_CHECK, Run Time: Fri May 13 21:14:13 2016
[1463188453.771356] [008.0] [pid=17917] ** Service Check Event ==> Host: 'testhost', Service: 'testservice', Options: 0, Latency: 0.000000 sec
[1463188453.771367] [001.0] [pid=17917] run_scheduled_service_check() start
[1463188453.771373] [016.0] [pid=17917] Attempting to run scheduled check of service 'testservice' on host 'testhost': check options=0, latency=0.000000
[1463188453.771392] [001.0] [pid=17917] run_async_service_check()
[1463188453.771424] [001.0] [pid=17917] check_service_check_viability()
[1463188453.771431] [001.0] [pid=17917] check_time_against_period()
[1463188453.771470] [001.0] [pid=17917] _get_matching_timerange()
[1463188453.771483] [001.0] [pid=17917] check_service_dependencies()
[1463188453.771490] [016.1] [pid=17917] Unable to run scheduled service check at this time
[1463188453.771496] [001.0] [pid=17917] get_next_valid_time()
[1463188453.771507] [001.0] [pid=17917] _get_matching_timerange()
[1463188453.771524] [001.0] [pid=17917] _get_matching_timerange()
[1463188453.771549] [001.0] [pid=17917] _get_matching_timerange()
[1463188453.771566] [001.0] [pid=17917] _get_matching_timerange()
[1463188453.771588] [001.0] [pid=17917] _get_matching_timerange()
[1463188453.771599] [001.0] [pid=17917] check_time_against_period()
[1463188453.771610] [001.0] [pid=17917] _get_matching_timerange()
[1463188453.771629] [016.1] [pid=17917] Rescheduled next service check for Mon May 16 20:50:00 2016
[1463188453.771635] [001.0] [pid=17917] schedule_service_check()
[1463188453.771646] [016.0] [pid=17917] Scheduling a non-forced, active check of service 'testservice' on host 'testhost' @ Mon May 16 21:01:28 2016
[1463188453.771655] [001.0] [pid=17917] add_event()
[1463188453.771663] [064.1] [pid=17917] Making callbacks (type 13)...
[1463188453.771669] [064.1] [pid=17917] Making callbacks (type 13)...
[1463188453.771674] [001.0] [pid=17917] handle_timed_event() end

The servicecheck fails as expected as it isn't allowed to be run outside it's check_period. This time it actually got rescheduled to within the check_period in the first message, which
I assume is a result of check_interval, but then it gets a second reschedule that gets it wrong, to a time which is 1 minute and 28 seconds after the end of a valid check_period.
Interesting to note that nagios still gets the weekdays correct from the check_period.

Post by **tgriep** » Fri May 13, 2016 12:05 pm

Nagios 4 changed the way scheduling works so getting a check to run at a specific time is difficult to do.
It tries to spread them out as not to overload the server.

One option is to run the check with cron and passively submit the data to Nagios.
Another way is to set the check_interval less than 10 minutes. Then it should run during the time period but is may run twice during that time period.

Other than that, you can take a look at the service_inter_check_delay_method option in the nagios.cfg file and adjust that and see if it helps.
https://assets.nagios.com/downloads/nag ... gmain.html

pako · Post by **pako** » Mon May 16, 2016 2:03 am

Ok, that would explain the behavior. But I am still a bit confused because it seems that nagios violates it's own rules in this case.
It worked perfectly good in nagios 3 where a check was always scheduled to within it's check_period. Why is it acceptable for the
new scheduling code to schedule a check to outside a check_period? Is my check_period too tight, in that case what is the shortest
acceptable check_period? It seems to me that the check_period parameter is useless if it can't be used for anything else than 24x7.

Is there a summary somewhere (except the sources) of how the scheduling was changed in nagios 4?

pako · Post by **pako** » Mon May 16, 2016 6:15 am

A quick update:

I changed the service_inter_check_delay_method from smart to dumb and restarted nagios. The services were immediately
rescheduled to within their respective check_periods. If I change it back to smart and restart they are again scheduled to outside
their check_periods. So the smart-mode seem to be faulty in combination with a custom check_period (possibly it has to be short
in the order of tens of minutes).

Post by **tgriep** » Mon May 16, 2016 9:20 am

I think it is having issues because the smart method is trying to even out the checks and that the check_period is larger than the window causing the reschedule.

One more thing to do is try disabling the retention of non-status information and see if this is going to help.
Setting "retain_nonstatus_information = 0" will cause nagios to take the initial values from the configs, rather than from the state retention file when it restarts.
Let us know if this helps or not.

pako · Post by **pako** » Tue May 17, 2016 8:17 am

Yes, it seems that the smart-mode is the culprit. I have been running with the "dumb"-mode enabled for two days now with all checks working as expected.

But still one could argue that the smart-mode algorithm is broken if it only works in a 24x7 check_period. Should this be reported as a bug?

Post by **tgriep** » Tue May 17, 2016 1:34 pm

Can you set the check_interval to 9 minutes and re-enable Smart scheduling to see if it works like you are looking for?

Nagios Support Forum

Service rescheduled outside a valid check period

Service rescheduled outside a valid check period

Re: Service rescheduled outside a valid check period

Re: Service rescheduled outside a valid check period

Re: Service rescheduled outside a valid check period

Re: Service rescheduled outside a valid check period

Re: Service rescheduled outside a valid check period

Re: Service rescheduled outside a valid check period

Re: Service rescheduled outside a valid check period

Re: Service rescheduled outside a valid check period