Service rescheduled outside a valid check period
Posted: Thu May 12, 2016 9:59 am
We have recently moved from a nagios version 3 to version 4.0.8 (the version available in centos/epel 7). Some of the checks in our installation have to run once a day,
they worked flawlessly in the older nagios but have now stopped working. Digging into this I have found that one reason seem to be that the services are scheduled to
run outside their valid check period. When nagios does it pre-servicecheck-checking it seems to disqualify that run and re-schedules it. Mostly it seem to be scheduled incorrectly again, but once in a while it hits within it's check-period and executes.
check_interval is set at 31 minutes, check periods are 10 minutes long to make sure that the check is run only once
Here is some debug lines showing the behavior. The initial attempt actually got scheduled within the check-period:
But as can be seen from the output above the service is scheduled to run outside a valid check-period, this will not work and the service will be
rescheduled again. It seems to be random whether it will be scheduled to a valid point in time for execution or if it will have to do another reschedule-cycle.
Is this a known bug or has the configuration logic changed in Nagios 4? The default 24x7 check period works without any issues so this is isolated to our
once-a-day checks.
they worked flawlessly in the older nagios but have now stopped working. Digging into this I have found that one reason seem to be that the services are scheduled to
run outside their valid check period. When nagios does it pre-servicecheck-checking it seems to disqualify that run and re-schedules it. Mostly it seem to be scheduled incorrectly again, but once in a while it hits within it's check-period and executes.
check_interval is set at 31 minutes, check periods are 10 minutes long to make sure that the check is run only once
Code: Select all
define service{
use generic-service
name generic-preflight ; The 'name' of this service template
notification_interval 1440
notification_options w,u,c
check_interval 31
retry_interval 31
max_check_attempts 1
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}Code: Select all
define timeperiod{
timeperiod_name testservice-timeperiod
monday 20:50-21:00
tuesday 20:50-21:00
wednesday 20:50-21:00
thursday 20:50-21:00
friday 20:50-21:00
}
Check is run normally and a notification sent out, then this follows:[1463100977.111773] [008.0] [pid=15327] ** Timed Event ** Type: EVENT_SERVICE_CHECK, Run Time: Thu May 12 20:56:17 2016
[1463100977.111785] [008.0] [pid=15327] ** Service Check Event ==> Host: 'testhost', Service: 'testservice', Options: 0, Latency: 0.000000 sec
[1463100977.111864] [001.0] [pid=15327] run_scheduled_service_check() start
I am guessing that the first reschedule line concerns the 31 minute check_interval for the service, which is then "corrected" after the check_period is checked.[1463100977.132755] [2048.1] [pid=15327] **** END MACRO PROCESSING *************
[1463100977.132794] [064.1] [pid=15327] Making callbacks (type 2)...
[1463100977.132812] [001.0] [pid=15327] macros_to_kvv()
[1463100977.132829] [064.1] [pid=15327] Making callbacks (type 21)...
[1463100977.132835] [064.1] [pid=15327] Making callbacks (type 20)...
[1463100977.132843] [001.0] [pid=15327] get_next_service_notification_time()
[1463100977.132865] [032.0] [pid=15327] 1 contacts were notified. Next possible notification time: Fri May 13 20:56:17 2016
[1463100977.132876] [032.0] [pid=15327] 1 contacts were notified.
[1463100977.132882] [064.1] [pid=15327] Making callbacks (type 5)...
[1463100977.132887] [064.1] [pid=15327] Making callbacks (type 13)...
[1463100977.132894] [001.0] [pid=15327] handle_service_event()
[1463100977.132901] [064.1] [pid=15327] Making callbacks (type 23)...
[1463100977.132932] [016.1] [pid=15327] Rescheduling next check of service at Thu May 12 21:27:17 2016
[1463100977.133449] [001.0] [pid=15327] get_next_valid_time()
[1463100977.133540] [001.0] [pid=15327] _get_matching_timerange()
[1463100977.133579] [001.0] [pid=15327] _get_matching_timerange()
[1463100977.133639] [001.0] [pid=15327] _get_matching_timerange()
[1463100977.133693] [001.0] [pid=15327] schedule_service_check()
[1463100977.133769] [016.0] [pid=15327] Scheduling a non-forced, active check of service 'testservice' on host 'testhost' @ Fri May 13 21:12:06 2016
[1463100977.133793] [001.0] [pid=15327] add_event()
But as can be seen from the output above the service is scheduled to run outside a valid check-period, this will not work and the service will be
rescheduled again. It seems to be random whether it will be scheduled to a valid point in time for execution or if it will have to do another reschedule-cycle.
Is this a known bug or has the configuration logic changed in Nagios 4? The default 24x7 check period works without any issues so this is isolated to our
once-a-day checks.