they worked flawlessly in the older nagios but have now stopped working. Digging into this I have found that one reason seem to be that the services are scheduled to
run outside their valid check period. When nagios does it pre-servicecheck-checking it seems to disqualify that run and re-schedules it. Mostly it seem to be scheduled incorrectly again, but once in a while it hits within it's check-period and executes.
check_interval is set at 31 minutes, check periods are 10 minutes long to make sure that the check is run only once
Code: Select all
define service{
use generic-service
name generic-preflight ; The 'name' of this service template
notification_interval 1440
notification_options w,u,c
check_interval 31
retry_interval 31
max_check_attempts 1
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}Code: Select all
define timeperiod{
timeperiod_name testservice-timeperiod
monday 20:50-21:00
tuesday 20:50-21:00
wednesday 20:50-21:00
thursday 20:50-21:00
friday 20:50-21:00
}
Check is run normally and a notification sent out, then this follows:[1463100977.111773] [008.0] [pid=15327] ** Timed Event ** Type: EVENT_SERVICE_CHECK, Run Time: Thu May 12 20:56:17 2016
[1463100977.111785] [008.0] [pid=15327] ** Service Check Event ==> Host: 'testhost', Service: 'testservice', Options: 0, Latency: 0.000000 sec
[1463100977.111864] [001.0] [pid=15327] run_scheduled_service_check() start
I am guessing that the first reschedule line concerns the 31 minute check_interval for the service, which is then "corrected" after the check_period is checked.[1463100977.132755] [2048.1] [pid=15327] **** END MACRO PROCESSING *************
[1463100977.132794] [064.1] [pid=15327] Making callbacks (type 2)...
[1463100977.132812] [001.0] [pid=15327] macros_to_kvv()
[1463100977.132829] [064.1] [pid=15327] Making callbacks (type 21)...
[1463100977.132835] [064.1] [pid=15327] Making callbacks (type 20)...
[1463100977.132843] [001.0] [pid=15327] get_next_service_notification_time()
[1463100977.132865] [032.0] [pid=15327] 1 contacts were notified. Next possible notification time: Fri May 13 20:56:17 2016
[1463100977.132876] [032.0] [pid=15327] 1 contacts were notified.
[1463100977.132882] [064.1] [pid=15327] Making callbacks (type 5)...
[1463100977.132887] [064.1] [pid=15327] Making callbacks (type 13)...
[1463100977.132894] [001.0] [pid=15327] handle_service_event()
[1463100977.132901] [064.1] [pid=15327] Making callbacks (type 23)...
[1463100977.132932] [016.1] [pid=15327] Rescheduling next check of service at Thu May 12 21:27:17 2016
[1463100977.133449] [001.0] [pid=15327] get_next_valid_time()
[1463100977.133540] [001.0] [pid=15327] _get_matching_timerange()
[1463100977.133579] [001.0] [pid=15327] _get_matching_timerange()
[1463100977.133639] [001.0] [pid=15327] _get_matching_timerange()
[1463100977.133693] [001.0] [pid=15327] schedule_service_check()
[1463100977.133769] [016.0] [pid=15327] Scheduling a non-forced, active check of service 'testservice' on host 'testhost' @ Fri May 13 21:12:06 2016
[1463100977.133793] [001.0] [pid=15327] add_event()
But as can be seen from the output above the service is scheduled to run outside a valid check-period, this will not work and the service will be
rescheduled again. It seems to be random whether it will be scheduled to a valid point in time for execution or if it will have to do another reschedule-cycle.
Is this a known bug or has the configuration logic changed in Nagios 4? The default 24x7 check period works without any issues so this is isolated to our
once-a-day checks.