Page 1 of 1

Issue with service parents after upgrading to 4.4.0 / 4.4.1

Posted: Tue Jul 03, 2018 3:16 am
by intercorp
Hello,
after upgrading from 4.3.4 to 4.4.0 the service parents aren't working correct anymore. When a service enters an "unknown" or "error" state, rechecking always fails. When I force a recheck via webinterface, the time for "next scheduled check" raises, but "last check time" stays at the same value.

So I activated the debug log with following settings:

Code: Select all

debug_level=16
debug_verbosity=2
and rescheduled the check. These are the corresponding lines I found in nagios.debug file:

Code: Select all

[1530604260.156238] [016.0] [pid=1139] Scheduling a forced, active check of service 'CPU Load' on host 'backup' @ Tue Jul  3 09:50:58 2018
[1530604260.156268] [016.2] [pid=1139] Found another service check event for this service @ Tue Jul  3 09:51:34 2018
[1530604260.156279] [016.2] [pid=1139] New service check event is forced and occurs before the existing event, so the new event will be used instead.
[1530604260.156291] [016.2] [pid=1139] Scheduling new service check event.
[1530604260.156324] [016.0] [pid=1139] Attempting to run scheduled check of service 'CPU Load' on host 'backup': check options=1, latency=0.000015
[1530604260.156355] [016.2] [pid=1139] Execution parents for this service failed, so it will not be actively checked.
[1530604260.156364] [016.1] [pid=1139] Unable to run scheduled service check at this time
[1530604260.156376] [016.1] [pid=1139] Rescheduled next service check for Tue Jul  3 09:56:00 2018
[1530604260.156385] [016.0] [pid=1139] Scheduling a forced, active check of service 'CPU Load' on host 'backup' @ Tue Jul  3 09:56:00 2018
[1530604260.156392] [016.2] [pid=1139] Scheduling new service check event.
[1530604262.285685] [016.0] [pid=1139] Attempting to run scheduled check of service 'CPU Usage' on host 'backup': check options=0, latency=0.000000
[1530604262.285745] [016.2] [pid=1139] Execution parents for this service failed, so it will not be actively checked.
[1530604262.285791] [016.1] [pid=1139] Unable to run scheduled service check at this time
[1530604262.285802] [016.1] [pid=1139] Rescheduled next service check for Tue Jul  3 09:56:02 2018
[1530604262.285810] [016.0] [pid=1139] Scheduling a non-forced, active check of service 'CPU Usage' on host 'backup' @ Tue Jul  3 09:56:02 2018
[1530604262.285822] [016.2] [pid=1139] Scheduling new service check event.
So it seems, that the check isn't executed because the parent check is failing, but the parent service (PING) ist in state OK.

This is only one example, I've seen this behaviour on other checks with parents and on a second nagios instance which I also upgraded to version 4.4.0. The upgrade to 4.4.1 didn't solve the Problem. The check runs without error, when I remove the parents line from the service definition.

The service definition as an example:

Code: Select all

define service{
    use                     generic-service,graphed-service
    host_name               backup
    service_description     CPU Load
    parents                 PING
    check_command           check_nrpe!check_load
    servicegroups           system-health
}
kind regards
Stephan

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Posted: Tue Jul 03, 2018 12:01 pm
by bheden
First and foremost: thank you for the detailed description of the problem.

A lot of the code regarding scheduling/handling results was refactored in 4.4.0 - we "paid back a lot of technical debt" so to speak. I think I have an idea of where the issue lies (somewhere in that refactoring) and I'll be investigating shortly.

I'll post back with news.
99 little bugs in the code 99 little bugs in the code Take one down, patch it around 117 little bugs in the code

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Posted: Tue Jul 03, 2018 2:52 pm
by bheden
Can you set debug_level to 17 and send me the updated appropriate log snippet? Thanks.

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Posted: Mon Aug 13, 2018 5:39 am
by intercorp
I'm sorry for the late reply, I was on vacation...

So here is the log with debug_level 17:

Code: Select all

[1534156159.982525] [016.0] [pid=15751] Scheduling a forced, active check of service 'Cron' on host 'backup' @ Mon Aug 13 12:29:01 2018
[1534156159.982541] [016.2] [pid=15751] Found another service check event for this service @ Mon Aug 13 12:34:07 2018
[1534156159.982552] [016.2] [pid=15751] New service check event is forced, so it will be used instead of the existing event.
[1534156159.982573] [016.2] [pid=15751] Scheduling new service check event.
[1534156159.982587] [001.0] [pid=15751] add_event()
[1534156159.982618] [001.0] [pid=15751] handle_timed_event() start
[1534156159.982633] [001.0] [pid=15751] run_scheduled_service_check() start
[1534156159.982642] [016.0] [pid=15751] Attempting to run scheduled check of service 'Cron' on host 'backup': check options=1, latency=0.000031
[1534156159.982656] [001.0] [pid=15751] run_async_service_check()
[1534156159.982662] [001.0] [pid=15751] check_service_check_viability()
[1534156159.982669] [001.0] [pid=15751] check_service_parents()
[1534156159.982675] [016.2] [pid=15751] Execution parents for this service failed, so it will not be actively checked.
[1534156159.982687] [016.1] [pid=15751] Unable to run scheduled service check at this time
[1534156159.982694] [001.0] [pid=15751] get_next_valid_time()
[1534156159.982702] [001.0] [pid=15751] _get_matching_timerange()
[1534156159.982711] [016.1] [pid=15751] Rescheduled next service check for Mon Aug 13 12:34:19 2018
[1534156159.982717] [001.0] [pid=15751] schedule_service_check()
kind regards,
stephan

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Posted: Mon Aug 13, 2018 3:44 pm
by bheden
I believe this issue has since been resolved in the current working branch of Core. Can you compile the current maint branch and see if that resolves your issue?

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Posted: Fri Aug 17, 2018 3:26 am
by intercorp
I've upgraded to 4.4.2, but unfortunately it didn't solve the issue. nagios.debug says:

Code: Select all

[1534494019.593963] [016.0] [pid=27661] Scheduling a forced, active check of service 'Cron' on host 'backup' @ Fri Aug 17 10:20:17 2018
[1534494019.593971] [016.2] [pid=27661] Found another service check event for this service @ Fri Aug 17 10:20:36 2018
[1534494019.593978] [016.2] [pid=27661] New service check event is forced and occurs before the existing event, so the new event will be used instead.
[1534494019.593986] [016.2] [pid=27661] Scheduling new service check event.
[1534494019.593996] [001.0] [pid=27661] add_event()
[1534494019.594016] [001.0] [pid=27661] handle_timed_event() start
[1534494019.594025] [001.0] [pid=27661] run_scheduled_service_check() start
[1534494019.594032] [016.0] [pid=27661] Attempting to run scheduled check of service 'Cron' on host 'backup': check options=1, latency=0.000022
[1534494019.594041] [001.0] [pid=27661] run_async_service_check()
[1534494019.594047] [001.0] [pid=27661] check_service_check_viability()
[1534494019.594053] [001.0] [pid=27661] check_service_parents()
[1534494019.594060] [016.2] [pid=27661] Execution parents for this service failed, so it will not be actively checked.
[1534494019.594066] [016.1] [pid=27661] Unable to run scheduled service check at this time
[1534494019.594072] [001.0] [pid=27661] get_next_valid_time()
[1534494019.594080] [001.0] [pid=27661] _get_matching_timerange()
[1534494019.594088] [016.1] [pid=27661] Rescheduled next service check for Fri Aug 17 10:22:19 2018
[1534494019.594095] [001.0] [pid=27661] schedule_service_check()
This is the corresponding service definition:

Code: Select all

# Cron
define service{
    use                     generic-service
    host_name               backup
    service_description     Cron
    parents                 PING
    check_command           check_nrpe!check_cron
    servicegroups           daemons
}
after removing the parents directive everything works again

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Posted: Fri Aug 17, 2018 12:34 pm
by scottwilkerson
Would it be possible to look in the status.dat for the parent service and post the value for last_hard_state

It is possible that this didn't get updated when you were running a previous 4.4.x branch.

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Posted: Fri Aug 17, 2018 2:03 pm
by scottwilkerson
Also, does PING have a service parent assigned that is down?