Issue with service parents after upgrading to 4.4.0 / 4.4.1

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
Locked
intercorp
Posts: 3
Joined: Fri Jun 22, 2018 3:37 am

Issue with service parents after upgrading to 4.4.0 / 4.4.1

Post by intercorp »

Hello,
after upgrading from 4.3.4 to 4.4.0 the service parents aren't working correct anymore. When a service enters an "unknown" or "error" state, rechecking always fails. When I force a recheck via webinterface, the time for "next scheduled check" raises, but "last check time" stays at the same value.

So I activated the debug log with following settings:

Code: Select all

debug_level=16
debug_verbosity=2
and rescheduled the check. These are the corresponding lines I found in nagios.debug file:

Code: Select all

[1530604260.156238] [016.0] [pid=1139] Scheduling a forced, active check of service 'CPU Load' on host 'backup' @ Tue Jul  3 09:50:58 2018
[1530604260.156268] [016.2] [pid=1139] Found another service check event for this service @ Tue Jul  3 09:51:34 2018
[1530604260.156279] [016.2] [pid=1139] New service check event is forced and occurs before the existing event, so the new event will be used instead.
[1530604260.156291] [016.2] [pid=1139] Scheduling new service check event.
[1530604260.156324] [016.0] [pid=1139] Attempting to run scheduled check of service 'CPU Load' on host 'backup': check options=1, latency=0.000015
[1530604260.156355] [016.2] [pid=1139] Execution parents for this service failed, so it will not be actively checked.
[1530604260.156364] [016.1] [pid=1139] Unable to run scheduled service check at this time
[1530604260.156376] [016.1] [pid=1139] Rescheduled next service check for Tue Jul  3 09:56:00 2018
[1530604260.156385] [016.0] [pid=1139] Scheduling a forced, active check of service 'CPU Load' on host 'backup' @ Tue Jul  3 09:56:00 2018
[1530604260.156392] [016.2] [pid=1139] Scheduling new service check event.
[1530604262.285685] [016.0] [pid=1139] Attempting to run scheduled check of service 'CPU Usage' on host 'backup': check options=0, latency=0.000000
[1530604262.285745] [016.2] [pid=1139] Execution parents for this service failed, so it will not be actively checked.
[1530604262.285791] [016.1] [pid=1139] Unable to run scheduled service check at this time
[1530604262.285802] [016.1] [pid=1139] Rescheduled next service check for Tue Jul  3 09:56:02 2018
[1530604262.285810] [016.0] [pid=1139] Scheduling a non-forced, active check of service 'CPU Usage' on host 'backup' @ Tue Jul  3 09:56:02 2018
[1530604262.285822] [016.2] [pid=1139] Scheduling new service check event.
So it seems, that the check isn't executed because the parent check is failing, but the parent service (PING) ist in state OK.

This is only one example, I've seen this behaviour on other checks with parents and on a second nagios instance which I also upgraded to version 4.4.0. The upgrade to 4.4.1 didn't solve the Problem. The check runs without error, when I remove the parents line from the service definition.

The service definition as an example:

Code: Select all

define service{
    use                     generic-service,graphed-service
    host_name               backup
    service_description     CPU Load
    parents                 PING
    check_command           check_nrpe!check_load
    servicegroups           system-health
}
kind regards
Stephan
bheden
Product Development Manager
Posts: 179
Joined: Thu Feb 13, 2014 9:50 am
Location: Nagios Enterprises

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Post by bheden »

First and foremost: thank you for the detailed description of the problem.

A lot of the code regarding scheduling/handling results was refactored in 4.4.0 - we "paid back a lot of technical debt" so to speak. I think I have an idea of where the issue lies (somewhere in that refactoring) and I'll be investigating shortly.

I'll post back with news.
99 little bugs in the code 99 little bugs in the code Take one down, patch it around 117 little bugs in the code
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Nagios Enterprises
Senior Developer
bheden
Product Development Manager
Posts: 179
Joined: Thu Feb 13, 2014 9:50 am
Location: Nagios Enterprises

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Post by bheden »

Can you set debug_level to 17 and send me the updated appropriate log snippet? Thanks.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Nagios Enterprises
Senior Developer
intercorp
Posts: 3
Joined: Fri Jun 22, 2018 3:37 am

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Post by intercorp »

I'm sorry for the late reply, I was on vacation...

So here is the log with debug_level 17:

Code: Select all

[1534156159.982525] [016.0] [pid=15751] Scheduling a forced, active check of service 'Cron' on host 'backup' @ Mon Aug 13 12:29:01 2018
[1534156159.982541] [016.2] [pid=15751] Found another service check event for this service @ Mon Aug 13 12:34:07 2018
[1534156159.982552] [016.2] [pid=15751] New service check event is forced, so it will be used instead of the existing event.
[1534156159.982573] [016.2] [pid=15751] Scheduling new service check event.
[1534156159.982587] [001.0] [pid=15751] add_event()
[1534156159.982618] [001.0] [pid=15751] handle_timed_event() start
[1534156159.982633] [001.0] [pid=15751] run_scheduled_service_check() start
[1534156159.982642] [016.0] [pid=15751] Attempting to run scheduled check of service 'Cron' on host 'backup': check options=1, latency=0.000031
[1534156159.982656] [001.0] [pid=15751] run_async_service_check()
[1534156159.982662] [001.0] [pid=15751] check_service_check_viability()
[1534156159.982669] [001.0] [pid=15751] check_service_parents()
[1534156159.982675] [016.2] [pid=15751] Execution parents for this service failed, so it will not be actively checked.
[1534156159.982687] [016.1] [pid=15751] Unable to run scheduled service check at this time
[1534156159.982694] [001.0] [pid=15751] get_next_valid_time()
[1534156159.982702] [001.0] [pid=15751] _get_matching_timerange()
[1534156159.982711] [016.1] [pid=15751] Rescheduled next service check for Mon Aug 13 12:34:19 2018
[1534156159.982717] [001.0] [pid=15751] schedule_service_check()
kind regards,
stephan
bheden
Product Development Manager
Posts: 179
Joined: Thu Feb 13, 2014 9:50 am
Location: Nagios Enterprises

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Post by bheden »

I believe this issue has since been resolved in the current working branch of Core. Can you compile the current maint branch and see if that resolves your issue?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Nagios Enterprises
Senior Developer
intercorp
Posts: 3
Joined: Fri Jun 22, 2018 3:37 am

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Post by intercorp »

I've upgraded to 4.4.2, but unfortunately it didn't solve the issue. nagios.debug says:

Code: Select all

[1534494019.593963] [016.0] [pid=27661] Scheduling a forced, active check of service 'Cron' on host 'backup' @ Fri Aug 17 10:20:17 2018
[1534494019.593971] [016.2] [pid=27661] Found another service check event for this service @ Fri Aug 17 10:20:36 2018
[1534494019.593978] [016.2] [pid=27661] New service check event is forced and occurs before the existing event, so the new event will be used instead.
[1534494019.593986] [016.2] [pid=27661] Scheduling new service check event.
[1534494019.593996] [001.0] [pid=27661] add_event()
[1534494019.594016] [001.0] [pid=27661] handle_timed_event() start
[1534494019.594025] [001.0] [pid=27661] run_scheduled_service_check() start
[1534494019.594032] [016.0] [pid=27661] Attempting to run scheduled check of service 'Cron' on host 'backup': check options=1, latency=0.000022
[1534494019.594041] [001.0] [pid=27661] run_async_service_check()
[1534494019.594047] [001.0] [pid=27661] check_service_check_viability()
[1534494019.594053] [001.0] [pid=27661] check_service_parents()
[1534494019.594060] [016.2] [pid=27661] Execution parents for this service failed, so it will not be actively checked.
[1534494019.594066] [016.1] [pid=27661] Unable to run scheduled service check at this time
[1534494019.594072] [001.0] [pid=27661] get_next_valid_time()
[1534494019.594080] [001.0] [pid=27661] _get_matching_timerange()
[1534494019.594088] [016.1] [pid=27661] Rescheduled next service check for Fri Aug 17 10:22:19 2018
[1534494019.594095] [001.0] [pid=27661] schedule_service_check()
This is the corresponding service definition:

Code: Select all

# Cron
define service{
    use                     generic-service
    host_name               backup
    service_description     Cron
    parents                 PING
    check_command           check_nrpe!check_cron
    servicegroups           daemons
}
after removing the parents directive everything works again
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Post by scottwilkerson »

Would it be possible to look in the status.dat for the parent service and post the value for last_hard_state

It is possible that this didn't get updated when you were running a previous 4.4.x branch.
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Issue with service parents after upgrading to 4.4.0 / 4.

Post by scottwilkerson »

Also, does PING have a service parent assigned that is down?
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
Locked