Service alerts are issued prematurely when host goes down

fasterfourier · Post by **fasterfourier** » Thu Sep 27, 2018 10:57 am

Looking at that closer, this seems to be the sequence of events:

-PING service goes into soft critical at T=0s
-Host goes into soft down at T=30s
-Service notifications go out at T=70s
-Service goes into hard critical at T=70s
-Host alerts go out at T=120s
-Host goes into hard down at T=120s

The host in this case has a normal check interval of 5min, a retry check interval of 1min, and a max_check_attempts of 2. The service has a normal check interval of 5min, a retry check interval of 1min, and a max_check_attempts of 3. This is according to the configuration display table in the web UI.

I'm having trouble understanding why the above is happening when the config indicates it should happen otherwise.
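The expectation implied by the settings above can be sketched as simple arithmetic. This is only a rough model, not how Nagios Core schedules checks internally: it assumes checks run exactly on time, that one interval unit is 60 seconds, and it ignores any interaction between the service and the state of its host. The function name `expected_hard_state_offset` is hypothetical.

```python
def expected_hard_state_offset(max_check_attempts: int, retry_interval_s: int) -> int:
    """Seconds from the first failed check to the check that reaches a HARD state.

    Assumption: attempt 1 is the first failure, and each later attempt
    runs exactly one retry interval after the previous one.
    """
    return (max_check_attempts - 1) * retry_interval_s


# Service settings from the post: max_check_attempts=3, retry interval 1 min.
expected_service_hard = expected_hard_state_offset(3, 60)
observed_service_hard = 70  # T=70s, as reported in the sequence of events

print(expected_service_hard)                          # 120
print(observed_service_hard < expected_service_hard)  # True
```

Under these assumptions the service would not reach hard critical until about T=120s, so a hard state at T=70s is earlier than the configuration suggests.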

scottwilkerson · Post by **scottwilkerson** » Thu Sep 27, 2018 11:10 am

I would need to see all the configurations to know for sure, but based on what you say, the service notifications should not be going out at 70s; they shouldn't go out until several minutes later.

fasterfourier · Post by **fasterfourier** » Thu Sep 27, 2018 12:50 pm

Are the configurations shown in the web GUI a comprehensive summary of all of the linked configurations? If not, what can I post here to clear this up? I can post the service and host config, along with any linked templates, if that helps.

scottwilkerson · Post by **scottwilkerson** » Thu Sep 27, 2018 1:30 pm

Looking again at your nagios.cfg, I noticed you are lacking the following two directives. While these are supposed to be enabled by default, can you add them, restart Nagios, and see if the problem persists?

Code: Select all

enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1

If the problem does persist, we would likely need all the configuration files from the system to attempt to re-create the issue.
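One way to check whether the two directives are actually set, rather than searching the file by eye, is a small script along these lines. This is a minimal sketch that assumes the main config consists of `key=value` lines with `#` comments; the helper names `read_directives` and `missing_directives` are hypothetical, and the sample text stands in for the contents of a real nagios.cfg.

```python
def read_directives(cfg_text: str) -> dict[str, str]:
    """Collect key=value pairs, skipping blank lines and '#' comments.

    Assumption: if a key appears more than once, the last value wins,
    which is good enough for checking single-valued directives.
    """
    directives = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            directives[key.strip()] = value.strip()
    return directives


def missing_directives(cfg_text: str, required: list[str]) -> list[str]:
    """Return the required directive names that are not set in the config text."""
    found = read_directives(cfg_text)
    return [name for name in required if name not in found]


REQUIRED = [
    "enable_predictive_host_dependency_checks",
    "enable_predictive_service_dependency_checks",
]

# Sample text only; in practice this would be read from the nagios.cfg file.
sample_cfg = """
# ENABLE PREDICTIVE HOST DEPENDENCY CHECKS
enable_predictive_host_dependency_checks=1
"""

print(missing_directives(sample_cfg, REQUIRED))
# ['enable_predictive_service_dependency_checks']
```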

fasterfourier · Post by **fasterfourier** » Fri Sep 28, 2018 9:42 am

Scott, it does look like I have those two options enabled in my nagios.cfg:

Code: Select all

# ENABLE PREDICTIVE HOST DEPENDENCY CHECKS
# This option determines whether or not Nagios will attempt to execute
# checks of hosts when it predicts that future dependency logic test
# may be needed.  These predictive checks can help ensure that your
# host dependency logic works well.
# Values:
#  0 = Disable predictive checks
#  1 = Enable predictive checks (default)

enable_predictive_host_dependency_checks=1



# ENABLE PREDICTIVE SERVICE DEPENDENCY CHECKS
# This option determines whether or not Nagios will attempt to execute
# checks of service when it predicts that future dependency logic test
# may be needed.  These predictive checks can help ensure that your
# service dependency logic works well.
# Values:
#  0 = Disable predictive checks
#  1 = Enable predictive checks (default)

enable_predictive_service_dependency_checks=1

Is there a way I can get my entire config over to you privately?

scottwilkerson · Post by **scottwilkerson** » Fri Sep 28, 2018 9:52 am

You had already posted it here
https://support.nagios.com/forum/viewto ... 84#p263025

I'm sorry; when I was searching it yesterday, I somehow didn't see the entries.

What I don't understand is why your services are going into hard critical at T=70s if you do in fact have max_check_attempts set to 3, because you should see three checks at 1-minute intervals before the notifications go out.

fasterfourier · Post by **fasterfourier** » Fri Sep 28, 2018 10:14 am

I don't get it either. I went back in the logs to before I did the 4.4.2 upgrade, and everything happens as expected: 3 checks at 1-minute intervals before the notification goes out:

Code: Select all

[08-30-2018 13:47:25] SERVICE ALERT: sbh_annap_t1;PING;OK;HARD;3;PING OK - Packet loss = 0%, RTA = 25.93 ms
[08-30-2018 13:42:31] SERVICE ALERT: sbh_annap_t1;PING;CRITICAL;HARD;3;PING CRITICAL - Packet loss = 100%
[08-30-2018 13:41:31] SERVICE ALERT: sbh_annap_t1;PING;CRITICAL;SOFT;2;PING CRITICAL - Packet loss = 100%
[08-30-2018 13:40:31] SERVICE ALERT: sbh_annap_t1;PING;CRITICAL;SOFT;1;PING CRITICAL - Packet loss = 100%
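The timing in these entries can be checked mechanically. The sketch below assumes each entry has the form `[MM-DD-YYYY HH:MM:SS] SERVICE ALERT: host;service;state;state_type;attempt;output`, which matches the lines above; the function names are hypothetical, and any text before the timestamp is ignored.

```python
import re
from datetime import datetime

ALERT_RE = re.compile(
    r"\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] SERVICE ALERT: (.*)"
)

LOG_LINES = [
    "[08-30-2018 13:47:25] SERVICE ALERT: sbh_annap_t1;PING;OK;HARD;3;PING OK - Packet loss = 0%, RTA = 25.93 ms",
    "[08-30-2018 13:42:31] SERVICE ALERT: sbh_annap_t1;PING;CRITICAL;HARD;3;PING CRITICAL - Packet loss = 100%",
    "[08-30-2018 13:41:31] SERVICE ALERT: sbh_annap_t1;PING;CRITICAL;SOFT;2;PING CRITICAL - Packet loss = 100%",
    "[08-30-2018 13:40:31] SERVICE ALERT: sbh_annap_t1;PING;CRITICAL;SOFT;1;PING CRITICAL - Packet loss = 100%",
]


def parse_alert(line):
    """Parse one SERVICE ALERT entry into a dict, or return None if it does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    when = datetime.strptime(match.group(1), "%m-%d-%Y %H:%M:%S")
    # Split on the first five semicolons only, so the plugin output stays intact.
    host, service, state, state_type, attempt, output = match.group(2).split(";", 5)
    return {
        "when": when, "host": host, "service": service, "state": state,
        "state_type": state_type, "attempt": int(attempt), "output": output,
    }


def seconds_to_hard_critical(lines):
    """Seconds between the first CRITICAL attempt and the HARD CRITICAL entry."""
    alerts = sorted(
        (a for a in map(parse_alert, lines) if a is not None),
        key=lambda a: a["when"],
    )
    first = next(a for a in alerts if a["state"] == "CRITICAL" and a["attempt"] == 1)
    hard = next(a for a in alerts if a["state"] == "CRITICAL" and a["state_type"] == "HARD")
    return (hard["when"] - first["when"]).total_seconds()


print(seconds_to_hard_critical(LOG_LINES))  # 120.0
```

For the pre-upgrade entries this gives 120 seconds from the first soft critical to the hard critical, compared with the 70 seconds reported after the upgrade.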

scottwilkerson · Post by **scottwilkerson** » Fri Sep 28, 2018 2:10 pm

After going through this, I have confirmed this is a bug in Core and have filed a bug report on GitHub:
https://github.com/NagiosEnterprises/na ... issues/584

fasterfourier · Post by **fasterfourier** » Fri Sep 28, 2018 2:36 pm

Thank you for thoroughly investigating this, Scott. Do you have any info on when the bug was introduced (so I can roll back to an unaffected version) or whether there is a workaround?

scottwilkerson · Post by **scottwilkerson** » Fri Sep 28, 2018 2:40 pm

fasterfourier wrote:Thank you for thoroughly investigating this, Scott. Do you have any info on when the bug was introduced (so I can roll back to an unaffected version) or whether there is a workaround?

My best guess would be at or after 4.4.0.

I know 4.3.4 was extremely stable, and would be a good target to go to.
