When host goes down, services go straight to hard critical

Posted: Fri Apr 05, 2019 9:34 am
by cedricroijakkers
Hi All,

We've recently upgraded to Nagios XI 5.5.10, and noticed strange behaviour: services go straight to hard critical even when only the first check has failed. Looking through the forum, I've found that this is indeed new behaviour (https://support.nagios.com/forum/viewto ... 16&t=52032). Since we upgraded straight from 5.3 to 5.5, and we use a downstream application behind Nagios to alert throughout the company, our alerting behaviour has now changed.

Is it possible to configure Nagios to revert to the old behaviour, of not going straight to hard critical when the host is down, even when the host is still in soft critical?

Re: When host goes down, services go straight to hard critic

Posted: Fri Apr 05, 2019 9:53 am
by swolf
Hi @cedricroijakkers

This issue is fixed in Nagios Core 4.4.3, which is included with Nagios XI 5.5.11. If you want to return to the old behavior, your best option is to upgrade to this latest version.

Please let us know if you have any other questions or concerns.

Re: When host goes down, services go straight to hard critic

Posted: Wed Apr 17, 2019 4:10 am
by cedricroijakkers
Hi,

So, I upgraded to XI 5.5.11 last week, and I still see the same behaviour.

I've created a dummy host, with a check script that I can toggle between OK/CRITICAL and a dummy service with the same script that I can independently toggle between OK/CRITICAL.

Both checks are set to retry 5 times before going to HARD.

I've initialised both checks to OK, then switched the service script to CRITICAL. This results in 4 SOFT-CRITICAL states, and then 1 HARD-CRITICAL state, as expected. In the screenshot you can see these as the 5 attempts at 10:15:29, 10:19:56, 10:20:00, 10:20:03, and 10:20:07.

I then switched the service script to OK and ran it once, after which the service went to HARD-OK and stayed there. This happens in the screenshot at 10:20:18.

At this point, I switched the host script to CRITICAL and ran the check once. This resulted in the host going SOFT-DOWN, which you can see in the screenshot at 10:22:03. I then switched the service script to CRITICAL too, and ran it once. This resulted in the service going to HARD-CRITICAL with try 1 of 5, which you can see in the screenshot at 10:22:12.

Now, this is weird to me: it seems that while the host is SOFT-DOWN, the service associated with that host goes to HARD-CRITICAL on the first check, even though I have explicitly configured the service check to fail 5 times before going to HARD-CRITICAL. I can understand that the service check is not actually executed when the host is marked as DOWN (be it SOFT or HARD) and that the service is marked as CRITICAL instead, but I would expect the service to pass through the SOFT states (5 attempts) and finally end up in a HARD state, as configured. Now, the service goes straight to HARD-CRITICAL, even though the check only failed once. This can be seen in the screenshot at 10:22:12: the state is HARD-CRITICAL with 1 try of 5.

I've repeated the procedure, set both host and service to UP and OK (10:23:03 and 10:23:58), then set the host to SOFT-DOWN (10:58:18) and the service check went immediately to HARD-CRITICAL again with attempt 1 of 5 (10:58:26).

Re: When host goes down, services go straight to hard critic

Posted: Wed Apr 17, 2019 11:33 am
by lmiltchev
cedricroijakkers wrote:
"Is it possible to configure Nagios to revert to the old behaviour, of not going straight to hard critical when the host is down, even when the host is still in soft critical?"

No, unfortunately this is not possible.

cedricroijakkers wrote:
"I can understand that the service check is not actually executed when the host is marked as DOWN (be it SOFT or HARD) and that the service is marked as CRITICAL instead, but I would expect the service to pass through the SOFT states (5 attempts) and finally end up in a HARD state, as configured. Now, the service goes straight to HARD-CRITICAL, even though the check only failed once."

I cannot remember exactly when the behavior changed, but currently the service goes to a CRITICAL state right away if the host is DOWN or UNREACHABLE, as per our official Nagios documentation. So, in other words, this is the expected behavior.

Hard States

Hard states occur for hosts and services in the following situations:

When a host or service check results in a non-UP or non-OK state and it has been (re)checked the number of times specified by the max_check_attempts option in the host or service definition. This is a hard error state.
When a host or service transitions from one hard error state to another error state (e.g. WARNING to CRITICAL).
When a service check results in a non-OK state and its corresponding host is either DOWN or UNREACHABLE.
When a host or service recovers from a hard error state. This is considered to be a hard recovery.
When a passive host check is received. Passive host checks are treated as HARD unless the passive_host_checks_are_soft option is enabled.
https://assets.nagios.com/downloads/nag ... types.html
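
For illustration, the hard-state rules quoted above can be modelled in a few lines of Python. This is only a toy sketch and not Nagios Core's actual code: the `Service` class, its `process_result` method, and the way the attempt counter is tracked are all hypothetical names and simplifications. It also assumes that any host state of DOWN or UNREACHABLE counts, whether SOFT or HARD, which matches what was observed in this thread, and that a recovery from a soft error is a soft recovery. Passive host checks are left out.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Toy model of a service's state; NOT Nagios Core's real implementation."""
    max_check_attempts: int = 5
    state: str = "OK"
    state_type: str = "HARD"
    attempt: int = 1

    def process_result(self, result: str, host_state: str = "UP"):
        """Apply one check result and return (state, state_type, attempt)."""
        was_hard_problem = self.state != "OK" and self.state_type == "HARD"

        if result == "OK":
            # Recovering from a hard error (or staying OK) is HARD;
            # recovering from a soft error is assumed to be a soft recovery.
            if self.state == "OK" or was_hard_problem:
                self.state_type = "HARD"
            else:
                self.state_type = "SOFT"
            self.attempt = 1
        else:
            if self.state == "OK":
                self.attempt = 1          # first failure after an OK state
            elif self.state_type == "SOFT":
                self.attempt += 1         # another retry of a soft problem
            # A hard problem keeps its attempt counter unchanged.

            host_problem = host_state in ("DOWN", "UNREACHABLE")
            if (host_problem                     # host is DOWN or UNREACHABLE
                    or was_hard_problem          # hard error to another error
                    or self.attempt >= self.max_check_attempts):
                self.state_type = "HARD"
            else:
                self.state_type = "SOFT"

        self.state = result
        return self.state, self.state_type, self.attempt


if __name__ == "__main__":
    svc = Service(max_check_attempts=5)
    # Host UP: the service walks through the retries before going HARD.
    for _ in range(5):
        print(svc.process_result("CRITICAL", host_state="UP"))
    print(svc.process_result("OK", host_state="UP"))
    # Host DOWN: the very first failed check is already HARD.
    print(svc.process_result("CRITICAL", host_state="DOWN"))
```

Under these assumptions the model replays the test from the earlier post: with the host UP, the service needs all five attempts to become HARD, while with the host DOWN it is HARD-CRITICAL on attempt 1 of 5.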