When host goes down, services go straight to hard critical

cedricroijakkers
Posts: 2
Joined: Fri Apr 05, 2019 6:37 am

When host goes down, services go straight to hard critical

Post by cedricroijakkers »

Hi All,

We've recently upgraded to Nagios XI 5.5.10 and noticed strange behaviour: services go straight to hard critical as soon as the first check fails. Looking through the forum, I found that this is indeed new behaviour (https://support.nagios.com/forum/viewto ... 16&t=52032). Since we upgraded straight from 5.3 to 5.5, and we use a downstream application behind Nagios to send alerts throughout the company, its behaviour has now changed.

Is it possible to configure Nagios to revert to the old behaviour, of not going straight to hard critical when the host is down, even when the host is still in soft critical?
swolf

Re: When host goes down, services go straight to hard critic

Post by swolf »

Hi @cedricroijakkers

This issue is fixed in Nagios Core 4.4.3, which is included with Nagios XI 5.5.11. If you want to return to the old behavior, your best option is to upgrade to this latest version.

Please let us know if you have any other questions or concerns.
cedricroijakkers

Re: When host goes down, services go straight to hard critic

Post by cedricroijakkers »

Hi,

So, I upgraded to XI 5.5.11 last week, and I still see the same behaviour.

I've created a dummy host, with a check script that I can toggle between OK/CRITICAL and a dummy service with the same script that I can independently toggle between OK/CRITICAL.
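
For anyone wanting to reproduce this, a toggleable dummy check could look roughly like the sketch below. It is not the script used in this test: the state-file approach, the `run_check` name, and the output text are assumptions. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
#!/usr/bin/env python3
"""Dummy Nagios-style check whose result is toggled through a state file."""
import sys
from pathlib import Path
from typing import Tuple

# Standard Nagios plugin exit codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def run_check(state_file: Path) -> Tuple[int, str]:
    """Return (exit_code, message) for the state named in state_file.

    A missing, unreadable, or invalid file is reported as UNKNOWN.
    """
    try:
        state = state_file.read_text().strip().upper()
    except OSError:
        state = "UNKNOWN"
    if state not in EXIT_CODES:
        state = "UNKNOWN"
    return EXIT_CODES[state], f"DUMMY {state} - state read from {state_file}"


# Only act as a plugin when a state file is given on the command line,
# e.g. a hypothetical /tmp/dummy_service.state for the service check.
if __name__ == "__main__" and len(sys.argv) > 1:
    code, message = run_check(Path(sys.argv[1]))
    print(message)
    sys.exit(code)
```

Writing `CRITICAL` or `OK` into the state file toggles the result. Using one file for the host check and another for the service check lets the two be switched independently.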

Both checks are set to retry 5 times before going to HARD.

I initialised both checks to OK, then switched the service script to CRITICAL. This results in 4 SOFT-CRITICAL states followed by 1 HARD-CRITICAL state, as expected. In the screenshot you can see these as the 5 attempts at 10:15:29, 10:19:56, 10:20:00, 10:20:03, and 10:20:07.

I then switched the service script to OK and ran it once; the service went to HARD-OK and stayed there. This happens in the screenshot at 10:20:18.

At this point, I switched the host script to CRITICAL and ran the check once. This resulted in the host going SOFT-DOWN, which you can see in the screenshot at 10:22:03. I then switched the service script to CRITICAL too, and ran it once. This resulted in the service going to HARD-CRITICAL with try 1 of 5, which you can see in the screenshot at 10:22:12.

Now, this is weird to me: it seems while the host is SOFT-DOWN, the service associated with that host goes to HARD-CRITICAL on the first check, while I have explicitly configured the service check to fail 5 times before going to HARD-CRITICAL. I can understand that the service check is not actually executed when the host is marked as DOWN (be it SOFT or HARD) and the service check being marked as CRITICAL, but I would expect the service check to pass over SOFT states (5 times) to finally end up in a HARD state, as is configured. Now, the service goes straight to HARD-CRITICAL, even though the check only failed once. This can be seen in the screenshot at 10:22:12, the state is HARD-CRITICAL with 1 try of 5.

I've repeated the procedure, set both host and service to UP and OK (10:23:03 and 10:23:58), then set the host to SOFT-DOWN (10:58:18) and the service check went immediately to HARD-CRITICAL again with attempt 1 of 5 (10:58:26).
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: When host goes down, services go straight to hard critic

Post by lmiltchev »

cedricroijakkers wrote: Is it possible to configure Nagios to revert to the old behaviour, of not going straight to hard critical when the host is down, even when the host is still in soft critical?
No, unfortunately this is not possible.
cedricroijakkers wrote: I can understand that the service check is not actually executed when the host is marked as DOWN (be it SOFT or HARD) and the service check being marked as CRITICAL, but I would expect the service check to pass over SOFT states (5 times) to finally end up in a HARD state, as is configured. Now, the service goes straight to HARD-CRITICAL, even though the check only failed once.
I cannot remember exactly when the behavior changed, but currently the service goes to a hard CRITICAL state right away if the host is DOWN or UNREACHABLE, as per our official Nagios documentation. In other words, this is the expected behavior.
Hard States

Hard states occur for hosts and services in the following situations:

When a host or service check results in a non-UP or non-OK state and it has been (re)checked the number of times specified by the max_check_attempts option in the host or service definition. This is a hard error state.
When a host or service transitions from one hard error state to another error state (e.g. WARNING to CRITICAL).
When a service check results in a non-OK state and its corresponding host is either DOWN or UNREACHABLE.
When a host or service recovers from a hard error state. This is considered to be a hard recovery.
When a passive host check is received. Passive host checks are treated as HARD unless the passive_host_checks_are_soft option is enabled.
https://assets.nagios.com/downloads/nag ... types.html
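
The hard-state rules quoted above can be sketched as a small state function. This is a simplified model of the documented rules, not Nagios Core's actual implementation: the names `ServiceState` and `next_state` are made up for illustration, and attempt counting is reduced to the minimum needed to replay the test described earlier in the thread.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    state: str = "OK"         # OK, WARNING, CRITICAL or UNKNOWN
    state_type: str = "HARD"  # SOFT or HARD
    attempt: int = 1


def next_state(prev: ServiceState, result: str, host_up: bool,
               max_attempts: int) -> ServiceState:
    """Apply one service check result to the previous state (simplified)."""
    if result == "OK":
        if prev.state == "OK":
            return ServiceState("OK", "HARD", 1)
        # Recovery from a hard error is a hard recovery;
        # recovery from a soft error is a soft recovery.
        return ServiceState("OK", prev.state_type, 1)

    if prev.state != "OK" and prev.state_type == "HARD":
        # Already in a hard error state: any further error stays HARD.
        return ServiceState(result, "HARD", prev.attempt)

    # First failure after OK is attempt 1; otherwise count the retry.
    attempt = 1 if prev.state == "OK" else prev.attempt + 1

    if not host_up:
        # Non-OK result while the host is DOWN or UNREACHABLE: HARD at once.
        return ServiceState(result, "HARD", attempt)

    # Host is up: HARD only once max_attempts checks have failed.
    state_type = "HARD" if attempt >= max_attempts else "SOFT"
    return ServiceState(result, state_type, min(attempt, max_attempts))


def replay(results, host_up, max_attempts, start=None):
    """Feed a list of results through next_state and collect each state."""
    current = start or ServiceState()
    history = []
    for result in results:
        current = next_state(current, result, host_up, max_attempts)
        history.append(current)
    return history
```

With `max_attempts=5`, five CRITICAL results while the host is up give four SOFT states and then a HARD one, whereas a single CRITICAL result while the host is down is HARD at attempt 1, which matches what was observed at 10:22:12.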