Services Immediately Moving To Hard Critical Status

NCATmax · Post by **NCATmax** » Wed Jan 16, 2019 1:31 pm

Hello,

Please bear with me, as my Nagios knowledge is limited. I would be happy to provide any additional information.

RHEL 7.6 (64-bit)
Manual installation of Nagios XI
We have enabled SSL for the web interface, and the databases run on a separate server (per the Nagios directions PDF)
Currently running Nagios XI 5.5.5

The problem that I am seeing is that services are immediately going to a hard critical status, before going through their maximum number of retries. This causes notifications to be sent out for these failed services before the host has reached a hard critical status.

I am focusing on one incident that occurred on Jan 13. The Nagios XI server lost connectivity to a group of servers (so host and service checks are both affected). I observed this behavior on hosts within that group. I should note that those servers are all configured similarly in Nagios.

I have attempted to configure host checks to go down (hard) before service checks. Host checks are done every 5 minutes, with a maximum retry attempts value of 3, with a retry interval of 1 minute. Service checks are done every 5 minutes, with a maximum retry attempts value of 5 with a retry interval of 1 minute.

I would expect that once a service goes down (soft), it will retry every minute until it has made 5 check attempts in total (the first failure plus 4 retries). During this window, I would expect a host check to occur and fail, marking the host as soft down. I would then expect the service to continue its retries until it reaches the maximum number of attempts, and only then turn to a hard critical state.
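To make the expected timeline concrete, here is a minimal Python sketch of the arithmetic, not Nagios code. It assumes that the maximum check attempts value counts the first failed check, and the 1 minute gap between the first service failure and the first host failure is a made-up value for illustration. The `failure_timeline` helper is hypothetical.

```python
def failure_timeline(first_failure_min, max_attempts, retry_interval_min):
    """Return (minute, attempt, state_type) tuples for a check that keeps failing."""
    events = []
    for attempt in range(1, max_attempts + 1):
        minute = first_failure_min + (attempt - 1) * retry_interval_min
        # Only the final attempt turns the state from SOFT to HARD.
        state_type = "HARD" if attempt == max_attempts else "SOFT"
        events.append((minute, attempt, state_type))
    return events


# Values from the post: service 5 attempts, host 3 attempts, 1 minute retries.
service = failure_timeline(0, max_attempts=5, retry_interval_min=1)
# Assumption: the host check first fails 1 minute after the service does.
host = failure_timeline(1, max_attempts=3, retry_interval_min=1)

for label, events in (("service", service), ("host", host)):
    for minute, attempt, state_type in events:
        print(f"t+{minute}m {label} attempt {attempt}: {state_type}")
```

Under these assumptions the host would reach its hard state at t+3m and the service at t+4m, which is the ordering I was hoping for.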

What I have observed is that after the host is marked as soft down, the next service check immediately goes to hard critical state, and notifications for the service checks are sent.

I have attached two images, one showing the state history for a particular host, and the second showing notifications that were sent from the host.

On the state history image, I have drawn some colored dots:

The blue dots are the host checks
The red dots are when notifications were sent
The green dots are services that immediately turned to hard critical status
Two green dots indicate that the service immediately went to hard critical status, but notifications were not sent.

nagiosxi_01-host_state_history.png

nagiosxi_02-host_notifications.png

I am trying to have hosts fail before services. What happened is that I got notifications that a lot of services went down before the host went down. I would like to avoid that.

Am I misunderstanding how this works?
Should I configure things differently?
Why did those two services with the two green dots not send notifications, unlike the other services that went to hard critical?

I apologize for writing so much. I greatly appreciate any insight into this situation, and again, I would be glad to provide any extra info.

npolovenko · Post by **npolovenko** » Wed Jan 16, 2019 5:11 pm

@NCATmax, your observation is correct. If a host goes down (even soft down), its services will automatically go hard critical on their next recheck. This was done to reduce resource consumption: if a host (server) is down, it doesn't make sense to keep checking its services. It's logical to assume that if the server is down, its CPU, memory, and everything else will not respond to Nagios checks either.

But the problem with notifications was resolved in Core 4.4.3:

Fixed notifications sending when services went into hard state on a down or unreachable host (#584)

Now, if the host is critical, its services will automatically go critical (same as before), BUT they will not send any notifications.
So you will only receive a host Hard Critical notification.
Let me know if all of this makes sense so far?
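
As a rough illustration of the rule described above, here is a toy Python model. It is a sketch of the behavior as stated in this post, not the actual Nagios Core logic; the function name and the `core_has_fix_584` flag are hypothetical, and recovery notifications are ignored.

```python
def service_check_result(check_ok, attempt, max_attempts, host_is_up, core_has_fix_584):
    """Return (state, state_type, notify) for one service check result."""
    if check_ok:
        # Recovery notifications are out of scope for this sketch.
        return ("OK", "HARD", False)
    if not host_is_up:
        # Host is down (soft or hard): the remaining service retries are skipped.
        # With the fix (#584), the service notification is suppressed.
        return ("CRITICAL", "HARD", not core_has_fix_584)
    if attempt >= max_attempts:
        # Host is up and all attempts are used: normal hard state plus notification.
        return ("CRITICAL", "HARD", True)
    return ("CRITICAL", "SOFT", False)


# Second failed attempt out of 5, with the host already marked down.
print(service_check_result(False, 2, 5, host_is_up=False, core_has_fix_584=False))
print(service_check_result(False, 2, 5, host_is_up=False, core_has_fix_584=True))
# Same attempt while the host is still up: the service stays soft.
print(service_check_result(False, 2, 5, host_is_up=True, core_has_fix_584=True))
```

The first call models what you saw (hard critical with a notification), and the second models the behavior after the fix (hard critical, no service notification).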

PS: The XI update 5.5.9 with integrated Core 4.4.3 is coming out later this week.

NCATmax · Post by **NCATmax** » Thu Jan 17, 2019 9:30 am

This makes complete sense. Glad to know that I wasn't too far off in my understanding.

I will be on the lookout for that update.

Small follow-up, do my host and service check attempts/intervals make sense from a monitoring standpoint? Should they be changed to better work with Nagios?

NCATmax wrote:I have attempted to configure host checks to go down (hard) before service checks. Host checks are done every 5 minutes, with a maximum retry attempts value of 3, with a retry interval of 1 minute. Service checks are done every 5 minutes, with a maximum retry attempts value of 5 with a retry interval of 1 minute.

npolovenko · Post by **npolovenko** » Thu Jan 17, 2019 4:42 pm

@NCATmax, Yes, your settings look good to me. For service checks I'd maybe use a max check attempts value of 3 with a retry interval of 2 minutes. Fewer checks mean less load on your XI server. This will be relevant when your monitoring system grows to 20 thousand services and hosts. Right now, not so much.
Btw, we released XI 5.5.9 today so you can upgrade.
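
For a quick comparison of the two service settings, here is a small Python sketch of the arithmetic. It assumes the host stays up so that every retry actually runs, and that max check attempts includes the first failed check; `retry_cost` is a hypothetical helper, not part of Nagios.

```python
def retry_cost(max_attempts, retry_interval_min):
    """Return (checks_run, minutes_to_hard) for a service that keeps failing."""
    checks_run = max_attempts
    # The first failure happens at minute 0; each later attempt waits one retry interval.
    minutes_to_hard = (max_attempts - 1) * retry_interval_min
    return checks_run, minutes_to_hard


current = retry_cost(max_attempts=5, retry_interval_min=1)
suggested = retry_cost(max_attempts=3, retry_interval_min=2)

print("current  :", current)
print("suggested:", suggested)
```

Under those assumptions both settings reach a hard state 4 minutes after the first failure, but the suggested one runs 3 checks per failing service instead of 5.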

Nagios Support Forum
