Unexpected HARD/SOFT state changes - no recovery alert.

Post by **yo_marc** » Sat Jan 18, 2020 10:59 am

Hello Nagios Support.

Overnight our XI instance sent out an alert for one of our high-priority services, but did not send a follow-up OK alert. We've got automation tied to these alerts, and the missing recovery alert caused problems. The alerts go out to about 4,500 people, so a lot of folks depend on a timely recovery message.

Here is the sanitized/simplified Event Log showing the Host and Service we're discussing here:

Service Recovery 2020-01-17 23:31:14 SERVICE ALERT: myhost.example.com;MyHost HTTP;OK;SOFT;1;HTTP OK: HTTP/1.1 200 OK - 13573 bytes in 0.038 second response time
Service Notification 2020-01-17 23:30:45 SERVICE NOTIFICATION: user1;myhost.example.com;MyHost HTTP;CRITICAL;xi_service_notification_handler;CRITICAL - Socket timeout
Host Recovery 2020-01-17 23:30:34 HOST ALERT: myhost.example.com;UP;SOFT;1;OK - myhost.example.com: rta 0.354ms, lost 0%
Service Critical 2020-01-17 23:29:45 SERVICE ALERT: myhost.example.com;MyHost HTTP;CRITICAL;HARD;1;CRITICAL - Socket timeout
Host Down 2020-01-17 23:29:45 HOST ALERT: myhost.example.com;DOWN;SOFT;1;CRITICAL - myhost.example.com: rta nan, lost 100%

Reading from the bottom up:
- The first Host Down SOFT 1 looks good. Host is set for 2 retries.
- The next Service Critical HARD 1 is unexpectedly a Hard state. The Service is set to retry 2 times, not 1.
- The Host recovers. Soft is expected here.
- Our alert goes out about a minute after the unexpected Service Critical entry.
- Service recovers to OK SOFT, so no recovery alert is sent.
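
For anyone who wants to check the timing of this sequence, here is a minimal Python sketch that parses the sanitized Event Log lines above and reports the seconds between consecutive events. The regular expression and the function names (`parse_events`, `gaps_in_seconds`) are illustrative assumptions based only on the line layout shown in this post, not a parser for the real nagios.log format.

```python
import re
from datetime import datetime

# The five sanitized Event Log lines from this post, newest first.
LOG = """\
Service Recovery 2020-01-17 23:31:14 SERVICE ALERT: myhost.example.com;MyHost HTTP;OK;SOFT;1;HTTP OK: HTTP/1.1 200 OK - 13573 bytes in 0.038 second response time
Service Notification 2020-01-17 23:30:45 SERVICE NOTIFICATION: user1;myhost.example.com;MyHost HTTP;CRITICAL;xi_service_notification_handler;CRITICAL - Socket timeout
Host Recovery 2020-01-17 23:30:34 HOST ALERT: myhost.example.com;UP;SOFT;1;OK - myhost.example.com: rta 0.354ms, lost 0%
Service Critical 2020-01-17 23:29:45 SERVICE ALERT: myhost.example.com;MyHost HTTP;CRITICAL;HARD;1;CRITICAL - Socket timeout
Host Down 2020-01-17 23:29:45 HOST ALERT: myhost.example.com;DOWN;SOFT;1;CRITICAL - myhost.example.com: rta nan, lost 100%
"""

# Assumed layout: "<label> <YYYY-MM-DD HH:MM:SS> <KIND>: <semicolon-separated body>"
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<kind>HOST ALERT|SERVICE ALERT|SERVICE NOTIFICATION): (?P<body>.*)"
)


def parse_events(text):
    """Return (timestamp, kind, body) tuples in chronological order."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, match.group("kind"), match.group("body")))
    # The log is newest-first, so reverse it; the stable sort then keeps
    # same-second entries in bottom-up order.
    events.reverse()
    events.sort(key=lambda event: event[0])
    return events


def gaps_in_seconds(events):
    """Seconds elapsed between each event and the one before it."""
    return [
        int((later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(events, events[1:])
    ]


events = parse_events(LOG)
for (ts, kind, body), gap in zip(events, [0] + gaps_in_seconds(events)):
    print(f"+{gap:>3}s  {ts:%H:%M:%S}  {kind}: {body[:60]}")
```

Running it lists the five entries oldest-first, each with the delay since the previous entry, which makes it easier to line the log up against the notification time.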

This looks similar to (if not the same as) a bug we saw months ago, which I believe was fixed in Nagios Core 4.4.4. Something similar seems to be happening again. I have a previous support forum post on that issue.

I've seen similar issues when an 'Apply Config' happens to run while a Host or Service is in a transitional SOFT state, but that did not occur here. (Admittedly, I haven't seen that in a while.)

Any help in resolving this would be greatly appreciated.

XI 5.6.7, running on Nagios Core 4.4.5.

Post by **mbellerue** » Mon Jan 20, 2020 3:10 pm

We could use some more information here. Would you be able to PM me a system profile? Or at least the Nagios logs for that day?

I think we're running into a race condition here, but I think you still should have received a recovery notification. If you can't send in the above, can you tell me if the service in question has a delay set on sending out the first notification?

Here's what we're seeing.
We get the alert that the host goes soft down. At this point, if a service check for that host executes and returns a problem, the service is immediately marked as hard critical. That's why the service check goes hard critical on the first try, and that is working as designed.

Then 49 seconds later, the host goes from soft down to soft up.

11 seconds after that, the event handler for the service kicks off, which checks to see if the host is up, which it is, and decides that it should send out a service critical notification. This is that race condition.

29 seconds after that, the service goes soft ok. This I think might be a bug. What I think is happening here is that Core sees that the service was forced hard critical without reaching its max retry count (HARD;1;CRITICAL). At that point Core says, "Oh, this service was marked as critical because of a dependency or parent/child relationship. I'm not sending out a notification for this." It just assumes a notification wasn't sent.

That's the current working theory. If you can get me more information, I will be happy to dig into it further.
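
To make the working theory easier to follow, here is a toy Python model of it. It is only a sketch of the behavior described in this post, under the assumptions stated in the comments; `ToyService` and `apply_check` are made-up names, and none of this is Nagios Core's actual code or logic.

```python
from dataclasses import dataclass


@dataclass
class ToyService:
    max_attempts: int = 2      # the thread says the service is set to 2 retries
    state: str = "OK"
    state_type: str = "HARD"
    attempt: int = 1


def apply_check(svc: ToyService, result: str, host_up: bool) -> bool:
    """Apply one check result; return True if this toy model would send
    a recovery notification."""
    was_problem = svc.state != "OK"
    was_hard = svc.state_type == "HARD"
    reached_max = svc.attempt >= svc.max_attempts

    if result != "OK":
        # Count attempts while the problem persists.
        svc.attempt = min(svc.attempt + 1, svc.max_attempts) if was_problem else 1
        svc.state = result
        # Theory, part 1: a problem result while the host is not UP is
        # forced straight to HARD, regardless of the attempt count.
        forced = not host_up
        svc.state_type = "HARD" if forced or svc.attempt >= svc.max_attempts else "SOFT"
        return False

    # Theory, part 2 (the suspected bug): a recovery from a HARD state that
    # never reached max_attempts is logged as SOFT and treated as if no
    # problem notification had gone out, so no recovery is sent.
    notify_recovery = was_problem and was_hard and reached_max
    svc.state_type = "HARD" if (notify_recovery or not was_problem) else "SOFT"
    svc.state = "OK"
    svc.attempt = 1
    return notify_recovery


# Sequence from the thread: critical result while the host is soft down,
# then an OK result after the host is back up.
svc = ToyService()
apply_check(svc, "CRITICAL", host_up=False)
forced_state = (svc.state, svc.state_type, svc.attempt)
recovery_sent = apply_check(svc, "OK", host_up=True)
print(forced_state, (svc.state, svc.state_type, svc.attempt), recovery_sent)
```

In this model the forced path ends in an OK/SOFT recovery with no notification, while a service that reaches HARD through its normal retries gets a HARD recovery and a notification. That is the difference the theory is pointing at.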

Post by **yo_marc** » Mon Jan 20, 2020 4:25 pm

Thanks! Your explanation makes sense, thanks for taking the time to walk through that.

I'll see what I can get PM'd over for system info / logs.

I 'can' say that there is no delay set for any of the hosts/services discussed here. Just checked to confirm.

Post by **mbellerue** » Mon Jan 20, 2020 4:33 pm

Awesome, we will keep this thread open and wait to hear back.

Post by **yo_marc** » Mon Jan 20, 2020 5:01 pm

Quick follow-up question:

At least as you see it now, do you think enabling "host_down_disable_service_checks" would potentially help in this situation?

Or perhaps adding more retries before alerting?

Post by **mbellerue** » Mon Jan 20, 2020 5:54 pm

No, I suspect host_down_disable_service_checks is already enabled, and that's why the service was forced to hard critical immediately.

And if that is the case, the additional retries wouldn't help, because the service is being forced to hard critical regardless of the number of retries.

Post by **yo_marc** » Tue Jan 21, 2020 8:56 am

host_down_disable_service_checks is actually 'not' set --- I was asking because it was a change I had planned to make this week.

Post by **mbellerue** » Tue Jan 21, 2020 3:13 pm

Ah, I see. This is a different setting than what I was thinking. My only concern with that option is that it might just shift the race condition, but it does get us away from a known race condition. I would say enable it, and keep an eye on notifications.
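
Before and after making that change, it may help to confirm what the main config file actually says. Below is a small Python sketch that looks up a `name=value` directive in nagios.cfg-style text. The helper name `read_main_cfg_option` and the sample text are illustrative, the file path in the comment is an assumption about a typical XI install, and returning the last occurrence when a directive appears more than once is also just an assumption of this sketch.

```python
from typing import Optional

# Stand-in for the contents of the main config file. On many XI installs
# this is /usr/local/nagios/etc/nagios.cfg (assumption; adjust as needed).
SAMPLE_CFG = """\
# sample excerpt, not a real config
enable_notifications=1
host_down_disable_service_checks=0
"""


def read_main_cfg_option(cfg_text: str, name: str) -> Optional[str]:
    """Return the value of a name=value directive, or None if it is absent."""
    value = None
    for raw_line in cfg_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == name:
            value = val.strip()  # keep the last occurrence (assumption)
    return value


setting = read_main_cfg_option(SAMPLE_CFG, "host_down_disable_service_checks")
print("host_down_disable_service_checks =", setting)
```

To run it against a live system, read the real file into a string (for example with `pathlib.Path(...).read_text()`) and pass that in place of `SAMPLE_CFG`. A result of `None` means the directive is not set in that file at all.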
