Unexpected HARD/SOFT state changes - no recovery alert.
Posted: Sat Jan 18, 2020 10:59 am
Hello Nagios Support.
Overnight our XI instance sent out an alert for a high-priority service of ours, but did not send out a follow-up OK alert. We have automation tied to these alerts, and the missing recovery alert caused problems. The alerts go out to about 4,500 people, so there are a lot of folks depending on a timely recovery message.
Here is the sanitized/simplified Event Log showing the Host and Service we're discussing here:
Reading from the bottom up:
Service Recovery 2020-01-17 23:31:14 SERVICE ALERT: myhost.example.com;MyHost HTTP;OK;SOFT;1;HTTP OK: HTTP/1.1 200 OK - 13573 bytes in 0.038 second response time
Service Notification 2020-01-17 23:30:45 SERVICE NOTIFICATION: user1;myhost.example.com;MyHost HTTP;CRITICAL;xi_service_notification_handler;CRITICAL - Socket timeout
Host Recovery 2020-01-17 23:30:34 HOST ALERT: myhost.example.com;UP;SOFT;1;OK - myhost.example.com: rta 0.354ms, lost 0%
Service Critical 2020-01-17 23:29:45 SERVICE ALERT: myhost.example.com;MyHost HTTP;CRITICAL;HARD;1;CRITICAL - Socket timeout
Host Down 2020-01-17 23:29:45 HOST ALERT: myhost.example.com;DOWN;SOFT;1;CRITICAL - myhost.example.com: rta nan, lost 100%
- The first Host Down SOFT 1 looks good. Host is set for 2 retries.
- The next Service Critical HARD 1 is unexpectedly a Hard state. The Service is set to retry 2 times, not 1.
- The Host recovers. Soft is expected here.
- Our alert goes out about a minute after the unexpected Service Critical entry.
- Service recovers to OK SOFT, so no recovery alert is sent.
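For anyone who wants to scan their own event log for the same pattern, here is a rough Python sketch that parses the HOST ALERT / SERVICE ALERT lines shown above and flags any problem state that went HARD before reaching the configured number of attempts. The function names and the `max_attempts=2` value are assumptions made for illustration (based on the "2 retries" setting described above), not anything provided by Nagios itself.

```python
import re
from typing import List, NamedTuple, Optional


class Alert(NamedTuple):
    kind: str               # "HOST" or "SERVICE"
    host: str
    service: Optional[str]  # None for host alerts
    state: str              # e.g. UP, DOWN, OK, CRITICAL
    state_type: str         # SOFT or HARD
    attempt: int
    output: str


# Matches the "HOST ALERT: ..." / "SERVICE ALERT: ..." part of a log line.
# NOTIFICATION lines are deliberately not matched.
ALERT_RE = re.compile(r"(HOST|SERVICE) ALERT: (.*)$")


def parse_alert(line: str) -> Optional[Alert]:
    """Parse one event log line; return None if it is not an alert entry."""
    m = ALERT_RE.search(line)
    if m is None:
        return None
    kind, payload = m.groups()
    if kind == "HOST":
        host, state, state_type, attempt, output = payload.split(";", 4)
        return Alert(kind, host, None, state, state_type, int(attempt), output)
    host, service, state, state_type, attempt, output = payload.split(";", 5)
    return Alert(kind, host, service, state, state_type, int(attempt), output)


def find_premature_hard(lines: List[str], max_attempts: int) -> List[Alert]:
    """Return problem alerts that were HARD before max_attempts was reached."""
    flagged = []
    for line in lines:
        alert = parse_alert(line)
        if alert is None:
            continue
        is_problem = alert.state not in ("OK", "UP")
        if is_problem and alert.state_type == "HARD" and alert.attempt < max_attempts:
            flagged.append(alert)
    return flagged


# The sanitized log from this post, newest entry first.
LOG_LINES = [
    "Service Recovery 2020-01-17 23:31:14 SERVICE ALERT: myhost.example.com;MyHost HTTP;OK;SOFT;1;HTTP OK: HTTP/1.1 200 OK - 13573 bytes in 0.038 second response time",
    "Service Notification 2020-01-17 23:30:45 SERVICE NOTIFICATION: user1;myhost.example.com;MyHost HTTP;CRITICAL;xi_service_notification_handler;CRITICAL - Socket timeout",
    "Host Recovery 2020-01-17 23:30:34 HOST ALERT: myhost.example.com;UP;SOFT;1;OK - myhost.example.com: rta 0.354ms, lost 0%",
    "Service Critical 2020-01-17 23:29:45 SERVICE ALERT: myhost.example.com;MyHost HTTP;CRITICAL;HARD;1;CRITICAL - Socket timeout",
    "Host Down 2020-01-17 23:29:45 HOST ALERT: myhost.example.com;DOWN;SOFT;1;CRITICAL - myhost.example.com: rta nan, lost 100%",
]

# max_attempts=2 is an assumed value taken from the retry setting described above.
flagged = find_premature_hard(LOG_LINES, max_attempts=2)
for alert in flagged:
    print(f"{alert.host} / {alert.service}: {alert.state} went HARD at attempt {alert.attempt}")
```

Run against the log above, this should flag only the Service Critical entry at 23:29:45, which is the one that looks wrong to us.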
This looks similar to (if not the same as) a bug we saw months ago, which I believe was fixed in Nagios Core 4.4.4. Something similar seems to be happening again. I have a previous support forum post on that issue.
I've seen similar issues when an 'Apply Config' happens to be run while a Host or Service is in a transitional SOFT state, but that did not occur here. (Admittedly, I haven't seen that in a while.)
Any help in resolving this would be greatly appreciated.
XI 5.6.7, running on Nagios Core 4.4.5.