No recovery alert; "OK;SOFT;1" state.
Posted: Wed Sep 18, 2019 1:02 pm
Hello Nagios Support,
This morning a critical server that we monitor triggered some alerts when it went down. We got the initial Host Down and one Service Problem alert, but when everything recovered we never got an important Service Recovery alert. (We have automation hooked into the alerts, and to make a long story short, it's important that we get ALL alerts.)
Looking through the event log, I see some strange behavior. To summarize succinctly:
1. We had some Service checks associated with the critical server unexpectedly get marked as "CRITICAL;HARD;1" - seemingly bypassing the max-check-attempt counter.
2. Upon recovery, all of these Services got marked as "OK;SOFT;1" -- bypassing any notification process.
3. Looking at the XI UI now, I see that these Services are set to HARD states. I don't see any event log entry where/when that took place.
The one Service that did alert had one expected "CRITICAL;SOFT;1" entry before it logged the abnormal "CRITICAL;HARD;1". It was set to alert after 2 max-attempts, so this makes some degree of sense that it sent a notification - but there is obviously still something wrong here.
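To make the pattern above easier to spot in a large log, here is a minimal sketch that scans for services whose recovery was logged as "OK;SOFT" immediately after a HARD problem state. It assumes the standard Nagios Core log line format (`[epoch] SERVICE ALERT: host;service;STATE;TYPE;attempt;output`); the host and service names in the sample and the function name `find_soft_recoveries` are hypothetical, for illustration only.

```python
import re

# Assumed format of a Nagios Core service alert log line:
# [epoch] SERVICE ALERT: host;service;STATE;STATE_TYPE;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);([A-Z]+);(SOFT|HARD);(\d+);(.*)$"
)


def find_soft_recoveries(lines):
    """Return (timestamp, host, service) for each OK;SOFT entry that
    directly follows a HARD non-OK entry for the same service."""
    last_seen = {}  # (host, service) -> (state, state_type)
    hits = []
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if not match:
            continue  # not a service alert line
        ts, host, service, state, state_type, _attempt, _output = match.groups()
        key = (host, service)
        previous = last_seen.get(key)
        if (
            state == "OK"
            and state_type == "SOFT"
            and previous is not None
            and previous[0] != "OK"
            and previous[1] == "HARD"
        ):
            hits.append((int(ts), host, service))
        last_seen[key] = (state, state_type)
    return hits


# Hypothetical sample entries, not taken from the actual event log.
sample = [
    "[1568800000] SERVICE ALERT: critsrv;HTTP;CRITICAL;HARD;1;Connection refused",
    "[1568800300] SERVICE ALERT: critsrv;HTTP;OK;SOFT;1;HTTP OK",
    "[1568800010] SERVICE ALERT: critsrv;PING;CRITICAL;SOFT;1;timeout",
    "[1568800070] SERVICE ALERT: critsrv;PING;CRITICAL;HARD;2;timeout",
    "[1568800400] SERVICE ALERT: critsrv;PING;OK;HARD;2;PING OK",
]

for ts, host, service in find_soft_recoveries(sample):
    print(ts, host, service)
```

To run it against a real log, pass the open file object in place of `sample`; the log path depends on the installation.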
Have you seen this problem before, and do you know of a fix?
I am running XI 5.5.11 on a CentOS 7.6 box.