Service Dependency behaviour
Posted: Sun Jun 30, 2013 7:16 pm
by Fred Kroeger
I've created a Service dependency that works as expected if all the dependent services have a normal status.
e.g. If Service A (check_nrpe-version) goes critical because the agent has stopped, then no alerts are generated for Services B to Z if they were previously normal.
However, if Service B already has a Hard Warning status, then when the agent is stopped it immediately goes to a hard Critical (connection refused).
I would have expected it to go through the normal retries before becoming hard. Instead, an alert is generated for Service B before Service A goes critical and suppresses alerts for the other services.
How can I get a service to go through the normal retries on a transition from Warning to Critical?
regards... Fred
Re: Service Dependency behaviour
Posted: Mon Jul 01, 2013 2:18 pm
by abrist
Just to clarify:
A service in a HARD WARNING state will move to a HARD CRITICAL state without retries when the remote agent is stopped, is that correct?
Re: Service Dependency behaviour
Posted: Wed Jul 03, 2013 2:55 am
by Fred Kroeger
That is correct.
When the agent is stopped, Nagios displays a connection refused message with a Critical state.
If the service previously had a hard Warning, it moves to a hard Critical immediately.
regards... Fred
Re: Service Dependency behaviour
Posted: Wed Jul 03, 2013 1:22 pm
by slansing
The way this logic works is: if you have a service in a soft Warning state at, let's say, 3/5 retries and it then goes into a Critical state, it will continue on the same path of retries, with 2 left, and then switch to a hard Critical state. The only way this counter resets is if the service returns to an OK state.
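The soft/hard logic described above can be sketched as a tiny state machine. This is a simplified Python model written for this thread, not Nagios Core's actual source. Assumptions: `max_attempts` stands in for the service's `max_check_attempts` setting, a notification fires only on a hard state change, and re-notification intervals, flap detection and dependencies are ignored. `ServiceState` and `process_result` are hypothetical names. It reproduces both cases discussed: a soft Warning at 3/5 that turns Critical, and a hard Warning that goes straight to hard Critical.

```python
from dataclasses import dataclass
from typing import Tuple

OK, WARNING, CRITICAL = "OK", "WARNING", "CRITICAL"


@dataclass
class ServiceState:
    status: str = OK
    state_type: str = "HARD"  # "SOFT" or "HARD"
    attempt: int = 1          # current check attempt


def process_result(s: ServiceState, result: str, max_attempts: int) -> Tuple[ServiceState, bool]:
    """Apply one check result; return the new state and whether a notification fires.

    Simplified model (assumption): notifications only on hard state changes,
    re-notification intervals ignored.
    """
    if result == OK:
        # Any return to OK resets the counter; only a hard problem gets a recovery notice.
        return ServiceState(OK, "HARD", 1), s.status != OK and s.state_type == "HARD"

    if s.status != OK and s.state_type == "HARD":
        # Already in a hard problem state: a Warning<->Critical change stays hard
        # and notifies at once, with no new round of retries.
        return ServiceState(result, "HARD", s.attempt), result != s.status

    # OK -> problem starts a new count; a soft problem keeps counting even if
    # the status flips between Warning and Critical.
    attempt = 1 if s.status == OK else s.attempt + 1
    if attempt >= max_attempts:
        return ServiceState(result, "HARD", attempt), True
    return ServiceState(result, "SOFT", attempt), False


if __name__ == "__main__":
    # Soft Warning at attempt 3/5, then the check starts returning Critical.
    s = ServiceState(WARNING, "SOFT", 3)
    for _ in range(2):
        s, notify = process_result(s, CRITICAL, max_attempts=5)
        print(s, "notify" if notify else "no notification")

    # Hard Warning, then the agent stops and the check returns Critical.
    s, notify = process_result(ServiceState(WARNING, "HARD", 5), CRITICAL, max_attempts=5)
    print(s, "notify" if notify else "no notification")
```

Run as a script, the first two lines show the count moving to 4/5 (soft, no notification) and then 5/5 (hard, notification); the last line shows the hard Warning becoming a hard Critical with an immediate notification, which is the behaviour reported at the start of the thread.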
Re: Service Dependency behaviour
Posted: Tue Jul 09, 2013 2:38 am
by Fred Kroeger
If you're saying that the retry counter only resets when the status returns to normal, then a change to Critical will send an immediate notification, and a change back to Warning will send another immediate notification.
Surely you would want the retry counter to reset on every status change after it has gone hard, not just on a return to normal? Otherwise we're generating a lot of notifications (email, SMS, etc.).
Fred
Re: Service Dependency behaviour
Posted: Tue Jul 09, 2013 9:37 am
by sreinhardt
I can see your point about potential false positives and, in a few cases, some unexpected behavior. The alternative, however, is that notifications for a potential issue could be extremely delayed, and that would be far worse. Imagine a case where memory on a machine is fluctuating between warning and critical states: if the counter were reset on each state change, you could either never receive an alert, because the counter keeps getting reset, or receive very delayed notifications once it finally stays in one state long enough to alert.
Re: Service Dependency behaviour
Posted: Wed Jul 17, 2013 9:02 pm
by Fred Kroeger
Sorry, I've only just got back to following this one up.
That's not quite what I was thinking. I would expect any state change to start the retry counter, which would continue until the state goes hard.
Any further status change would then restart the counter.
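To make the trade-off between this proposal and the objection to it concrete, here is a self-contained Python sketch that replays a sequence of check results under both policies and counts notifications. `reset_on_change=False` approximates the current behaviour described earlier in the thread (the counter only resets on OK); `reset_on_change=True` is the hypothetical toggle that restarts the counter on any status change. The `simulate` function and the test sequences are illustrative assumptions, not Nagios code, and re-notification intervals are ignored.

```python
from typing import List, Optional, Tuple

OK = "OK"


def simulate(results: List[str], max_attempts: int,
             reset_on_change: bool) -> Tuple[int, Optional[int]]:
    """Replay check results; return (number of notifications, index of the first one).

    reset_on_change=False: counter only resets on a return to OK (current behaviour,
    simplified). reset_on_change=True: hypothetical toggle that restarts the counter
    on every status change, even after the state has gone hard.
    """
    status, hard, attempt = OK, True, 1
    count, first = 0, None
    for i, result in enumerate(results):
        notify = False
        if result == OK:
            notify = hard and status != OK      # hard recovery notifies
            status, hard, attempt = OK, True, 1
        elif result == status:
            if not hard:                        # same problem, still soft: one more attempt
                attempt += 1
                hard = notify = attempt >= max_attempts
        else:                                   # status changed: OK->problem or WARNING<->CRITICAL
            if status == OK or reset_on_change:
                attempt = 1                     # start counting from scratch
                hard = notify = attempt >= max_attempts
            elif hard:
                notify = True                   # hard-to-hard change notifies immediately
            else:
                attempt += 1                    # soft: keep counting across the change
                hard = notify = attempt >= max_attempts
            status = result
        if notify:
            count += 1
            if first is None:
                first = i
    return count, first


if __name__ == "__main__":
    flapping = [OK] + ["WARNING", "CRITICAL"] * 10   # memory bouncing between states
    steady = [OK] + ["CRITICAL"] * 5
    for reset in (False, True):
        print(f"reset_on_change={reset}:",
              "flapping", simulate(flapping, 3, reset),
              "steady", simulate(steady, 3, reset))
```

With three attempts and a check that alternates Warning/Critical, the current logic goes hard on the third problem check and then notifies on every subsequent flip (the notification volume Fred is worried about), while the reset-on-change toggle never goes hard and sends nothing at all (the delay sreinhardt is worried about). A steady Critical notifies once, at the same check, under either policy.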
Re: Service Dependency behaviour
Posted: Thu Jul 18, 2013 10:36 am
by sreinhardt
You could submit this as a core feature request, specifically a toggle to reset the counter on any state change. However, I think most people would agree that any possibility of resetting the counters could lead to a large lag in notifications.