Nagios Support Forum

Posted: **Tue May 19, 2015 8:15 am**

So I'm using Nagios Core 4.0.8, and I have this situation: a service enters a failed (CRITICAL) state, we know about it, we acknowledge the issue, and it stops bugging us.

The problem arises when the service check returns an OK status for one single check, but goes back to critical on the next service check run.

When defining the service check we use the "max_check_attempts" option to prevent notifications from firing as soon as the check fails (sometimes a single check will fail, only to return to OK status right after).

The question is: is it possible to implement the same logic for the OK status? It seems that as soon as the service enters OK status (i.e. one single check returns OK) we get a hard state, and a notification is fired.

This is especially annoying since we lose the acknowledgment and have to go back to the service info page and acknowledge the alert again.

It seems reasonable to me that, if we wait for 3, 5 or whatever number of failed checks before determining a HARD state, we should do the same for an OK status. But judging by the way our alerts behave, that is not the case.

Is there something I'm missing?

Thanks in advance

Posted: **Tue May 19, 2015 10:27 am**

There is nothing you're missing - first OK is always HARD.
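To make the asymmetry concrete, here is a minimal Python sketch of the soft/hard logic as I understand it from the State Types documentation. It is not the actual Nagios Core code: the names `ServiceState` and `apply_check` are made up for illustration, and all non-OK results are treated alike.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    status: str = "OK"        # "OK" or a non-OK status such as "CRITICAL"
    state_type: str = "HARD"  # "SOFT" or "HARD"
    attempt: int = 1


def apply_check(state, result, max_check_attempts):
    """Return the new state after one check result (simplified model)."""
    if result == "OK":
        if state.status == "OK":
            return ServiceState("OK", "HARD", 1)
        # A single OK ends the problem: hard recovery after a HARD problem,
        # soft recovery if the problem never got past SOFT.
        return ServiceState("OK", state.state_type, 1)

    # Non-OK result: retries only happen on the way *into* a problem.
    if state.status == "OK":
        attempt = 1
    elif state.state_type == "HARD":
        return ServiceState(result, "HARD", max_check_attempts)
    else:
        attempt = state.attempt + 1
    state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
    return ServiceState(result, state_type, attempt)


state = ServiceState()
for result in ["CRITICAL", "CRITICAL", "CRITICAL", "OK", "CRITICAL"]:
    state = apply_check(state, result, max_check_attempts=3)
    print(result, "->", state.status, state.state_type, state.attempt)
```

With `max_check_attempts=3`, the problem needs three failed checks to become HARD, but the single OK in fourth position is already a HARD recovery, and the following CRITICAL starts over at SOFT attempt 1.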

You might be able to use first_notification_delay to your advantage though?

Object Definitions wrote: first_notification_delay: This directive is used to define the number of "time units" to wait before sending out the first problem notification when this service enters a non-OK state. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. If you set this value to 0, Nagios will start sending out notifications immediately.
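As a quick illustration of the "time units" arithmetic in that quote, the delay in seconds is just the directive value multiplied by interval_length. The helper name below is hypothetical; only the two directive names come from the documentation.

```python
def first_notification_delay_seconds(first_notification_delay, interval_length=60):
    """Convert the directive's "time units" into seconds.

    interval_length defaults to 60, so one time unit is one minute
    unless that directive has been changed.
    """
    return first_notification_delay * interval_length


# A value of 5 with the default interval_length means a 300-second delay.
print(first_notification_delay_seconds(5))
```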

Posted: **Wed May 20, 2015 7:05 am**

You could always disable notifications and use event handlers to process the events. Then you can determine if you are going to notify or not based on whether this is the first OK received or not. It complicates things, but Nagios assumes that when a service recovers, it recovers. If it goes bad again within a short period of time, flap detection code is activated to avoid alerting for something that is flapping between good and bad over and over.
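A rough sketch of the decision such an event handler could make is below, in Python. It assumes the handler is passed the $SERVICESTATE$ and $SERVICESTATETYPE$ macros and keeps the time of the last recovery somewhere between runs. The function name `decide`, the action labels and the `grace_seconds` value are all hypothetical, and how the notification is actually sent is left out.

```python
def decide(state, state_type, last_recovery_ts, now, grace_seconds=600):
    """Return (action, new_last_recovery_ts) for one event handler call.

    state / state_type mirror $SERVICESTATE$ and $SERVICESTATETYPE$.
    last_recovery_ts is the time of the previous hard recovery, or None.
    """
    if state_type != "HARD":
        # Soft states are retries; nothing to decide yet.
        return "ignore", last_recovery_ts
    if state == "OK":
        # First OK after a problem: remember it, but hold the recovery notice.
        return "hold_recovery", now
    if last_recovery_ts is not None and now - last_recovery_ts < grace_seconds:
        # The problem came back shortly after an OK: same incident, stay quiet.
        return "suppress", None
    return "notify_problem", None


print(decide("CRITICAL", "HARD", None, 0))
print(decide("OK", "HARD", None, 100))
print(decide("CRITICAL", "HARD", 100, 160))
```

The idea is that a problem which returns within the grace period after a recovery is treated as a continuation of the acknowledged incident rather than a new one.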

Posted: **Wed May 20, 2015 12:53 pm**

Yes - eloyd's solution is actually better in your use case than first_notification_delay, now that I think about it. You're still eventually going to get notified. As a matter of fact, it may be the only solution.

Let us know if we can help you set that up.

Posted: **Thu Jun 11, 2015 10:50 am**

Thanks for your replies. I had suspected that a solution with event handlers would be the only option.

It does however, imho, break the whole usage of nagios. Or better said: disable standard nagios behaviour to use a custom one.

I acknowledge that Nagios thinks the first OK is always a HARD state. However, it sounds kinda unwise to me to force such behaviour. Why can an error state be considered SOFT while a recovery cannot? Either consider all state changes hard, or allow them to be soft in both directions.

I believe there are lots of use cases where such behaviour would be desired, cases in which we are forced to build complex workarounds using event handlers. Not nice.

I wonder whether building such an option would be too hard. I guess it has to be done inside the core?

Posted: **Thu Jun 11, 2015 11:16 am**

My suggestion: post a feature request on tracker.nagios.org.

It's not a feature I've ever heard of a request for. As a generalization it sounds like flapping is something that happens a lot in your environment. We have flapping detection built in that works pretty well, and does indeed squelch other notifications:

http://nagios.sourceforge.net/docs/nagioscore/4/en/flapping.html wrote:

Flap Handling

When a service or host is first detected as flapping, Nagios will:

- Log a message indicating that the service or host is flapping.
- Add a non-persistent comment to the host or service indicating that it is flapping.
- Send a "flapping start" notification for the host or service to appropriate contacts.
- Suppress other notifications for the service or host (this is one of the filters in the notification logic).

When a service or host stops flapping, Nagios will:

- Log a message indicating that the service or host has stopped flapping.
- Delete the comment that was originally added to the service or host when it started flapping.
- Send a "flapping stop" notification for the host or service to appropriate contacts.
- Remove the block on notifications for the service or host (notifications will still be bound to the normal notification logic).

Perhaps you don't see it as such, but I think enough other people in the world do that it's just never become an issue that I've seen before.

Posted: **Thu Jun 11, 2015 11:27 am**

Flapping would in fact be the perfect description for our issue here.

How do you define the limits for flapping detection? How does Nagios actually "detect" flapping? I will have a look at the docs before bugging you all with more questions.

Posted: **Thu Jun 11, 2015 12:34 pm**

Thanks for the suggestions once again. I have already created an issue/feature request on GitHub (https://github.com/NagiosEnterprises/na ... /issues/46) because even the flapping detection doesn't exactly solve my situation.

As I commented on GitHub, the main reason flapping detection is not useful in this case is that acknowledgements and non-persistent comments will disappear once the service flaps. IMHO flapping detection is a workaround for the problem I described, and simply applying the same logic to NON-OK -> OK transitions as already exists for OK -> NON-OK transitions would do the trick.

Cheers

Posted: **Thu Jun 11, 2015 1:23 pm**

You would set the low and high flap thresholds.

Here is the documentation on flapping:

http://nagios.sourceforge.net/docs/3_0/flapping.html
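For a feel of how the two thresholds interact, here is a simplified Python sketch. It assumes a history of the last 21 check results (20 possible state changes), as the flapping documentation describes, but it counts every change equally, whereas Nagios weights recent changes more heavily. The function names and the 5.0/20.0 threshold values are illustrative only.

```python
def percent_state_change(history):
    """Unweighted percentage of state changes across the stored results."""
    changes = sum(1 for a, b in zip(history, history[1:]) if a != b)
    return 100.0 * changes / (len(history) - 1)


def update_flapping(is_flapping, pct, low, high):
    """Hysteresis: start above the high threshold, stop below the low one."""
    if is_flapping:
        return pct >= low
    return pct > high


# 21 results with a single OK blip in the middle of a CRITICAL run:
history = ["CRITICAL"] * 10 + ["OK"] + ["CRITICAL"] * 10
pct = percent_state_change(history)
print(pct, update_flapping(False, pct, low=5.0, high=20.0))
```

In this simplified model a single OK blip counts as two state changes out of 20, i.e. 10%, so with a high threshold of 20% it would not be treated as flapping; the high threshold would need to be lower than that for such a blip to count.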

Posted: **Thu Jun 11, 2015 1:27 pm**

Fair enough, thanks for the feature request post. I think with enough tuning of the flapping thresholds, you could get close to the behavior you want. If a problem gets acknowledged, and the proper flapping thresholds are set, the ack will not be removed until the service has stopped flapping for a generous threshold (which is what you want to happen). If the ack never got removed, and the service stabilized for a long time only to go down later, you would not be notified due to the ack.

Checks before notification for OK status
