OK (return to normal) state delay

Posted: Fri Aug 14, 2020 2:35 pm
by Maxwellb99
Hi Nagios,

Use case: we have a bunch of alerts that are going critical or unknown due to timeouts. (I'll start a separate thread for that.) The problem is that they send out a recovery alert after getting a single OK state, then flip back to critical or unknown. This is causing way too many alerts.

Question:
- Is there a way to set a required threshold for OK alerts (i.e., is there a soft OK state)?

Note:
- We'd prefer not to use flapping.

Thanks,
Max

Re: OK (return to normal) state delay

Posted: Mon Aug 17, 2020 11:28 am
by ssax
Unfortunately, flapping is really the only thing you can do for this outside of increasing your max_check_attempts:

https://assets.nagios.com/downloads/nag ... pping.html
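For anyone weighing up flapping, here is a rough Python sketch of the idea behind flap detection: count how often the state changed across the recent check history and compare that percentage against a high and a low threshold. This is a simplified model, not the Nagios implementation. Real Nagios also weights recent state changes more heavily than older ones, which is left out here. The function names and the threshold values (20 and 30) are made up for illustration.

```python
from typing import Sequence


def percent_state_change(history: Sequence[str]) -> float:
    """Unweighted percentage of check-to-check transitions that changed state."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return 100.0 * changes / (len(history) - 1)


def update_flapping(is_flapping: bool, pct: float, low: float, high: float) -> bool:
    """Hysteresis: start flapping at or above `high`, stop only below `low`."""
    if not is_flapping and pct >= high:
        return True
    if is_flapping and pct < low:
        return False
    return is_flapping


if __name__ == "__main__":
    # The pattern from the first post: a few failures, one OK, repeat.
    history = ["CRITICAL"] * 3 + ["OK"] + ["CRITICAL"] * 3 + ["OK"]
    pct = percent_state_change(history)
    # Illustrative thresholds only, not Nagios defaults.
    print(round(pct, 1), update_flapping(False, pct, low=20.0, high=30.0))
```

While a service is considered flapping, its problem and recovery notifications are suppressed, which is why this is the usual suggestion for the "single OK then back to critical" pattern.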

There are SOFT RECOVERY states but they only occur if you are in a SOFT PROBLEM state and then go back to an OK/UP. If the status is in a HARD PROBLEM state and an OK/UP is received it will always try to send the notification unless you have flapping set up to stop that from occurring.
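To make the SOFT/HARD behaviour above concrete, here is a small Python sketch of the state logic as described in this post. It is a toy model under stated assumptions, not Nagios code: the class and function names are hypothetical, all non-OK results are treated alike, and re-notifications, flapping, and host state are ignored.

```python
from typing import List, Optional, Sequence


class ServiceModel:
    """Toy model of SOFT/HARD service states (not the Nagios implementation)."""

    def __init__(self, max_check_attempts: int = 3) -> None:
        self.max_check_attempts = max_check_attempts
        self.state = "OK"
        self.hard = True
        self.attempt = 1

    def process(self, result: str) -> Optional[str]:
        """Feed one check result; return the notification sent, if any."""
        if result != "OK":
            if self.state == "OK":
                # First failure after OK: SOFT unless only one attempt is allowed.
                self.attempt = 1
                self.hard = self.max_check_attempts <= 1
                self.state = result
                return "PROBLEM" if self.hard else None
            self.state = result
            if self.hard:
                return None  # already in a HARD problem state
            self.attempt += 1
            if self.attempt >= self.max_check_attempts:
                self.hard = True  # retries exhausted: SOFT becomes HARD
                return "PROBLEM"
            return None
        if self.state == "OK":
            return None
        # OK after a problem: notify only if the problem had gone HARD.
        was_hard = self.hard
        self.state, self.hard, self.attempt = "OK", True, 1
        return "RECOVERY" if was_hard else None


def notifications(results: Sequence[str], max_check_attempts: int = 3) -> List[Optional[str]]:
    svc = ServiceModel(max_check_attempts)
    return [svc.process(r) for r in results]


if __name__ == "__main__":
    pattern = (["CRITICAL"] * 3 + ["OK"]) * 2
    print(notifications(pattern, max_check_attempts=3))
    print(notifications(pattern, max_check_attempts=5))
```

In this model, with max_check_attempts at 3 the pattern produces a PROBLEM and a RECOVERY on every cycle, while with 5 the problem never reaches a HARD state and nothing is sent. The trade-off is that genuine outages also take longer to notify.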

Do the hosts show as down for them, or is it only the services that are showing an issue? If the hosts show as down, you can increase your service check_interval to be higher than the host check interval and set host_down_disable_service_checks=1 in your /usr/local/nagios/etc/nagios.cfg (and restart the nagios service). That way, if the host is down, it won't even try to run the service checks. Also, make sure you're selecting the parents on the hosts so that the reachability logic works:

https://assets.nagios.com/downloads/nag ... ility.html

Let us know if you have any questions.

Thank you!

Re: OK (return to normal) state delay

Posted: Tue Aug 18, 2020 2:16 pm
by Maxwellb99
Hi Nagios,

Thanks for your response. I've got "host_down_disable_service_checks=1" enabled. Unfortunately, the hosts are still pingable. I'll open up another thread, but the two cases we've found are: 1. it goes unknown when the port 5693 connection closes (still troubleshooting this); 2. Nagios doesn't get a response in a timely manner. Alright, I'll try to sell flapping to my management.

Thanks, I think you can close this thread.

Cheers,
Max

Re: OK (return to normal) state delay

Posted: Tue Aug 18, 2020 2:29 pm
by scottwilkerson
Ok

Closing thread