Passive alerts after specified time period?

Posted: Thu Mar 05, 2015 10:21 am
by skynardo
I am experimenting with NRDS running passive checks (custom plugins) on our AIX lpars. I was wondering if there was a way to delay the actual alert from being triggered until the check reported it was above threshold for X number of checks or minutes? So if I am checking CPU % idle every 5 minutes and only want to alert if it is above threshold for 1 hour, is this possible to configure on the server side or do I have to handle that logic in the plugin?

Re: Passive alerts after specified time period?

Posted: Thu Mar 05, 2015 10:49 am
by jdalrymple
Absolutely, adjust these directives to fit your needs:

max_check_attempts: This directive is used to define the number of times that Nagios will retry the service check command if it returns any state other than an OK state. Setting this value to 1 will cause Nagios to generate an alert without retrying the service check again.

retry_interval: This directive is used to define the number of "time units" to wait before scheduling a re-check of the service. Services are rescheduled at the retry interval when they have changed to a non-OK state. Once the service has been retried max_check_attempts times without a change in its status, it will revert to being scheduled at its "normal" rate as defined by the check_interval value. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.

Re: Passive alerts after specified time period?

Posted: Thu Mar 05, 2015 12:31 pm
by skynardo
OK, I changed one of my passive service definitions to Check Interval 5, Retry Interval 5 (since this is the frequency of my nrds.pl cron job) and Max check attempts to 12.
On the first check (1/12), the service shows a WARNING state under Service Details and Operations Center in the Nagios XI UI. Now that I think about it, I'm wondering if this is working as designed: the service state is displayed, but it is a SOFT state, and notifications won't be sent until a HARD state is reached, i.e. 12/12 checks? Is there a way to create an Operator View that only displays HARD states (real alerts)?

Re: Passive alerts after specified time period?

Posted: Thu Mar 05, 2015 1:11 pm
by jdalrymple
Sorry skynardo - I missed a very key part of your initial post: these are passive checks. For those, the retry_interval won't have any effect, since the time between retries is determined solely by how often the passive check result is submitted by the client. Nonetheless, if you use your cron interval from the AIX machine together with the max_check_attempts directive, you should still be able to achieve what you want.
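To put numbers on that, here is a rough Python sketch of how the attempt counter behaves for a passive service. It is a simplified model and not Nagios Core code: it assumes every consecutive non-OK result counts as one attempt and that any OK result resets the counter. The function names are made up for illustration.

```python
OK = 0  # Nagios plugin return code for OK; 1=WARNING, 2=CRITICAL, 3=UNKNOWN


def result_that_goes_hard(results, max_check_attempts):
    """Return the 1-based position of the passive result that first produces
    a HARD problem state, or None if the state never goes HARD.

    Simplified model (an assumption, not Nagios Core logic): each consecutive
    non-OK result is one attempt, and any OK result resets the counter.
    """
    attempt = 0
    for position, code in enumerate(results, start=1):
        if code == OK:
            attempt = 0
            continue
        attempt += 1
        if attempt >= max_check_attempts:
            return position
    return None


def minutes_until_hard(max_check_attempts, submit_interval_minutes):
    """Minutes between the first non-OK result and the one that goes HARD."""
    return (max_check_attempts - 1) * submit_interval_minutes


if __name__ == "__main__":
    # Twelve WARNING results in a row with max_check_attempts = 12
    print(result_that_goes_hard([1] * 12, max_check_attempts=12))
    # Cron submits a result every 5 minutes
    print(minutes_until_hard(12, 5))
```

Under this model, 12 attempts at a 5 minute cron interval go HARD 55 minutes after the first non-OK result, so a full hour would need max_check_attempts of 13. A single OK result in between starts the count over.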

Regarding your other query ... are you seeking to have the UI "ignore" warning/critical states until the retry attempts have been exhausted? I don't think that's an easy feat with the product as is. The max_check_attempts directive wasn't intended to help operators ignore problems, but rather to keep the product from waking people up over potentially anomalous plugin results. Maybe the solution to that problem would be to lengthen the check_interval to give a broader sampling of your metrics?

Re: Passive alerts after specified time period?

Posted: Thu Mar 05, 2015 2:04 pm
by skynardo
Our current Event Mgmt solution only displays alerts to Operators that are considered actionable. If we have a single anomalous failed website check, for instance, we don't display an alert on the console unless it fails on retry. Likewise, if CPU spikes for 15 minutes we don't consider that a problem, so we don't show it to the operators. We will just need to teach our operators to change their mindset a bit and maybe pay attention to the Attempt column before acting.

Re: Passive alerts after specified time period?

Posted: Thu Mar 05, 2015 2:39 pm
by jdalrymple
I understand your thought process...

I can think of a way that the goal could be achieved using service dependencies and multiple services per "real" service. Combined with servicegroups, it could be doable even at scale...

That does seem a bit like a square peg/round hole type situation, I'd have to leave it to your better judgement to decide if that was the proper course of action over retraining staff. If you'd like us to model up a way that it could be done though, let us know and we'll try to create a configuration to demonstrate.

Re: Passive alerts after specified time period?

Posted: Thu Mar 05, 2015 4:38 pm
by skynardo
I think we will work the way things are designed and see how that works out. Since I am writing my own plugins I could easily withhold the alert on the client side until I was ready to send it, but will probably only resort to that if it becomes necessary. One more related question: how is availability reporting calculated? Does it only use HARD events to determine DOWN/UP times, or is each check used?
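For anyone who does end up handling this on the client side, the idea of withholding the alert in the plugin could be sketched roughly as below. This is only an illustration in Python, not the nrds.pl plugin from this thread: the helper names and the JSON state file are assumptions, and the plugin simply keeps reporting OK until the raw check has been non-OK for a required number of consecutive runs.

```python
import json
from pathlib import Path

OK = 0  # Nagios plugin return code for OK; 1=WARNING, 2=CRITICAL, 3=UNKNOWN


def hold_back(raw_status, streak, required):
    """Decide what to report for one plugin run.

    raw_status: return code the check produced on this run
    streak:     consecutive non-OK runs seen before this one
    required:   consecutive non-OK runs needed before reporting a problem
    Returns (status_to_report, new_streak).
    """
    if raw_status == OK:
        return OK, 0
    streak += 1
    reported = raw_status if streak >= required else OK
    return reported, streak


def load_streak(path):
    """Read the saved streak from a (hypothetical) JSON state file; 0 if missing or unreadable."""
    try:
        return int(json.loads(Path(path).read_text())["streak"])
    except (OSError, ValueError, KeyError, TypeError):
        return 0


def save_streak(path, streak):
    """Persist the streak so the next cron run can continue counting."""
    Path(path).write_text(json.dumps({"streak": streak}))
```

A plugin run from cron would call load_streak, pass the raw result through hold_back, call save_streak, and then exit with the reported status. The trade-off is that the UI shows OK during the hold-back window, so the early warning that the Attempt column gives on the server side is lost.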

Re: Passive alerts after specified time period?

Posted: Thu Mar 05, 2015 4:47 pm
by tmcdonald
Have you tried using "first_notification_delay"?
This directive is used to define the number of "time units" to wait before sending out the first problem notification when this host enters a non-UP state. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. If you set this value to 0, Nagios will start sending out notifications immediately.
Also works for services.

Re: Passive alerts after specified time period?

Posted: Thu Mar 05, 2015 4:59 pm
by jdalrymple
Contrary to what you might expect, states do have context in reporting availability. If your host or service transitions back out of its SOFT state into an OK state, availability is unaffected. Hopefully that's the behavior that suits you best, as that's probably not an easy thing to change: it's part of the Core code.

Speaking as a user, not an agent, the best part of Nagios is the ability to make the plugin do what you want, irrespective of Nagios' behavior. I am often hesitant to suggest customizing plugins to XI users, though, as many of them prefer (as they should) a cookie-cutter solution that does exactly what they want.

Re: Passive alerts after specified time period?

Posted: Fri Mar 06, 2015 11:38 am
by skynardo
Yes, that makes sense and is all good. Thanks for the help.