
Host Escalation Question

Posted: Tue Apr 14, 2015 12:35 pm
by jkinning
I am trying to get some clarity on escalations. I have a host template which contains our District Office MPLS devices. I am just doing an ICMP check: $USER1$/check_ping -H $HOSTADDRESS$ -w 1000.0,10% -c 2000.0,20% -p 3

I have the check settings set to a 5 minute check interval with a 1 minute retry interval and a max check attempts of 5. If I want to set up a host escalation that sends a notification to a different group after 15 minutes, would I create the new host escalation with a first notification value of 3 and a last notification value of 3?

I currently have the regular contact group notification going to our network support group. In 15 minutes I need to send alerts to our service desk so they can contact our vendor to investigate.

Is that the correct method or is there a better way of doing what I am trying to accomplish?

Re: Host Escalation Question

Posted: Tue Apr 14, 2015 2:08 pm
by lmiltchev
The "first_notification" directive is a number that identifies the first notification for which this escalation is effective. For instance, if you set this value to 3, this escalation will only be used if the host is down or unreachable long enough for a third notification to go out.
When the 3rd (non-escalated) notification will go out would depend on the notification interval, which defines the number of "time units" to wait before re-notifying a contact that this host is still down or unreachable.

The "last_notification" directive is a number that identifies the last notification for which this escalation is effective. For instance, if you set this value to 3, this escalation will not be used if more than three notifications are sent out for the host.
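To make the two directives concrete, here is a minimal Python sketch of how a notification number could be matched against escalation ranges. It is a simplified model, not Nagios code: the `Escalation` class, the `groups_for` function, and the group names `network-support` and `service-desk` are all hypothetical. It assumes that a matching escalation replaces the default contact groups for that notification, and that a `last_notification` of 0 means the escalation has no upper limit.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    """Hypothetical stand-in for a host escalation definition."""
    first_notification: int
    last_notification: int  # 0 is treated as "no upper limit"
    contact_groups: tuple


def groups_for(notification_number, default_groups, escalations):
    """Return the contact groups that would receive a given notification.

    Assumption: if any escalation range covers the notification number,
    only the escalation's groups are notified; otherwise the host's
    default groups are used.
    """
    matched = []
    for esc in escalations:
        below_upper = (esc.last_notification == 0
                       or notification_number <= esc.last_notification)
        if notification_number >= esc.first_notification and below_upper:
            for group in esc.contact_groups:
                if group not in matched:
                    matched.append(group)
    return matched or list(default_groups)


if __name__ == "__main__":
    default = ["network-support"]              # hypothetical group name
    escalations = [Escalation(3, 3, ("service-desk",))]
    for n in range(1, 5):
        print(n, groups_for(n, default, escalations))
```

With first and last both set to 3, only the third notification reaches the escalation group in this model; notifications 1, 2 and 4 go to the default group.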

Hope this helps.

Re: Host Escalation Question

Posted: Thu Apr 16, 2015 8:40 am
by jkinning
I currently have a 60 minute notification interval. So would that value need to be reduced to 5 minutes, or could I set it to 15 minutes and then set the escalation "first_notification" to 1?

Re: Host Escalation Question

Posted: Thu Apr 16, 2015 1:44 pm
by lmiltchev
I would set the "notification_interval" in the host config equal to 15 minutes and the "first_notification" in the host escalation equal to 1. This way, you will receive the first notification (not escalated) immediately after the host goes into a hard non-OK state (if the first notification delay = 0), and in 15 minutes, you will get the second notification (escalated). This is what you are trying to accomplish, right?

Re: Host Escalation Question

Posted: Fri Apr 17, 2015 1:27 pm
by jkinning
Yes, I will just take that advice and change the notification interval to 15 minutes. Then I will just set both groups who want to be notified, and they will receive the notification. What is the best way to prevent spamming them? I am currently using a 5 minute check interval with a 1 minute retry and max check attempts of 5. If I set the notification interval to 15 minutes, when the host goes critical will it send out the notification after it has been down for 15 minutes? This is the part I am getting confused on. I was thinking that if I have a 5 minute check interval with a 1 minute retry and the max check attempts is 5, then Nagios would send out an alert after the check has been critical for 10 minutes, correct? So then would Nagios wait another 5 minutes, for 15 total minutes, before sending out the notification?

Re: Host Escalation Question

Posted: Fri Apr 17, 2015 2:18 pm
by tmcdonald
jkinning wrote: I am currently using a 5 minute check interval with a 1 minute retry and max check attempts of 5. If I set the notification interval to 15 minutes when the host goes critical it will send out the notification after it has been down for 15 minutes?
The notification_interval determines how often a notification is re-sent. The first_notification_delay determines how long to wait before sending the first notification, instead of doing so immediately after it is determined that a notification must be sent. For example, with your current settings (no notification delay, 60-minute notification interval) this might happen:


Time | Check       | Notification
-----+-------------+-------------
1:00 | OK          |
1:05 | OK          |
1:10 | CRITICAL 1  |
1:11 | CRITICAL 2  |
1:12 | CRITICAL 3  |
1:13 | CRITICAL 4  |
1:14 | CRITICAL 5  | Email 1
1:15 | CRITICAL 6  |
1:16 | CRITICAL 7  |
1:17 | CRITICAL 8  |
.... | ..........  |
2:14 | CRITICAL 65 | Email 2
Setting a first notification delay of 15 will change that first email from being sent at 1:14 to being sent at 1:29, and then every hour after that. Setting the notification interval to 20 (default 60) will still send the first email at 1:14, but the second at 1:34.
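The arithmetic in the table can be reproduced with a short Python sketch. It is only a simplified model of the timings described in this post, not how Nagios schedules anything internally: the function names are hypothetical, all times are in minutes, and it assumes the first notification delay is simply added to the moment the hard state is reached.

```python
def hard_state_minute(first_failure, retry_interval, max_check_attempts):
    """Minute at which the final retry fails and the state becomes hard.

    The first failed check counts as attempt 1, so there are
    (max_check_attempts - 1) retries after it.
    """
    return first_failure + (max_check_attempts - 1) * retry_interval


def notification_minutes(hard_minute, first_notification_delay,
                         notification_interval, count):
    """Minutes at which the first `count` notifications would go out.

    Assumption: the delay is counted from the hard state, as in the
    example in this post.
    """
    first = hard_minute + first_notification_delay
    return [first + i * notification_interval for i in range(count)]


def clock(minute):
    """Format minutes since midnight as H:MM."""
    return f"{minute // 60}:{minute % 60:02d}"


if __name__ == "__main__":
    # First CRITICAL result at 1:10, 1 minute retry, 5 max check attempts.
    hard = hard_state_minute(70, retry_interval=1, max_check_attempts=5)
    print(clock(hard))                                                  # 1:14
    print([clock(m) for m in notification_minutes(hard, 0, 60, 2)])    # ['1:14', '2:14']
    print([clock(m) for m in notification_minutes(hard, 15, 60, 2)])   # ['1:29', '2:29']
    print([clock(m) for m in notification_minutes(hard, 0, 20, 2)])    # ['1:14', '1:34']
```

The three calls correspond to the current settings, a first notification delay of 15, and a notification interval of 20.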

Does this make sense?

Re: Host Escalation Question

Posted: Tue Apr 21, 2015 8:03 am
by jkinning
Yes and no. I think I am just confusing myself, and considering how critical these systems are, the Business Unit's need for a 15 minute first notification is just adding pressure.

If I want 15 minutes, I can leave my check settings at check interval 5 minutes, retry interval 1 minute, and max check attempts 5. That would fire off a notification after 11 minutes, right or no? Or I could change to check interval 1 minute, retry interval 1 minute, and max check attempts 13, but that would probably produce unnecessary overhead on the Nagios server.

So if I leave everything at 5,1,5 and then set the first notification delay to 4 minutes, would that send the notification after the host/service has been down for 15 minutes?

This is one of those things where the harder I think about it the more confused I get. Please excuse my Nagios newbieness. :)

Re: Host Escalation Question

Posted: Tue Apr 21, 2015 10:40 am
by lmiltchev
This is somewhat relative. You have 10 minutes between the last check that showed the host in an OK state and the time the notification is sent, for example:

12:00 OK
12:05 CRITICAL (soft)
12:06
12:07
12:08
12:09
12:10 CRITICAL (hard) -> Notification sent (if first_notification_delay = 0)

This doesn't mean your host was in a down state for 10 minutes before you were notified. It could've gone down anytime between 12:00 and 12:05. With the check interval of 5 min, Nagios will find out that the host is down at 12:05 (not sooner), which is OK in the majority of the cases. I would recommend leaving the defaults and playing with the first_notification_delay in order to accomplish your goal, unless you have a really good reason to check this host more often.
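The point about the real outage time can be put as a small range calculation. This Python sketch is a hypothetical model only: the function name is made up, times are in minutes, `soft_to_hard` is the time from the first failed check to the hard state (5 in the timeline above), and the first notification delay is assumed to be added after the hard state.

```python
def outage_to_notification_bounds(check_interval, soft_to_hard,
                                  first_notification_delay=0):
    """Return (shortest, longest) real downtime before the first notification.

    Shortest: the host failed just before the check that first saw it down.
    Longest: the host failed just after the last check that saw it OK,
    so a whole check interval passed before Nagios noticed.
    """
    shortest = soft_to_hard + first_notification_delay
    longest = check_interval + soft_to_hard + first_notification_delay
    return shortest, longest


if __name__ == "__main__":
    # Timeline above: 5 minute check interval, 5 minutes from soft to hard.
    print(outage_to_notification_bounds(5, 5))       # (5, 10)
    # Same checks with a 5 minute first notification delay.
    print(outage_to_notification_bounds(5, 5, 5))    # (10, 15)
```

Under these assumptions the notification time is a window rather than a single number, which is why adjusting first_notification_delay is a simpler lever than changing the check settings.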

Re: Host Escalation Question

Posted: Tue Apr 21, 2015 1:27 pm
by jkinning
Thanks. I'll give this a go and see what happens.

Appreciate the clarification and recommendation! Feel free to close this.