Issue with service escalations (treat ack different)
Posted: Mon Sep 03, 2018 10:32 am
What I'm trying to achieve is the following:
Send out a message to Slack when a service changes to warning or critical around the clock. Repeat this notification every day until it is acknowledged or changed to OK.
After playing around with escalations I was finally able to manage this using the following strategy:
- create a copy of the Slack contact with notification period "workhours"
- create an escalation with first_notification 1 that repeats every 8 hours (a working day has 8 working hours, so basically repeating every day)
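The strategy relies on a "workhours" timeperiod that only covers the working day. A minimal sketch of such a timeperiod (the exact hours and the timeperiod name are assumptions; adjust to your environment):

```cfg
define timeperiod{
    timeperiod_name  workhours
    alias            Business hours (Mon-Fri, 09:00-17:00)
    monday           09:00-17:00
    tuesday          09:00-17:00
    wednesday        09:00-17:00
    thursday         09:00-17:00
    friday           09:00-17:00
}
```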
define contact{
    contact_name                   slack-workhours
    use                            generic-contact
    alias                          Slack Channel for repeated notifications (during working hours)
    service_notification_commands  notify-service-by-slack
    host_notification_commands     notify-host-by-slack
    service_notification_period    workhours
}
define serviceescalation{
    host_name              *
    service_description    *
#   hostgroup_name         !prod-env       ; escalation is not valid for the production environment
    first_notification     2               ; after the first notification this escalation should kick in ..
    last_notification      0               ; .. and repeat forever
    contacts               slack-workhours ; send notifications to slack-workhours (which accepts notifications only during workhours)
    notification_interval  480             ; repeat the problem notification every working day (8 hours = 480 minutes)
    escalation_period      24x7            ; this escalation is valid 24x7 ..
    escalation_options     w,c,u           ; .. for the states warning, critical and unknown
}
This works perfectly for warning, critical, unknown and OK. Alerts outside business hours are sent to Slack (but repeated notifications are only sent during business hours), and recovery notifications are also sent during the night. This way, a look at Slack gives sufficient information outside business hours without the need to check Nagios.
Now the "BUT": acknowledgement notifications are suppressed as well.
If an engineer acknowledges a service alert, he is busy with that issue. I don't want the other engineers to be notified, or to have to check Nagios only to find out that someone else is already looking into it.
How can I achieve the use case described above while acknowledgement notifications are still sent out outside business hours?
Thanks in advance.