Complete Off The Wall Notification Handler Idea
Posted: Mon Mar 24, 2014 12:02 pm
by technick
Over the weekend I had a situation where a data center went offline, causing Nagios to flip out on a scale I'd never seen before. Normally, when things go south, someone should get a page and ack / schedule downtime for all hosts / services that are broken. That didn't happen in my case: my monitoring infrastructure couldn't get pages out to the on-call engineer, but it had no problem flooding everybody's inbox and opening 9000+ tickets. My services are configured to re-notify every 5 minutes while in a critical state, which IMHO is fair, especially for the line of work my business is in. This event inspired an idea that has haunted me all weekend.
I believe it would be possible to build a custom notification handler that does some logic work before sending the notification alert out. Using my weekend issues as an example, a separate alert email for each service might be considered overkill, especially when several hundred services are in a critical state at once. It would be nice if we could apply some logic and roll all of these service alerts up into a single alert email.
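One way to sketch that rollup, assuming a spool-file approach: the notification command appends each alert to a spool instead of mailing it, and a separate cron job flushes the spool as one digest email. The file path, field layout, and helper names here are all hypothetical, not anything Nagios ships with.

```python
#!/usr/bin/env python
# Hypothetical rollup sketch: queue alerts to a spool file, then build one
# digest message from everything queued. SPOOL path is an assumption.
import os
import time

SPOOL = "/tmp/nagios-notify-spool"  # assumed spool location

def queue_alert(host, service, state, output, spool=SPOOL):
    """Append one alert as a tab-separated line instead of emailing it."""
    with open(spool, "a") as f:
        f.write("%d\t%s\t%s\t%s\t%s\n" % (time.time(), host, service, state, output))

def build_digest(spool=SPOOL):
    """Roll every queued alert into a single message body; clears the spool."""
    if not os.path.exists(spool):
        return None
    with open(spool) as f:
        lines = [l.rstrip("\n").split("\t") for l in f if l.strip()]
    os.remove(spool)
    if not lines:
        return None
    body = ["%d alerts in this window:" % len(lines)]
    for ts, host, service, state, output in lines:
        body.append("  %s/%s is %s: %s" % (host, service, state, output))
    return "\n".join(body)
```

The digest would then be handed to whatever mail command the notification setup already uses, so only one email goes out per flush window.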
If a feature like this would be a major benefit to you or other users, let's get together and collaborate on this.

Re: Complete Off The Wall Notification Handler Idea
Posted: Mon Mar 24, 2014 1:15 pm
by scottwilkerson
You aren't the first to ask for this, but the only way I see it being possible would be to force a delay in notifications, because if they are sent in real time there is no way to aggregate them...
As it is, with a default setup using a retry_interval, if a host goes down it should only send the host notification, not one for every service. Additionally, you can utilize parent/child relationships to change the state from DOWN to UNREACHABLE, and that is something you can decide not to notify on...
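The parent/child setup described above looks roughly like this in the object configuration (host names and addresses here are placeholders, not from the thread). With `parents` set, hosts behind a dead router flip to UNREACHABLE instead of DOWN, and leaving `u` out of `notification_options` suppresses notifications for that state:

```
define host {
    host_name               dc-router
    address                 10.0.0.1
    use                     generic-host
}

define host {
    host_name               app-server-01
    address                 10.0.1.10
    parents                 dc-router        ; reached through dc-router
    use                     generic-host
    notification_options    d,r              ; no 'u': silent on UNREACHABLE
}
```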
Re: Complete Off The Wall Notification Handler Idea
Posted: Mon Mar 24, 2014 5:32 pm
by technick
scottwilkerson wrote:You aren't the first to ask for this, but the only way I see it being possible would be to force a delay in notifications, because if they are sent in real time there is no way to aggregate them...
As it is, with a default setup using a retry_interval, if a host goes down it should only send the host notification, not one for every service. Additionally, you can utilize parent/child relationships to change the state from DOWN to UNREACHABLE, and that is something you can decide not to notify on...
I'm just brainstorming here, so this could be way out in left field.
You could use an event handler to aggregate the alerts, and after a certain condition is met (let's say 20 or more services in a concurrent alert state), it would perform an ack on the alerts and a separate script (probably Python) would take over alerting from Nagios until those alerts recover. As recoveries take place, the event handler would remove them from the aggregate list. When conditions return to normal (fewer than 20 services in an alert state), Nagios would again be the sole notifier for all services.
We should be able to harvest notification settings from the objects.cache file and pull in additional information from the status.dat file.
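The status.dat side of that idea could be sketched like this: parse the `servicestatus { ... }` blocks into dicts, count the ones in a critical state, and compare against the takeover threshold. The function names and the threshold of 20 are just the brainstormed numbers from above, not a finished design.

```python
#!/usr/bin/env python
# Sketch: pull service states out of status.dat-style text and decide
# whether the external alerting script should take over from Nagios.
import re

def parse_status_blocks(text, block_type="servicestatus"):
    """Parse 'block_type { key=value ... }' blocks into a list of dicts."""
    blocks = []
    for body in re.findall(r"%s\s*\{(.*?)\}" % block_type, text, re.S):
        entry = {}
        for line in body.strip().splitlines():
            line = line.strip()
            if "=" in line:
                key, _, val = line.partition("=")
                entry[key.strip()] = val
        blocks.append(entry)
    return blocks

def critical_services(text):
    """current_state=2 is CRITICAL in status.dat."""
    return [b for b in parse_status_blocks(text) if b.get("current_state") == "2"]

def should_take_over(text, threshold=20):
    """True once the concurrent-critical count hits the threshold."""
    return len(critical_services(text)) >= threshold
```

Object settings (contacts, notification options) could be harvested the same way from objects.cache, since it uses the same block layout.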
This is just spit-balling an idea and exploring the possibilities =)
Re: Complete Off The Wall Notification Handler Idea
Posted: Tue Mar 25, 2014 1:22 pm
by tmcdonald
I'll spit-ball back I suppose.
You could do something similar with BPI
http://assets.nagios.com/downloads/nagi ... BPI_v2.pdf
Instead of aggregating emails, you basically make a meta-check that only alerts when there are, say, 10 or more critical issues within a group. It's not 100% what you wanted, but it came to mind first.
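The meta-check idea boils down to a plugin that follows the standard Nagios exit-code contract (0/1/2/3) and only goes CRITICAL once enough group members are critical. In a real deployment BPI derives the member states itself; this sketch takes them as arguments purely for illustration, and the warning-on-any-critical behavior is an assumption.

```python
#!/usr/bin/env python
# Hypothetical meta-check sketch: CRITICAL only when >= threshold members
# of a group are critical, so one rolled-up alert replaces hundreds.
import sys

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def meta_check(member_states, threshold=10):
    """Return (exit_code, message) for a list of member state codes."""
    crit = sum(1 for s in member_states if s == CRITICAL)
    if crit >= threshold:
        return CRITICAL, "CRITICAL - %d of %d members critical" % (crit, len(member_states))
    if crit > 0:
        return WARNING, "WARNING - %d of %d members critical" % (crit, len(member_states))
    return OK, "OK - all %d members ok" % len(member_states)

if __name__ == "__main__":
    states = [int(a) for a in sys.argv[1:]]
    code, msg = meta_check(states)
    print(msg)
    sys.exit(code)
```

A single service defined against this check would then notify once for the whole group instead of once per member.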
Re: Complete Off The Wall Notification Handler Idea
Posted: Wed Mar 26, 2014 10:24 am
by technick
I've gotta admit I am not very familiar with what BPI can do yet.
I'll have to investigate and get back to you on that =)