Complete Off The Wall Notification Handler Idea

technick
Posts: 49
Joined: Tue Feb 04, 2014 10:30 am
Location: Denver, CO

Complete Off The Wall Notification Handler Idea

Post by technick »

Over the weekend a data center went offline, and Nagios flipped out on a scale I had never seen before. Normally, when things go south, someone gets a page and acks / schedules downtime for all the hosts / services that are broken. That didn't happen in my case: my monitoring infrastructure couldn't get pages out to the on-call engineer, but it had no problem flooding everybody's inbox and opening 9000+ tickets. My services are configured to re-notify every 5 minutes while in a critical state, which IMHO is fair, especially for the line of work my business is in. This event inspired an idea which has haunted me all weekend.

I believe it would be possible to build a custom notification handler that applies some logic before sending the notification out. Using my weekend as an example, a separate alert email for each service is arguably overkill, especially when several hundred services are in critical condition at once. It would be nice if we could roll all of these service alerts up into a single alert email.
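To make the roll-up idea concrete, here is a minimal sketch of just the formatting step. It assumes the handler has already collected the alerts as (host, service, state) tuples; the function name `build_digest` and the layout of the email body are made up for illustration, not anything Nagios provides.

```python
from collections import defaultdict


def build_digest(alerts):
    """Roll a list of (host, service, state) tuples up into one email body."""
    # Group the individual service alerts under the host they belong to.
    by_host = defaultdict(list)
    for host, service, state in alerts:
        by_host[host].append((service, state))

    # One summary line, then one indented line per service, grouped by host.
    lines = [f"{len(alerts)} service alerts on {len(by_host)} hosts", ""]
    for host in sorted(by_host):
        lines.append(f"{host}:")
        for service, state in sorted(by_host[host]):
            lines.append(f"  {state:<8} {service}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        ("web01", "HTTP", "CRITICAL"),
        ("web01", "Disk /", "CRITICAL"),
        ("db01", "MySQL", "CRITICAL"),
    ]
    print(build_digest(sample))
```

Collecting the alerts in the first place is the hard part; this only shows that several hundred of them can collapse into one readable message.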

If a feature like this would be a major benefit to you or other users, let's get together and collaborate on this. :D
----------------------
Nagios Jedi in training.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Complete Off The Wall Notification Handler Idea

Post by scottwilkerson »

You aren't the first to ask for this, but the only way I see it as possible would be to force a delay in notifications because if they are sent real-time, there would be no way to aggregate them...

As it is, with a default setup, if you use a retry_interval, if a host goes down it should only send the host notification, and not one for every service. Additionally, you can utilize parent/child relationships to change the state from DOWN to UNREACHABLE, and that is something you can decide not to notify on...
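As a rough illustration of the forced-delay approach, the sketch below holds notifications for a fixed window and then releases them as one batch. The class name `NotificationBuffer`, the 120-second default, and the in-memory list are all assumptions for the example; a real notification command runs as a separate process per alert, so the pending list would have to live in a spool file or similar.

```python
import time


class NotificationBuffer:
    """Hold notifications for a fixed window, then release them as one batch."""

    def __init__(self, window_seconds=120):
        self.window_seconds = window_seconds
        self._pending = []
        self._first_seen = None  # time the oldest pending alert arrived

    def add(self, alert, now=None):
        """Queue one alert; the window starts with the first queued alert."""
        now = time.time() if now is None else now
        if self._first_seen is None:
            self._first_seen = now
        self._pending.append(alert)

    def flush_if_due(self, now=None):
        """Return the batch once the window has elapsed, else an empty list."""
        now = time.time() if now is None else now
        if self._first_seen is None or now - self._first_seen < self.window_seconds:
            return []
        batch, self._pending, self._first_seen = self._pending, [], None
        return batch
```

The trade-off is the one described above: every notification is late by up to the window length, in exchange for being able to send one message instead of hundreds.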
Former Nagios employee
technick
Posts: 49
Joined: Tue Feb 04, 2014 10:30 am
Location: Denver, CO

Re: Complete Off The Wall Notification Handler Idea

Post by technick »

scottwilkerson wrote: You aren't the first to ask for this, but the only way I see it as possible would be to force a delay in notifications because if they are sent real-time, there would be no way to aggregate them...

As it is, with a default setup, if you use a retry_interval, if a host goes down it should only send the host notification, and not one for every service. Additionally, you can utilize parent/child relationships to change the state from DOWN to UNREACHABLE, and that is something you can decide not to notify on...

I'm just brainstorming here, so this could be way out in left field.

You could use an event handler to aggregate the alerts. Once a certain condition is met (let's say 20 or more services are in a concurrent alert state), it would ack the alerts, and a separate script (probably Python) would take over alerting from Nagios for these alerts until recovery. As recoveries take place, the event handler would remove them from the aggregate list. When conditions return to normal (fewer than 20 services in alert state), Nagios would again be the sole notifier for all services.
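Here is a minimal sketch of the bookkeeping that idea would need, assuming the event handler is called with the host name, service description, and state string for every state change. The class `AlertAggregator` and the mode names "aggregate" and "nagios" are invented for the example; the actual ack and the takeover script are left out.

```python
class AlertAggregator:
    """Track services in alert state and decide who should notify."""

    def __init__(self, threshold=20):
        self.threshold = threshold
        self.active = set()  # (host, service) pairs currently in alert state

    def handle_event(self, host, service, state):
        """Update the aggregate list from one event-handler invocation."""
        key = (host, service)
        if state == "OK":
            # A recovery removes the service from the aggregate list.
            self.active.discard(key)
        else:
            self.active.add(key)
        return self.mode()

    def mode(self):
        """'aggregate' when the threshold is met, otherwise 'nagios'."""
        return "aggregate" if len(self.active) >= self.threshold else "nagios"
```

The ack step would presumably go through the Nagios external command file (ACKNOWLEDGE_SVC_PROBLEM), and the set would need to be persisted between invocations, since each event handler run is its own process.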

We should be able to harvest notification settings from the objects.cache file and pull in additional information from the status.dat file.
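For the status.dat side, a small parser might look like the sketch below. It assumes the usual layout of `blocktype {` followed by `key=value` lines and a closing `}`, and that `current_state` 2 means CRITICAL for services; the function names are hypothetical, and the file path is left to the caller because it differs between installs.

```python
def parse_status_blocks(text):
    """Yield (block_type, fields) for each 'type { key=value ... }' block."""
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if block_type is None:
            # Outside a block: only a 'name {' line opens a new one.
            if line.endswith("{"):
                block_type, fields = line[:-1].strip(), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            # Split on the first '=' only; values may contain '=' themselves.
            key, value = line.split("=", 1)
            fields[key] = value


def critical_services(text):
    """Return (host, service) pairs whose current_state is 2 (CRITICAL)."""
    return [
        (f["host_name"], f["service_description"])
        for kind, f in parse_status_blocks(text)
        if kind == "servicestatus" and f.get("current_state") == "2"
    ]
```

The same block parser should work for objects.cache as far as the `key=value` idea goes, but that file separates keys and values with whitespace rather than `=`, so the field-splitting line would need adjusting.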

This is just spit-balling an idea and exploring the possibilities =)
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Complete Off The Wall Notification Handler Idea

Post by tmcdonald »

I'll spit-ball back I suppose.

You could do something similar with BPI

http://assets.nagios.com/downloads/nagi ... BPI_v2.pdf

Instead of aggregating emails, you basically make a meta-check that only alerts when there are, say, 10 or more critical issues within a group. Not 100% what you wanted, but it came to mind first.
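Just to show the shape of the meta-check logic, and not how BPI itself is implemented, here is a tiny sketch. It assumes the member states are already available as Nagios state codes (0 OK, 1 WARNING, 2 CRITICAL) and returns a plugin-style exit code and message; `meta_check` and its threshold parameter are hypothetical names.

```python
def meta_check(states, critical_threshold=10):
    """Return (exit_code, message) for a group of member service states."""
    # Count only members whose state code is 2 (CRITICAL).
    critical = sum(1 for s in states if s == 2)
    if critical >= critical_threshold:
        return 2, f"CRITICAL - {critical} of {len(states)} members critical"
    return 0, f"OK - {critical} of {len(states)} members critical"
```

With notifications enabled only on the meta-check and disabled on the members, a group-wide outage would produce one alert instead of one per service.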
Former Nagios employee
technick
Posts: 49
Joined: Tue Feb 04, 2014 10:30 am
Location: Denver, CO

Re: Complete Off The Wall Notification Handler Idea

Post by technick »

I've gotta admit I am not very familiar with what BPI can do yet.

I'll have to investigate and get back to you on that =)