scottwilkerson wrote:You aren't the first to ask for this, but the only way I see it as possible would be to force a delay in notifications because if they are sent real-time, there would be no way to aggregate them...
As it is, with a default setup using a retry_interval, if a host goes down it should only send the host notification, not one for every service. Additionally, you can utilize parent/child relationships to change the state from DOWN to UNREACHABLE, and that is something you can decide not to notify on...
I'm just brainstorming here, so this could be way out in left field.
You could use an event handler to aggregate the alerts; after a certain condition is met (let's say 20 or more services are concurrently in an alert state), it would ack those alerts, and a separate script (probably Python) would take over alerting from Nagios until they recover. As recoveries take place, the event handler would remove each service from the aggregate list. When conditions return to normal (fewer than 20 services in an alert state), Nagios would again be the sole notifier for all services.
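A rough sketch of that threshold-and-ack idea, just to make it concrete. The 20-service threshold, the command-file path, and the function names here are all assumptions on my part; the external-command syntax is Nagios's standard ACKNOWLEDGE_SVC_PROBLEM format, with notify=0 so Nagios stops notifying while the external script takes over:

```python
import time

ALERT_THRESHOLD = 20  # assumed cutoff from the idea above
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"  # typical default path

def ack_command(host, service, author="aggregator",
                comment="Aggregated: handed off to external notifier"):
    """Build an ACKNOWLEDGE_SVC_PROBLEM external command line.
    Fields after the service: sticky=1, notify=0, persistent=0."""
    return "[%d] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;1;0;0;%s;%s" % (
        int(time.time()), host, service, author, comment)

def aggregate(problem_services, command_file=COMMAND_FILE):
    """If enough services are in a non-OK state, ack them all so a
    separate notifier script can take over until recovery."""
    if len(problem_services) < ALERT_THRESHOLD:
        return []  # below the threshold: Nagios stays the sole notifier
    cmds = [ack_command(h, s) for h, s in problem_services]
    with open(command_file, "w") as f:
        f.write("\n".join(cmds) + "\n")
    return cmds
```

In practice the event handler would be wired up per-service with the usual $HOSTNAME$/$SERVICEDESC$ macros and would build the problem list from status.dat, but the ack-over-a-threshold logic is the core of it.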
We should be able to harvest notification settings from the objects.cache file and pull in additional information from the status.dat file.
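Since status.dat is just a series of `servicestatus { key=value ... }` blocks, pulling the current alert list could look something like this (a deliberately naive sketch; the field names match what Nagios writes, but real parsing would want more care):

```python
import re

def parse_status_blocks(text, block_type="servicestatus"):
    """Extract key=value dicts from Nagios status.dat blocks of one type."""
    blocks = []
    for body in re.findall(r"%s\s*\{(.*?)\}" % block_type, text, re.S):
        entry = {}
        for line in body.strip().splitlines():
            if "=" in line:
                key, _, value = line.strip().partition("=")
                entry[key] = value
        blocks.append(entry)
    return blocks

def problem_services(text):
    """Return (host, service) pairs whose current_state is non-OK (!= 0)."""
    return [(b["host_name"], b["service_description"])
            for b in parse_status_blocks(text)
            if b.get("current_state", "0") != "0"]
```

The same parser could be pointed at objects.cache (e.g. `define service { ... }` style blocks would need a slightly different regex) to harvest the notification settings for each object.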
This is just spit-balling an idea and exploring the possibilities =)