Nagios 4 host/service recovery
Posted: Wed Dec 04, 2013 9:53 am
I'm configuring Nagios 4.0.2, and absolutely love that services no longer alert if their parent host is down. We have over 30 services on some systems, and receiving 30+1 notifications when a host was down was annoying, to say the least.
However, Nagios 4 hasn't quite solved everything. When a host recovers (the ping check says so), I get alerts for any services that still fail on their next check. Those failures are legitimate: the OS may not have gotten around to starting the services, or the collectd/graphite data we depend on hasn't made the full round trip yet, etc. Is there a way to configure Nagios to give the system a chance to boot all the way before alerting on anything? Ideally I'd like to suppress all service notifications for a few minutes after the host check succeeds.
An example timeline might be like this:
1:00 - host goes down, notification sent for host down.
1:01 - services set to CRITICAL, check 1/3
1:04 - all services are CRITICAL, check 3/3.
1:09 - first_notification_delay on services expires, but no notifications sent in v4 (yay!)
1:20 - host comes back, notification sent for host recovery.
1:20:10 - check on service 1 runs, fails (still) due to operating system not fully up (or collectd/graphite data not available yet). Sends notification.
1:21:10 - check on service 1 succeeds, recovery notification sent.
It's that notification at 1:20:10 in my example I want subdued. Any ideas how?
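To make the behavior I'm after concrete, here is a rough Python sketch of the rule. This is not Nagios code or a Nagios API: the function name `should_notify`, its parameters, and the 5-minute `GRACE` value are all made up for illustration ("a few minutes" is assumed to mean 5).

```python
from datetime import datetime, timedelta

# Hypothetical grace period after host recovery; 5 minutes is an assumption.
GRACE = timedelta(minutes=5)

def should_notify(service_ok, host_recovered_at, now, grace=GRACE):
    """Return True if a service problem notification should go out at `now`.

    Illustrative only: models the desired behavior, not how Nagios works.
    """
    if service_ok:
        # Service is fine, so there is no problem to alert on.
        return False
    if host_recovered_at is not None and now - host_recovered_at < grace:
        # Host only just came back; give it time to finish booting.
        return False
    return True

# Timeline from above: host recovers at 1:20:00.
host_up = datetime(2013, 12, 4, 1, 20, 0)
early = should_notify(False, host_up, datetime(2013, 12, 4, 1, 20, 10))
late = should_notify(False, host_up, datetime(2013, 12, 4, 1, 26, 0))
print(early, late)
```

So the failed check at 1:20:10 would stay quiet, but a service still failing at 1:26 would alert as usual.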
thanks
mike