Nagios 4 host/service recovery

mdhart · Post by **mdhart** » Wed Dec 04, 2013 9:53 am

I'm configuring Nagios 4.0.2, and absolutely love that services no longer alert if their parent host is down. We have over 30 services on some systems, and receiving 30+1 notifications when a host was down was annoying, to say the least.

However, Nagios 4 hasn't quite solved everything. When a host recovers (the ping check says so), some services are still legitimately down on their next check: the OS may not have gotten around to starting them, or the collectd/graphite data we depend on hasn't made the full round trip yet, etc. I then get alerts for those services. Is there a way to configure Nagios to give the system a chance to boot all the way before alerting on anything? Ideally I'd like to not notify on any services for a few minutes after the host check is successful.

An example timeline might be like this:

1:00 - host goes down, notification sent for host down.
1:01 - services set to CRITICAL, check 1/3
1:04 - all services are CRITICAL, check 3/3.
1:09 - notification_delay on services expires, but no notifications sent in v4 (yay!)
1:20 - host comes back, notification sent for host recovery.
1:20:10 - check on service 1 runs, fails (still) due to operating system not fully up (or collectd/graphite data not available yet). Sends notification.
1:21:10 - check on service 1 succeeds, recovery notification sent.

It's that notification at 1:20:10 in my example I want subdued. Any ideas how?

thanks
mike

slansing · Post by **slansing** » Wed Dec 04, 2013 11:53 am

One thing you could do, especially if you are getting a socket timeout in return, is extend the timeout to a reasonable level by passing the "-t" flag. Most plugins support this, so you could define an NRPE check like so, for example:

$USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c check_disk -a "-w 10 -c 30"

mdhart · Post by **mdhart** » Thu Dec 05, 2013 12:17 pm

Digging into this a bit more, the NRPE checks are behaving with timeout set to a reasonable number, like the example slansing showed.

What's causing the grief are my checks that hit graphite. I have a lot of collectd checks pushing data to graphite, and a check_graphite script pulling statistics. So what happens is the host comes up, but the collectd data hasn't made it to graphite yet, and the check_graphite check then fails (as it has since the host went down).

The only thing I can think of is putting an event handler in place on the host that triggers when a host comes up, that somehow ??? suppresses notifications on the services for that host for 1-2 minutes. But that feels kludgy to me.
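For what it's worth, a rough sketch of what such a host event handler might look like, in Python. It relies on the Nagios external command SCHEDULE_HOST_SVC_DOWNTIME, which puts all services on a host into scheduled downtime (and so suppresses their notifications) for a fixed window. The function names, the default command file path, the author/comment strings, and the 2-minute window are all assumptions for illustration, not anything confirmed in this thread; adjust them to your install.

```python
import sys
import time

# Assumed default location of the external command file; check the
# command_file setting in your nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def should_suppress(host_state, state_type):
    """Act only when the host has (hard) recovered."""
    return host_state == "UP" and state_type == "HARD"


def build_downtime_command(host, minutes=2, now=None,
                           author="eventhandler",
                           comment="post-recovery grace period"):
    """Build a SCHEDULE_HOST_SVC_DOWNTIME external command line.

    Field order: host;start;end;fixed;trigger_id;duration;author;comment
    """
    start = int(time.time()) if now is None else int(now)
    duration = minutes * 60
    end = start + duration
    return "[{0}] SCHEDULE_HOST_SVC_DOWNTIME;{1};{0};{2};1;0;{3};{4};{5}\n".format(
        start, host, end, duration, author, comment)


def submit(command_line, command_file=COMMAND_FILE):
    """Write one command line to the Nagios external command file."""
    with open(command_file, "w") as fh:
        fh.write(command_line)


def main(argv):
    # Expected arguments: $HOSTNAME$ $HOSTSTATE$ $HOSTSTATETYPE$
    if len(argv) < 4:
        return 0  # nothing to do without the macros
    host, host_state, state_type = argv[1], argv[2], argv[3]
    if should_suppress(host_state, state_type):
        submit(build_downtime_command(host))
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

The idea would be to register the script as a command that passes $HOSTNAME$ $HOSTSTATE$ $HOSTSTATETYPE$ and set it as the host's event_handler. Whether the downtime lands before the first service check after recovery depends on timing, so treat this as a starting point to test rather than a guaranteed fix.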

Thoughts?

slansing · Post by **slansing** » Thu Dec 05, 2013 6:01 pm

Is there any way you can extend the graphite check out a bit more?

Nagios Support Forum
