Nagios 4 host/service recovery

mdhart
Posts: 2
Joined: Thu Nov 28, 2013 8:46 am

Nagios 4 host/service recovery

Post by mdhart »

I'm configuring Nagios 4.0.2, and absolutely love that services no longer alert if their parent host is down. We have over 30 services on some systems, and receiving 30+1 notifications when a host was down was annoying, to say the least.

However, Nagios 4 hasn't quite solved everything. When a host recovers (the ping check says so), I get alerts for any services that are still marked as down on their next check. They are legitimately down: the OS may not have gotten around to starting them, or the collectd/graphite data we're depending on hasn't made the full round trip yet, etc. Is there a way to configure Nagios to give the system a chance to boot all the way before alerting on anything? Ideally I'd like to not notify on any services for a few minutes after the host check is successful.

An example timeline might be like this:

1:00 - host goes down, notification sent for host down.
1:01 - services set to CRITICAL, check 1/3
1:04 - all services are CRITICAL, check 3/3.
1:09 - first_notification_delay on services expires, but no notifications sent in v4 (yay!)
1:20 - host comes back, notification sent for host recovery.
1:20:10 - check on service 1 runs, fails (still) due to operating system not fully up (or collectd/graphite data not available yet). Sends notification.
1:21:10 - check on service 1 succeeds, recovery notification sent.

It's that notification at 1:20:10 in my example I want subdued. Any ideas how?

thanks
mike
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios 4 host/service recovery

Post by slansing »

One thing you could do, especially if you are getting a socket timeout, is extend the timeout to a reasonable level by passing the "-t" flag. Most plugins support this, so you could define an NRPE check like so, for example:


$USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c check_disk -a "-w 10 -c 30"
mdhart

Re: Nagios 4 host/service recovery

Post by mdhart »

Digging into this a bit more, the NRPE checks are behaving with timeout set to a reasonable number, like the example slansing showed.

What's causing the grief are my checks that hit graphite. I have a lot of collectd checks pushing data to graphite, and a check_graphite script pulling statistics. So what happens is the host comes up, but the collectd data hasn't made it to graphite yet, so the check_graphite check fails (as it has since the host went down).
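One possible angle on the missing-data case, as a rough sketch only: have the check distinguish "no data yet" from "bad data" and return UNKNOWN instead of CRITICAL when the series is empty. How the actual check_graphite script treats empty series is not stated here, so the function names and thresholds below are hypothetical. The input is assumed to be Graphite-style [value, timestamp] pairs with null (None) for missing points, queried over a window short enough that it only covers the outage.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(datapoints):
    """Return the newest non-null value from [value, timestamp] pairs, or None."""
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def classify(datapoints, warn, crit):
    """Map a series to a plugin exit code; no data yields UNKNOWN, not CRITICAL."""
    value = latest_value(datapoints)
    if value is None:
        # Nothing has arrived yet (e.g. host just booted): report UNKNOWN.
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK
```

Whether an UNKNOWN result still notifies depends on the service's notification_options (the u flag), so this would only help if UNKNOWN is excluded there or handled differently from CRITICAL.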

The only thing I can think of is putting an event handler in place on the host that triggers when a host comes up, that somehow ??? suppresses notifications on the services for that host for 1-2 minutes. But that feels kludgy to me.
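For what it's worth, the event handler idea might look roughly like the sketch below. It is untested, and it swaps "suppress notifications" for a short scheduled downtime: when the host reaches a hard UP state, it writes a SCHEDULE_HOST_SVC_DOWNTIME external command covering all services on that host for a couple of minutes. The command file path, the 120-second grace period, the author string, and the function names are all assumptions; the real path is whatever command_file is set to in nagios.cfg.

```python
import time

# Assumed path; the real location is set by command_file in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def build_downtime_command(host, now, grace_seconds=120,
                           author="eventhandler",
                           comment="post-recovery grace period"):
    """Return a SCHEDULE_HOST_SVC_DOWNTIME external command line.

    Fields: host;start;end;fixed;trigger_id;duration;author;comment
    """
    start = int(now)
    end = start + grace_seconds
    return (f"[{start}] SCHEDULE_HOST_SVC_DOWNTIME;{host};{start};{end};"
            f"1;0;{grace_seconds};{author};{comment}\n")


def handle_host_event(host, state, state_type, command_file=COMMAND_FILE):
    """Schedule service downtime only when the host reaches a hard UP state."""
    if state != "UP" or state_type != "HARD":
        return None
    line = build_downtime_command(host, time.time())
    # The command file is normally a named pipe that Nagios reads from.
    with open(command_file, "w") as fh:
        fh.write(line)
    return line
```

It would be hooked up through the host's event_handler directive, with a command definition that passes $HOSTNAME$, $HOSTSTATE$ and $HOSTSTATETYPE$ as arguments. Two caveats: the downtime has to be processed before the first service check after recovery runs, and services with s in their notification_options would still send downtime start/end notifications.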

Thoughts?
slansing

Re: Nagios 4 host/service recovery

Post by slansing »

Is there any way you can extend the graphite check out a bit more?