Recovery Latency

Posted: Wed Sep 26, 2012 9:32 am
by jbennett
I'm noticing a problem on our instance: hosts and services show as down almost instantly (as expected), but once the underlying problem is resolved, they sometimes take hours to show as recovered.

Take a simple ping check on a camera, for instance.

These were set up without any check intervals on the camera itself; the settings come from a template.

However, that template doesn't have check intervals specifically set, only a check period. That template has another template applied to it. It's this second template that has the check intervals assigned.

Is this a correct way of using templates? Or should I look at revamping this?

Re: Recovery Latency

Posted: Wed Sep 26, 2012 10:01 am
by CGraham
Re: templates. You'll want to set the values in the template that you'd like to be inherited by the services attached to the template. I try to set them as completely as possible so I get fewer unexpected results.

Re: recovery times. I would think this would be based on your "retry interval", since Nagios uses this interval (instead of the check interval) after a state change is detected. What is your retry interval?

The other cause of long recoveries I've seen is a host or service that is flapping: you don't get the recovery until it settles down (which could take hours).
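
To put rough numbers on the interval question, here is a minimal sketch in Python. It assumes a simplified model: the retry interval applies while the object is still in a soft (unconfirmed) problem state, the check interval applies otherwise, and one interval unit is 60 seconds (the usual interval_length). The function name and the example values are illustrative only, and the model ignores scheduling latency, on-demand host checks, and flapping.

```python
def worst_case_recovery_minutes(check_interval, retry_interval,
                                in_soft_state, interval_length_s=60):
    """Upper bound on the wait until the next scheduled check.

    Simplified assumption: retry_interval is used while the problem is
    still in a soft state, check_interval once it is a hard state.
    """
    units = retry_interval if in_soft_state else check_interval
    return units * interval_length_s / 60


# Illustrative values only: check every 5 units, retry every 1 unit.
soft_wait = worst_case_recovery_minutes(5, 1, in_soft_state=True)
hard_wait = worst_case_recovery_minutes(5, 1, in_soft_state=False)
print(soft_wait, hard_wait)
```

Under this model, intervals of a few minutes cannot by themselves explain a recovery that takes hours, so it is worth checking the effective values first and then looking elsewhere.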

Re: Recovery Latency

Posted: Wed Sep 26, 2012 11:22 am
by jbennett
Here's how the person before me set it up:

The actual host in Nagios doesn't have any check or alert settings. The host has a template assigned to it: xiwizard_genericnetdevice_host

When I go to that template, I see the following:

Additional Templates - xiwizard_generic_host
Under check settings:
  • Check period - 24x7
  • Freshness threshold - 1800
  • Event handler - host-notify-by-email

When I go to the xiwizard_generic_host template, I see the following:

Additional Templates - none
Under check settings:
  • Max. Check attempts - 5
  • Retry interval - 1
  • Check interval - 5
  • Event handler - host-notify-by-email
When these items are still showing down, they aren't showing as flapping. They also don't alert via email as flapping. They just still show down.
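
For anyone following along, the two-level template chain described above can be sketched in Python to see which values the host should end up with. This is only a model of the inheritance rule (the object's own values win, then the templates it uses, then their parents), not how Nagios itself is implemented. The host name camera-01 and the function name are hypothetical; the template names and values are the ones listed in this post.

```python
# Template contents as listed in the post; "use" names the parent templates.
TEMPLATES = {
    "xiwizard_generic_host": {
        "use": [],
        "max_check_attempts": 5,
        "retry_interval": 1,
        "check_interval": 5,
        "event_handler": "host-notify-by-email",
    },
    "xiwizard_genericnetdevice_host": {
        "use": ["xiwizard_generic_host"],
        "check_period": "24x7",
        "freshness_threshold": 1800,
        "event_handler": "host-notify-by-email",
    },
}


def resolve(obj, templates):
    """Return the effective settings of obj after applying its templates."""
    effective = {}
    # Apply parents in reverse so the first template listed takes precedence.
    for parent in reversed(obj.get("use", [])):
        effective.update(resolve(templates[parent], templates))
    # Values set directly on the object override anything inherited.
    effective.update({k: v for k, v in obj.items() if k != "use"})
    return effective


# Hypothetical host with no check settings of its own.
camera = {"host_name": "camera-01", "use": ["xiwizard_genericnetdevice_host"]}
effective = resolve(camera, TEMPLATES)
print(effective["check_interval"], effective["retry_interval"],
      effective["max_check_attempts"], effective["check_period"])
```

If inheritance works as modeled here, the camera should still get a check interval of 5 and a retry interval of 1 through the second template, even though neither the host nor the first template sets them.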

Re: Recovery Latency

Posted: Wed Sep 26, 2012 5:11 pm
by scottwilkerson
Flapping requires a certain scenario before a host or service goes into a flapping state. Additionally, do you have flap detection enabled?
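
As a rough illustration of that scenario, here is a simplified Python sketch of flap detection based on the share of state changes in the recent check history. It deliberately leaves out the extra weight Nagios gives to newer state changes, and the low and high thresholds are illustrative values, not necessarily what your configuration uses. The function names are made up for this example.

```python
def percent_state_change(history):
    """Unweighted percentage of consecutive checks whose state differs."""
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return 100.0 * changes / (len(history) - 1)


def is_flapping(history, was_flapping, low=20.0, high=30.0):
    """Apply hysteresis: start above `high`, stop only below `low`.

    The thresholds here are illustrative assumptions.
    """
    pct = percent_state_change(history)
    if was_flapping:
        return pct >= low
    return pct > high


steady_down = ["DOWN"] * 21                # never changes state
alternating = ["UP", "DOWN"] * 10 + ["UP"] # changes on every check
print(is_flapping(steady_down, was_flapping=False))
print(is_flapping(alternating, was_flapping=False))
```

A host that simply stays DOWN has no state changes at all, so it would not be treated as flapping under this model, however long it stays down.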

Re: Recovery Latency

Posted: Thu Sep 27, 2012 9:05 am
by jbennett
Yes, flap detection is enabled on the xiwizard_generic_host template, but not on the xiwizard_genericnetdevice_host template:

Check Settings:
  • Flap detection enabled - On
  • Retain status information - On
  • Retain non-status information - On
  • Process perf data - On

Re: Recovery Latency

Posted: Thu Sep 27, 2012 5:12 pm
by scottwilkerson
Are you sure the host was in a flapping state (UP DOWN UP DOWN UP DOWN etc), or is it just down?

Re: Recovery Latency

Posted: Fri Sep 28, 2012 5:24 pm
by 455157
If you go to the Service Detail for one of the services in question and "Schedule an Immediate Check", does the status remain bad or refresh as OK?

Re: Recovery Latency

Posted: Mon Oct 01, 2012 8:57 am
by jbennett
scottwilkerson wrote: Are you sure the host was in a flapping state (UP DOWN UP DOWN UP DOWN etc), or is it just down?
It was just down. It was not showing as flapping in Nagios. For what it's worth, this affects more than one host, and it's been happening for a while now.
455157 wrote: If you go to the Service Detail for one of the services in question and "Schedule an Immediate Check", does the status remain bad or refresh as OK?
In the past, when I've done this, the status hasn't changed, even though I can ping the device from the Nagios server as well as from my desktop.

Re: Recovery Latency

Posted: Mon Oct 01, 2012 9:16 am
by scottwilkerson
What version of Nagios XI are you using?

Re: Recovery Latency

Posted: Mon Oct 01, 2012 9:53 am
by jbennett
Unfortunately, I'm still on Nagios XI 2011R2.3.

Not being able to get out past our proxy has made updating quite difficult.

Is this something that has been improved upon?