Recovery Latency

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Recovery Latency

Post by jbennett »

I'm noticing a problem on our instance: hosts and services show as down almost instantly (which is as expected), but once the underlying problem is resolved, they sometimes take hours to show as recovered.

Take a simple ping check on a camera, for instance.

These were set up without any check intervals on the camera itself; the settings come from a template.

However, that template doesn't have check intervals specifically set, only a check period. That template has another template applied to it. It's this second template that has the check intervals assigned.

Is this a correct way of using templates? Or should I look at revamping this?
CGraham
Posts: 115
Joined: Tue Aug 16, 2011 2:43 pm

Re: Recovery Latency

Post by CGraham »

Re: templates. You'll want to set the values in the template that you'd like to be inherited by the services attached to the template. I try to set them as completely as possible so I get fewer unexpected results.

Re: recovery times. I would think this would be based on your "retry interval", since Nagios uses this interval (instead of the check interval) after a state change is detected. What is your retry interval?
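As a rough sketch of the timing involved: the arithmetic below assumes the default Nagios `interval_length` of 60 seconds (so intervals are in minutes) and ignores scheduler latency and on-demand checks. The function name and the example values (check interval 5, retry interval 1, 5 attempts) are placeholders for illustration. As I understand the Nagios Core documentation, the retry interval applies while the problem is still in a soft state; once it is a hard state, checks go back to the normal check interval.

```python
INTERVAL_LENGTH_S = 60  # assumed Nagios default: one interval unit = 60 seconds


def check_timing(check_interval, retry_interval, max_check_attempts):
    """Return rough best-case timings in seconds, ignoring scheduler latency."""
    # After the first failed check, the remaining attempts run at
    # retry_interval until max_check_attempts is reached (HARD state).
    soft_to_hard = (max_check_attempts - 1) * retry_interval * INTERVAL_LENGTH_S
    # Once the problem is HARD, checks return to the normal check_interval,
    # so a recovery should be noticed within about one check_interval.
    recovery_upper_bound = check_interval * INTERVAL_LENGTH_S
    return {
        "soft_to_hard_s": soft_to_hard,
        "recovery_upper_bound_s": recovery_upper_bound,
    }


# Placeholder values, not taken from any real configuration
timing = check_timing(check_interval=5, retry_interval=1, max_check_attempts=5)
print(timing)
```

With intervals in this range, either figure is a matter of minutes, so a recovery that takes hours would point at something other than the interval settings alone.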

The other cause I've seen for long recoveries is when the host or service is flapping: you don't get the recovery until it settles down (which could take hours).
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Recovery Latency

Post by jbennett »

Here's how the person before me set it up:

The actual host in Nagios doesn't have any check or alert settings. The host has a template assigned to it: xiwizard_genericnetdevice_host

When I go to that template, I see the following:

Additional Templates - xiwizard_generic_host
Under check settings:
  • Check period - 24x7
  • Freshness threshold - 1800
  • Event Handler - host-notify-by-email
When I go to the xiwizard_generic_host template, I see the following:
Additional Templates - none
Under check settings:
  • Max. Check attempts - 5
  • Retry interval - 1
  • Check interval - 5
  • Event handler - host-notify-by-email
When these items are still showing down, they aren't showing as flapping. They also don't alert via email as flapping. They just still show down.
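To make the chained setup above easier to reason about, here is a minimal sketch of how the effective settings could be resolved through the two templates. It only models the general Nagios rule that locally set values override inherited ones and that earlier-listed templates take precedence; it is not how Nagios XI implements it. The directive names follow Nagios object configuration spelling, the values are copied from the lists above, and the host name `camera-01` is hypothetical.

```python
# Settings copied from the lists above. "camera-01" is a hypothetical host name.
TEMPLATES = {
    "xiwizard_generic_host": {
        "use": [],
        "max_check_attempts": 5,
        "retry_interval": 1,
        "check_interval": 5,
        "event_handler": "host-notify-by-email",
    },
    "xiwizard_genericnetdevice_host": {
        "use": ["xiwizard_generic_host"],
        "check_period": "24x7",
        "freshness_threshold": 1800,
        "event_handler": "host-notify-by-email",
    },
}


def resolve(obj, templates):
    """Merge an object's own settings over those inherited from its templates."""
    effective = {}
    # Apply templates in reverse so that earlier-listed templates win
    for name in reversed(obj.get("use", [])):
        effective.update(resolve(templates[name], templates))
    # Values set directly on the object override anything inherited
    effective.update({k: v for k, v in obj.items() if k != "use"})
    return effective


host = {"host_name": "camera-01", "use": ["xiwizard_genericnetdevice_host"]}
for key, value in sorted(resolve(host, TEMPLATES).items()):
    print(f"{key} = {value}")
```

Under these assumptions the host still ends up with a check interval of 5, a retry interval of 1, and 5 max check attempts, even though none of them are set on the host or on the first template.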
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Recovery Latency

Post by scottwilkerson »

A host or service only goes into a flapping state under certain conditions. Additionally, do you have flap detection enabled?
Former Nagios employee
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Recovery Latency

Post by jbennett »

Yes, flapping detection is enabled on the xiwizard_generic_host template, but not on the xiwizard_genericnetdevice_host template:

Check Settings:
  • Flap detection enabled - On
  • Retain status information - On
  • Retain non-status information - On
  • Process perf data - On
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Recovery Latency

Post by scottwilkerson »

Are you sure the host was in a flapping state (UP DOWN UP DOWN UP DOWN etc), or is it just down?
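The difference between the two cases can be illustrated with a simplified percent-state-change calculation. This is only a sketch: Nagios weights recent state changes more heavily than older ones, which this unweighted version does not do, and the threshold value and function name here are assumptions chosen for illustration rather than values from any real configuration.

```python
def percent_state_change(history):
    """Unweighted share of transitions between consecutive check results."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return 100.0 * changes / (len(history) - 1)


HIGH_FLAP_THRESHOLD = 20.0  # assumed, illustrative threshold (percent)

alternating = ["UP", "DOWN"] * 10 + ["UP"]  # 21 results, changes on every check
steady_down = ["UP"] + ["DOWN"] * 20        # one transition, then stays DOWN

for label, history in (("alternating", alternating), ("steady down", steady_down)):
    pct = percent_state_change(history)
    verdict = "flapping" if pct > HIGH_FLAP_THRESHOLD else "not flapping"
    print(f"{label}: {pct:.1f}% state change, {verdict}")
```

A host that went down once and stayed down shows very little state change, so it would not be treated as flapping under this kind of calculation.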
455157
Posts: 51
Joined: Mon Sep 10, 2012 7:35 pm

Re: Recovery Latency

Post by 455157 »

If you go to the Service Detail for one of the services in question and "Schedule an Immediate Check", does the status remain bad or refresh as OK?
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Recovery Latency

Post by jbennett »

scottwilkerson wrote: Are you sure the host was in a flapping state (UP DOWN UP DOWN UP DOWN etc), or is it just down?
It was just down. It was not showing as flapping in Nagios. For what it's worth, this affects more than one host, and it's been happening for a while now.
455157 wrote: If you go to the Service Detail for one of the services in question and "Schedule an Immediate Check", does the status remain bad or refresh as OK?
In the past, when I've done this, the status hasn't changed, even though I can ping the device from the Nagios server as well as from my desktop.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Recovery Latency

Post by scottwilkerson »

What version of Nagios XI are you using?
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Recovery Latency

Post by jbennett »

Unfortunately, I'm still on Nagios XI 2011R2.3.

Not being able to get out past our proxy has made updating quite difficult.

Is this something that has been improved upon?