
Leading practice method to detect a down server/device

Posted: Thu May 17, 2018 12:48 pm
by awilson
Hi. This installation of Nagios was the first for our team. When we built out the hosts and services, we configured ping for the host checks and also included it in the service checks for all devices and servers. We eventually learned that Nagios XI has smart methods to determine whether a server or device is still up by issuing a ping if it hasn't heard from the server in a while. We stopped adding the host checks, but still use the pings in the service checks. We have moved to measuring availability by confirming that a TCP port is working and that we can speak with the remote agent. We get a sense of latency from the TCP port checks and confirm that the server is at least working well enough for the agent to say that it is up.

Our service desk and operations people will sometimes disable the ping checks if a particular server/device consistently has ping timeouts while all of the other service checks and applications are operating well. They call them "false alerts" and resent having to follow procedure for a down device incident because Nagios is "broken." //fun times ... smile

All of that to ask, what are you really smart people doing to determine whether a device is up or down? Have you abandoned the notion of up/down and moved to some type of scale of overall service impairment by analyzing the complete state of the system? Are you using periodic "tcp ping" checks that confirm that a service in the system is able to respond? ... etc
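To make the "tcp ping" idea concrete, here is a minimal sketch of such a check in Python. It is not the Nagios check_tcp plugin, only an illustration of the same idea: complete a TCP handshake and time it. The names `tcp_check` and `TcpCheckResult` and the default timeout are assumptions made up for this example.

```python
import socket
import time
from typing import NamedTuple, Optional


class TcpCheckResult(NamedTuple):
    up: bool                     # True if the TCP handshake completed
    latency_ms: Optional[float]  # connect time in milliseconds, None on failure
    detail: str                  # human-readable summary


def tcp_check(host: str, port: int, timeout: float = 5.0) -> TcpCheckResult:
    """Hypothetical helper: try to open a TCP connection and time the handshake."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            latency_ms = (time.monotonic() - start) * 1000.0
            return TcpCheckResult(True, latency_ms, f"connected to {host}:{port}")
    except OSError as exc:
        # Covers refused connections, timeouts, and name resolution failures.
        return TcpCheckResult(False, None, f"{host}:{port} unreachable: {exc}")
```

Calling something like `tcp_check("server.example.com", 22)` (a placeholder host and port) would return whether the port answered and roughly how long the handshake took. Unlike ICMP echo, this only succeeds if the service itself is listening, which is closer to what end users care about.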

Should we just stop using ping in checks?

Thanks!
Alan

Re: Leading practice method to detect a down server/device

Posted: Thu May 17, 2018 3:15 pm
by npolovenko
Hi, @awilson. One way to prevent ping checks from timing out is to add a timeout value at the end of the command.

 -t, --timeout=INTEGER
    Seconds before connection times out (default: 10)
Maybe -t 30?

Another way to deal with this kind of issue is to set a Retry Interval.
[Attached screenshot: Untitled.png]
So the settings on the screenshot mean the following:
Perform this check every 5 minutes. If the check returns "Critical", recheck it 4 more times before sending an email notification.
This prevents false alerts: Nagios first confirms (4 times) that the check is actually critical before notifying administrators.
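The retry behaviour described above can be sketched as a small state machine. This is a simplified model, not Nagios's actual implementation: it assumes a maximum of 5 check attempts (the first failure plus 4 retries), sends one notification when the problem is confirmed and one on recovery, and ignores repeat notifications. The function name `evaluate_checks` and the state labels are made up for this example.

```python
from typing import Iterable, List, Tuple


def evaluate_checks(results: Iterable[bool],
                    max_attempts: int = 5) -> List[Tuple[str, bool]]:
    """Simplified model: map each check result (True = OK) to (state, notify).

    A failure stays "SOFT" until max_attempts consecutive failures have been
    seen, then becomes "HARD" and triggers a single notification.
    """
    out: List[Tuple[str, bool]] = []
    failures = 0       # consecutive failed checks so far
    hard_down = False  # True once the problem has been confirmed
    for ok in results:
        if ok:
            # Send a recovery notification only if a problem was confirmed.
            out.append(("OK", hard_down))
            failures = 0
            hard_down = False
        else:
            failures += 1
            if failures >= max_attempts:
                # Notify once, at the moment the state becomes HARD.
                out.append(("HARD", not hard_down))
                hard_down = True
            else:
                out.append(("SOFT", False))
    return out
```

With these assumptions, a single dropped ping followed by a good one never produces a notification, while five failures in a row do. The trade-off is the one raised later in the thread: the more retries you allow, the longer a real outage goes unreported.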

Re: Leading practice method to detect a down server/device

Posted: Thu May 17, 2018 4:14 pm
by awilson
@npolovenko Thank you for the reply. We are using the timeout parameter in the check to extend the time. 30 or 60 seconds has not been a reliable solution across the board. It works for some and not for others. We've checked the performance logs of the switches between the Nagios servers and the remote server. We have varied the retry period and count as well. At some point we become blind to brief service disruptions. I guess if the end user is not affected, no harm no foul.

I guess the definitive answer of whether the server dropped the ICMP message and didn't reply would come from somehow scanning the server's behavior at the network interface and comparing it to what other activity happened at the same time.

Thanks!

Re: Leading practice method to detect a down server/device

Posted: Thu May 17, 2018 4:52 pm
by npolovenko
@awilson, if a server doesn't respond to a ping within 30 seconds 4 times in a row, I'd start investigating what could be causing the packet loss. I'd probably run a traceroute (tracert on Windows) or tcpdump to get more information about the network, and do what you suggested.
If the server happens to serve an HTTP web page, we could replace the ping command with a check_http command. So instead of pinging the host, Nagios will try to open a web page.
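As a rough illustration of what an HTTP-based check does, here is a minimal Python sketch. It is not the check_http plugin; the function name `http_check`, its defaults, and the rule that any status below 400 counts as healthy are all assumptions for this example.

```python
import http.client
import time
from typing import Optional, Tuple


def http_check(host: str, port: int = 80, path: str = "/",
               timeout: float = 10.0) -> Tuple[bool, Optional[int], float]:
    """Hypothetical helper: GET a page and return (ok, status, elapsed_seconds)."""
    start = time.monotonic()
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status: Optional[int] = conn.getresponse().status
        # Assumption: treat any status below 400 as a healthy response.
        ok = status is not None and status < 400
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        status, ok = None, False
    finally:
        conn.close()
    return ok, status, time.monotonic() - start
```

A check like this confirms that the web server is accepting connections and answering requests, so it can report a problem even when the host still replies to ping, and it keeps working when ICMP is dropped somewhere along the path.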
By the way, what kind of servers are we talking about? Linux or Windows? Are you currently monitoring any other parameters on these servers or just 1 ping check for each server?

Re: Leading practice method to detect a down server/device

Posted: Fri May 18, 2018 10:48 am
by awilson
We are managing Linux, Windows, and AIX servers. I'll do some work on definitively identifying the cause of the packet loss. I found a Cisco article that references using mtr along with the other tools mentioned to get a more complete picture of where the loss occurs.

I guess ping is still the reigning connectivity verification champion ... //smile

Thanks. You can close this.

Alan

Re: Leading practice method to detect a down server/device

Posted: Fri May 18, 2018 11:10 am
by npolovenko
@awilson, thanks for the update, Alan! I will be closing this thread, but feel free to open a new one if needed.