Leading practice method to detect a down server/device
Posted: Thu May 17, 2018 12:48 pm
Hi. This installation of Nagios was the first for our team. When we built out the hosts and services, we configured ping for the host checks and also included it in the service checks for all devices and servers. We eventually learned that Nagios XI has smart methods to determine whether a server or device is still up: it issues a ping if it hasn't heard from the server in a while. We stopped adding the host checks, but still use the pings in the service checks. We've moved to measuring availability by confirming that a TCP port is working and that we can speak with the remote agent. We get a sense of latency from the TCP port checks, and the agent check confirms that the server is at least working well enough for the agent to say that it is up.
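To make the TCP port check concrete, here is a minimal sketch of that kind of check: time a TCP connect and map the result to the usual Nagios plugin states. The function names and the warning/critical thresholds are hypothetical and only for illustration. In practice the stock check_tcp plugin covers this, so this is not a description of how Nagios XI does it internally.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(latency_s, warn_s, crit_s):
    """Map a connect latency (None means no connection) to a Nagios state."""
    if latency_s is None or latency_s >= crit_s:
        return CRITICAL
    if latency_s >= warn_s:
        return WARNING
    return OK


def tcp_latency(host, port, timeout_s=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return None


def check(host, port, warn_s=0.5, crit_s=2.0):
    """Return (state, message); the thresholds are illustrative assumptions."""
    latency = tcp_latency(host, port)
    state = classify(latency, warn_s, crit_s)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[state]
    if latency is None:
        detail = "no connection"
    else:
        detail = f"{latency:.3f}s connect time"
    return state, f"TCP {label} - {host}:{port} {detail}"
```

A plugin wrapper would print the message and exit with the state code, so a slow but reachable service shows up as WARNING instead of a hard down.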
Our service desk and operations people will sometimes disable the ping checks if a particular server/device consistently has ping timeouts while all of the other service checks and applications are operating well. They call them "false alerts" and resent having to follow procedure for a down device incident because Nagios is "broken." //fun times ... smile
All of that to ask: what are you really smart people doing to determine whether a device is up or down? Have you abandoned the notion of up/down and moved to some type of scale of overall service impairment by analyzing the complete state of the system? Are you using periodic "TCP ping" checks that confirm that a service in the system is able to respond? ... etc.
Should we just stop using ping in checks?
Thanks!
Alan