Discussion about warning and critical for host alive checks

Post by **Box293** » Thu May 14, 2015 11:15 pm

For my box293_check_vmware plugin I've been working on adding a "host alive" check to replace the standard ping check. In my scenario it needs to report an UP status when hosts are in Standby Mode, as this is normal and expected, and the standard ping check does not accommodate this.

Anyways, as I've been coding away I've been reading different documentation about the exit code, specifically when used with a host object. This documentation dictates:
http://nagios.sourceforge.net/docs/3_0/ ... .html#host

check_command: This directive is used to specify the short name of the command that should be used to check if the host is up or down. Typically, this command would try and ping the host to see if it is "alive". The command must return a status of OK (0) or Nagios will assume the host is down.

So for an UP or STANDBY state I will return an exit code of 0. For a DOWN state I would return an exit code of 1, 2 or 3, or really any number that is not 0.

With SERVICES, 1=Warning, 2=Critical and 3=Unknown.
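To make the mapping concrete, here is a minimal Python sketch of how a host alive plugin could turn a power state into an exit code. The function names and the `UP_STATES` set are hypothetical and only for illustration; they are not taken from box293_check_vmware. It returns 2 for DOWN rather than 1, on the assumption that, as far as I recall from the Core documentation, a return code of 1 on a host check can be treated as UP when aggressive host checking is not enabled.

```python
# Hypothetical sketch: map a host power state to a Nagios plugin exit code.
# Names below are illustrative, not part of any real plugin.

# Standard plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# States that should be reported as UP (taken from the post above).
UP_STATES = {"UP", "STANDBY"}


def exit_code_for_host_state(state: str) -> int:
    """Return 0 for states treated as UP, and 2 (CRITICAL) for anything else."""
    return OK if state.strip().upper() in UP_STATES else CRITICAL


def report(state: str) -> int:
    """Print a one-line status message and return the exit code to use."""
    code = exit_code_for_host_state(state)
    label = "UP" if code == OK else "DOWN"
    print(f"{label} - host power state is {state}")
    return code
```

A real plugin would finish with something like `sys.exit(report(state))`, where `state` comes from whatever query the plugin performs.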

So this got me thinking about the ping check used to determine if a host is up or down. I had a look at a Core 4.0.8 box I have here and it provides some sample check commands, for example:


define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }

In relation to a host UP or DOWN check, what is the point of a warning AND a critical threshold? It's either up OR down. For host checks we should only be using one or the other. Technically there perhaps should be a -d argument instead of -w or -c, as an output message that includes WARNING or CRITICAL is not correct for an Up/Down check.

I know from a performance data perspective we need warning and critical, but once again in relation to a host object there is no warning or critical.

When you look at the Tactical Overview screen, for hosts there is no Warning or Critical grouping, it's Down, Unreachable, Up and Pending.

Which makes me think some more. In relation to the Unreachable grouping, does this only occur from within the Core logic in relation to parents, or is there an exit code you can use that defines the Unreachable state?

Like the topic says, this is just a discussion, if people are interested in participating.

Post by **millisa** » Mon May 18, 2015 8:03 pm

I'm not sure it changes your points at all, but the warning/unknown states on a host alive check count towards whether a host is flapping. I could see the warning/unknown values still being wanted for a host alive check to try to pick up those states. A host that goes ok-warn-ok-warn-ok-warn and never actually goes down completely might pass a simple pass/fail check, and would never go into a flap state if we didn't have the warning thresholds.
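As a rough illustration of the flapping point, the sketch below counts how often consecutive check results differ. This is a deliberately simplified, unweighted calculation with hypothetical function names; as far as I understand it, Core's real flap detection looks at a fixed window of recent checks and weights newer state changes more heavily, so treat the numbers here as indicative only.

```python
# Simplified, unweighted sketch of a "percent state change" calculation.
# This is NOT the exact algorithm Nagios Core uses; names are illustrative.

def percent_state_change(states):
    """Percentage of consecutive result pairs whose state differs."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


def collapse_to_up_down(states):
    """Model a pure pass/fail check: only CRITICAL counts as DOWN."""
    return ["DOWN" if s == "CRITICAL" else "UP" for s in states]


# The ok-warn-ok-warn pattern described above.
history = ["OK", "WARNING"] * 3

with_warning = percent_state_change(history)
pass_fail_only = percent_state_change(collapse_to_up_down(history))
print(with_warning, pass_fail_only)
```

With the warning results kept, every check is a state change; once they are collapsed into a plain up/down result, the same history shows no changes at all, which is why such a host would never be flagged as flapping.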

Post by **Box293** » Tue May 19, 2015 1:50 am

Nice points about the flapping. I went and did some reading on it and now have a more detailed understanding of how it works. It's clear a lot of logic has been built into Nagios.

Thanks for the input, good discussion point.
