check_freshness sends false positive alerts
Posted: Thu Jul 26, 2018 12:30 pm
I've been monitoring a network with passive checks in Nagios Core. Some days ago there was a power outage, and since the machine that actively checks the network was down as well, I got no "host down" alerts. That almost caused me real trouble with clients. Since then, I've been trying to detect these power outages with the check_freshness option, which has been acting weirdly (or I've misunderstood how it works).
I'm using check_freshness on the service Uptime, which is checked every minute. If I set freshness_threshold = 5, I don't get a critical after 5 seconds without a new check result; instead, my Uptime service stays critical most of the time, displaying the message "(No output on stdout) stderr: execvp(a, ...) failed. errno is 2: No such file or directory". I guessed the threshold was too short, so I increased it to 70. Now my Uptime service changes to critical after the machine has been down for 2 to 3 minutes, but sometimes it goes critical (and consequently alerts me by email) for no apparent reason. The message displayed is always the same, whether the machine is really down or it's a false positive.
Am I doing it wrong? How can I set up the passive check so that it goes critical whenever the Nagios machine on the network is also down, with as little delay as possible? Thank you.
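To make the timing in the question concrete, here is a minimal sketch of the arithmetic, assuming a result simply counts as stale once it is older than the threshold. This is a simplified model and not how Nagios Core actually implements freshness checking; the function names are hypothetical, and the 15-second delay in the last line is an illustrative value, not something measured on this setup.

```python
def is_stale(age_seconds: float, freshness_threshold: float) -> bool:
    """Simplified model: a result is stale once it is older than the threshold."""
    return age_seconds > freshness_threshold


def stale_seconds_per_cycle(result_interval: int, freshness_threshold: int) -> int:
    """Count the whole seconds in one reporting cycle during which the
    most recent result is already considered stale."""
    return sum(
        1
        for age in range(1, result_interval + 1)
        if is_stale(age, freshness_threshold)
    )


# Results arrive every 60 s, threshold of 5 s: stale for most of each cycle.
print(stale_seconds_per_cycle(60, 5))   # 55

# Threshold of 70 s: never stale as long as results arrive exactly on time.
print(stale_seconds_per_cycle(60, 70))  # 0

# A result that arrives 15 s late (assumed delay) is 75 s after the previous one.
print(is_stale(60 + 15, 70))            # True
```

Under this model, a threshold shorter than the one-minute reporting interval leaves the service stale most of the time, and a threshold of 70 leaves only 10 seconds of margin for a late result, which would be consistent with both symptoms described above.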