Suppressing "timed out" alerts

Posted: Thu Apr 10, 2014 8:53 am
by cunningrat
We have an issue in a client environment. They get CPU spikes, which cause Nagios checks to fail with the message "Plugin timed out while executing system call". I've increased the timeout value as much as I am comfortable with, but the failures still occur.

I am aware that fixing the server is the preferred solution, but as I said, that's a client environment, so fixing the server may not happen.

Is there a way to make Nagios suppress the alert if the message says "Plugin timed out"?

Re: Suppressing "timed out" alerts

Posted: Thu Apr 10, 2014 11:12 am
by abrist
How is this check performed? Can you post the full check command? Depending on the plugin, you may be able to set an option to do so, or create a wrapper script.

Re: Suppressing "timed out" alerts

Posted: Thu Apr 10, 2014 11:28 am
by cunningrat
abrist wrote: How is this check performed? Can you post the full check command? Depending on the plugin, you may be able to set an option to do so, or create a wrapper script.
All of the checks are performed via check_by_ssh: the plugins hit on the client side are mostly default Nagios ones, or home-grown perl scripts.
Here's a representative example:
$USER1$/check_by_ssh -H $HOSTADDRESS$ -t 45 -C "/home/nagios/scripts/ready/check_disk -w 10% -c 5% -W 40% -K 30% -p /exe_prd/temp"

I didn't find any appropriate flags in the check_by_ssh documentation.

Re: Suppressing "timed out" alerts

Posted: Thu Apr 10, 2014 1:03 pm
by cunningrat
I saw the -u flag in check_nrpe. Pity I'm not using check_nrpe.

I'm going to go post in the suggestions forum about adding a flag with that functionality to check_by_ssh. :)

Re: Suppressing "timed out" alerts

Posted: Fri Apr 11, 2014 9:54 am
by abrist
cunningrat wrote: I'm going to go post in the suggestions forum about adding a flag with that functionality to check_by_ssh. :)
For now, you may just want to create a wrapper script that runs your check and saves the exit code and status/perf string. Check for "CRITICAL - Plugin timed out after" in the status string; if it matches, replace "CRITICAL" with "WARNING", "UNKNOWN" or "OK" and exit with the corresponding exit code. Otherwise, just return the status/perf string and keep the original exit code. For example:
Command:

$USER1$/check_by_ssh_custom.sh "$HOSTADDRESS$" $ARG1$ "$ARG2$"
check_by_ssh_custom.sh:

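#!/bin/bash
# Wrapper around check_by_ssh that reports plugin timeouts as UNKNOWN.
HOST=$1
TIMEOUT=$2
COMMAND=$3

# Run the real check, capturing its output and its exit code.
OUTPUT=$(/usr/local/nagios/libexec/check_by_ssh -H "$HOST" -t "$TIMEOUT" -C "$COMMAND")
EXIT=$?
if echo "$OUTPUT" | grep -q "CRITICAL - Plugin timed out after"; then
    # Timeout: relabel the state and exit 3 (UNKNOWN).
    echo "$OUTPUT" | sed 's/CRITICAL/UNKNOWN/g'
    exit 3
else
    # Anything else: pass the output and exit code through unchanged.
    echo "$OUTPUT"
    exit "$EXIT"
fi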
Note: The above is just an example; I have not even tested this script.
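
If you want to try the remapping logic without touching a remote host, here is a rough sketch of the same idea as a bash function. The names remap_timeout, fake_timeout and fake_ok are made up for illustration and are not part of Nagios or its plugins; the two fake_* functions only stand in for a real check. It assumes the timeout message contains "Plugin timed out", as in the error quoted at the start of this thread.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nagios): run any check command and
# downgrade a plugin timeout from CRITICAL to UNKNOWN.
remap_timeout() {
    local output status
    output=$("$@")      # run the check exactly as given, capture stdout
    status=$?           # exit code of the check itself
    case "$output" in
        *"Plugin timed out"*)
            # Rewrite the first "CRITICAL" and report UNKNOWN (exit 3).
            echo "${output/CRITICAL/UNKNOWN}"
            return 3
            ;;
    esac
    # Not a timeout: pass output and exit code through unchanged.
    echo "$output"
    return "$status"
}

# Stand-ins for real plugins, for local experimentation only.
fake_timeout() { echo "CRITICAL - Plugin timed out after 45 seconds"; return 2; }
fake_ok()      { echo "DISK OK - free space: /tmp 900 MB"; return 0; }

remap_timeout fake_timeout; echo "exit=$?"
remap_timeout fake_ok;      echo "exit=$?"
```

In a real wrapper you would call something like remap_timeout /usr/local/nagios/libexec/check_by_ssh -H "$HOST" -t "$TIMEOUT" -C "$COMMAND" and then exit with its return value. Keep in mind that this hides genuine timeouts as well, so you may still want notifications on UNKNOWN for these services.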

Re: Suppressing "timed out" alerts

Posted: Mon Apr 14, 2014 3:10 pm
by cunningrat
I'll try that, abrist. Thanks!

Re: Suppressing "timed out" alerts

Posted: Mon Apr 14, 2014 4:18 pm
by abrist
Alright! Let me know if you have issues. Happy scripting!