I have been working on creating a modified version of the check_gputemp (http://exchange.nagios.org/directory/Pl ... mp/details) plugin but I am having problems with the status information it is returning.
(I am modifying check_gputemp because the original had problems running the aticonfig utility outside of the GUI session of the logged-in user. I now have a screen session that runs a script to collect the temperature every 5 minutes and write the output to a text file. My modified check_gputemp retrieves the temperature from that text file. A formatting script is then supposed to format the output so it can be used by nagios and nagiosgraph.)
The nagios server is able to run the NRPE command on the client successfully but only shows the following status information:
GPU0 OK: degrees
For some reason the temperature is being stripped out of the status.
This is returned when the script is run manually and what I am hoping to see from the server:
./check_ati-gpu-temp --adapter 0 -w 70 -c 90
GPU0 OK: 67 degrees
Modified version of the script (check_ati-gpu-temp):
#!/bin/bash
# Script to scrape text file for the GPU temperature
# Modified Version of check_gputemp plugin Version 1.3 by Jack-Benny Persson ([email protected])
#/usr/bin/aticonfig --adapter=0 --od-gettemperature | grep "Temperature" | awk '{print $5}' | cut -c1-2
VERSION="Version 1.0"
AUTHOR="Original by Jack-Benny Persson ([email protected]) - Minor Changes by h3x"
# Exit codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
shopt -s extglob
#### Functions ####
# Print version information
print_version()
{
    printf "\n\n$0 - $VERSION\n"
}
# Print help information
print_help()
{
    print_version
    printf "$AUTHOR\n"
    printf "Monitor GPU temperature with the use of aticonfig (fglrx)\n"
    /bin/cat <<EOT
Options:
-h
   Print detailed help screen
-V
   Print version information
-v
   Verbose output
--adapter NUM
   Set which GPU adapter to monitor, for example 0 or 1. Default is 0
-w INTEGER
   Exit with WARNING status if above INTEGER degrees
-c INTEGER
   Exit with CRITICAL status if above INTEGER degrees
EOT
}
###### MAIN ########
# Temperature File (Contains Temperature output from check_temp.sh)
temp_file=/home/h3x/gpu_temps.txt
# Warning threshold
thresh_warn=
# Critical threshold
thresh_crit=
# Hardware to monitor
adapter=
# Parse command line options
while [[ -n "$1" ]]; do
    case "$1" in
        -h | --help)
            print_help
            exit $STATE_OK
            ;;
        -V | --version)
            print_version
            exit $STATE_OK
            ;;
        -v | --verbose)
            : $(( verbosity++ ))
            shift
            ;;
        -w | --warning)
            if [[ -z "$2" ]]; then
                # Threshold not provided
                printf "\nOption $1 requires an argument"
                print_help
                exit $STATE_UNKNOWN
            elif [[ "$2" = +([0-9]) ]]; then
                # Threshold is an integer
                thresh=$2
            else
                # Threshold is not an integer
                printf "\nThreshold must be an integer"
                print_help
                exit $STATE_UNKNOWN
            fi
            thresh_warn=$thresh
            shift 2
            ;;
        -c | --critical)
            if [[ -z "$2" ]]; then
                # Threshold not provided
                printf "\nOption '$1' requires an argument"
                print_help
                exit $STATE_UNKNOWN
            elif [[ "$2" = +([0-9]) ]]; then
                # Threshold is an integer
                thresh=$2
            else
                # Threshold is not an integer
                printf "\nThreshold must be an integer"
                print_help
                exit $STATE_UNKNOWN
            fi
            thresh_crit=$thresh
            shift 2
            ;;
        -\?)
            print_help
            exit $STATE_OK
            ;;
        --adapter)
            if [[ -z "$2" ]]; then
                printf "\nOption $1 requires an argument"
                print_help
                exit $STATE_UNKNOWN
            fi
            adapter=$2
            shift 2
            ;;
        *)
            printf "\nInvalid option '$1'"
            print_help
            exit $STATE_UNKNOWN
            ;;
    esac
done
# Check that an adapter was specified
if [[ -z "$adapter" ]]; then
    # No adapter to monitor was specified
    printf "\nNo sensor specified"
    print_help
    exit $STATE_UNKNOWN
fi
# Get the temperature (first two characters of the third field on the adapter's line)
TEMP=$(grep "Adapter ${adapter}" "$temp_file" | awk '{print $3}' | cut -c1-2)
# Check that the thresholds have been set correctly
if [[ -z "$thresh_warn" || -z "$thresh_crit" ]]; then
    # One or both thresholds were not specified
    printf "\nThreshold not set"
    print_help
    exit $STATE_UNKNOWN
elif [[ "$thresh_crit" -lt "$thresh_warn" ]]; then
    # The warning threshold must be lower than the critical threshold
    printf "\nWarning temperature should be lower than critical"
    print_help
    exit $STATE_UNKNOWN
fi
# Verbose output
if [[ "$verbosity" -ge 2 ]]; then
    /bin/cat <<__EOT
Debugging information:
Warning threshold: $thresh_warn
Critical threshold: $thresh_crit
Verbosity level: $verbosity
Current GPU $adapter temperature: $TEMP
__EOT
    printf "\n\n"
fi
# Form data for nagiosgraphing
#PERFDATA="temperature=${TEMP}"
# And finally check the temperature against our thresholds
if [[ "$TEMP" -gt "$thresh_crit" ]]; then
    # Temperature is above critical threshold
    echo "GPU$adapter CRITICAL: $TEMP degrees"
    exit $STATE_CRITICAL
elif [[ "$TEMP" -gt "$thresh_warn" ]]; then
    # Temperature is above warning threshold
    echo "GPU$adapter WARNING: $TEMP degrees"
    exit $STATE_WARNING
else
    # Temperature is ok
    echo "GPU$adapter OK: $TEMP degrees"
    exit $STATE_OK
fi
exit 3
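The final comparison in the script treats an empty $TEMP as zero, so a missing reading still reports OK with no number in it. One possible cause under NRPE is that the plugin runs as the NRPE daemon's user, which may not be able to read a file under /home/h3x. Below is a minimal sketch of a guard for that case. It assumes the text file holds lines such as "Adapter 0 67.00 C" (a guess based on the grep/awk/cut pipeline, since the file itself is not shown), and `read_temp` and `report_temp` are hypothetical helper names, not part of the original plugin.

```shell
#!/bin/bash
# Sketch only: read_temp and report_temp are hypothetical helpers.
# Assumed file format (not confirmed): "Adapter 0 67.00 C", one line per adapter.

STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Print the reading for one adapter using the same grep/awk/cut pipeline
# as the script above. Returns non-zero if the file is not readable by
# the current user or the result is not a plain integer.
read_temp() {
    local file=$1 adapter=$2 temp
    [ -r "$file" ] || return 1
    temp=$(grep "Adapter ${adapter}" "$file" | awk '{print $3}' | cut -c1-2)
    case "$temp" in
        ''|*[!0-9]*) return 1 ;;   # empty, multi-line, or non-numeric
    esac
    printf '%s\n' "$temp"
}

# Print a Nagios status line and return the matching state code.
report_temp() {
    local file=$1 adapter=$2 warn=$3 crit=$4 temp
    if ! temp=$(read_temp "$file" "$adapter"); then
        echo "GPU$adapter UNKNOWN: no numeric reading in $file"
        return $STATE_UNKNOWN
    fi
    if [ "$temp" -gt "$crit" ]; then
        echo "GPU$adapter CRITICAL: $temp degrees"
        return $STATE_CRITICAL
    elif [ "$temp" -gt "$warn" ]; then
        echo "GPU$adapter WARNING: $temp degrees"
        return $STATE_WARNING
    fi
    echo "GPU$adapter OK: $temp degrees"
    return $STATE_OK
}
```

In the plugin this could replace the last block with something like `report_temp "$temp_file" "$adapter" "$thresh_warn" "$thresh_crit"; exit $?`, so an unreadable or empty file shows up as UNKNOWN instead of an OK with no temperature.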
Separate formatting script (check_ati-gpu-temp-graph):
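#!/bin/bash
# Run the plugin, keeping its status line and exit code
LINE=$(/usr/lib/nagios/plugins/check_ati-gpu-temp "$@")
RC=$?
# The third word of the status line is the temperature
COUNT=$(echo $LINE | awk '{print $3}')
DEGREES=$(expr $COUNT - 1)
LINE=$(echo $LINE | sed "s/: $COUNT /: $DEGREES /")
# Append perfdata for nagiosgraph
echo $LINE \| degrees=$DEGREES
exit $RC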
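If the plugin prints no temperature, the third word of its status line is "degrees". `expr` cannot subtract from that, and the perfdata would come out as "degrees=" with no value, which may be why nagiosgraph has nothing to plot. A minimal sketch of a wrapper step that only appends perfdata when the reading is an integer is shown here. `add_perfdata` is a hypothetical helper name, and it reports the reading unchanged rather than subtracting one as the script above does.

```shell
#!/bin/bash
# Sketch only: add_perfdata is a hypothetical helper name.

# Append "| degrees=N" to a plugin status line, but only when the third
# word is an integer; otherwise pass the line through unchanged.
add_perfdata() {
    local line=$1 value
    value=$(printf '%s\n' "$line" | awk '{print $3}')
    case "$value" in
        ''|*[!0-9]*) printf '%s\n' "$line" ;;
        *)           printf '%s | degrees=%s\n' "$line" "$value" ;;
    esac
}

# Possible use inside the wrapper, keeping the plugin's exit code:
#   LINE=$(/usr/lib/nagios/plugins/check_ati-gpu-temp "$@")
#   RC=$?
#   add_perfdata "$LINE"
#   exit $RC
```

With this, a status line without a number is passed to nagios as-is, so the missing reading stays visible instead of turning into empty perfdata.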
Output when run manually:
./check_ati-gpu-temp-graph --adapter 0 -w 70 -c 90
GPU0 OK: 65 degrees | degrees=65
When I try to view the graph for the GPU Temperature service:
No Data Available: <host> service=GPU 1 Temperature db=
This is the applicable part of the nrpe.cfg file on the client:
command[check_atigputemp0]=/usr/lib/nagios/plugins/check_ati-gpu-temp-graph --adapter 0 -w 75 -c 90
command[check_atigputemp1]=/usr/lib/nagios/plugins/check_ati-gpu-temp-graph --adapter 1 -w 75 -c 90
Service definition on the nagios server:
define service{
        use                     generic-service,graphed-service ; Name of service template to use
        host_name               <host>
        service_description     GPU 0 Temperature
        check_command           check_nrpe!check_atigputemp0
        }