Nagiosgraph - check_gputemp not returning status correctly

h3x
Posts: 5
Joined: Thu Apr 25, 2013 1:55 pm

Nagiosgraph - check_gputemp not returning status correctly

Post by h3x »

Hi all!

I have been working on creating a modified version of the check_gputemp (http://exchange.nagios.org/directory/Pl ... mp/details) plugin but I am having problems with the status information it is returning.

(I am modifying check_gputemp because the original could not run the aticonfig utility outside of the logged-in user's GUI session. I now have a screen session that runs a script to collect the temperature every 5 minutes and write the output to a text file. My modified check_gputemp retrieves the temperature from that text file. A formatting script is then supposed to format the output so it can be used by Nagios and nagiosgraph.)

The nagios server is able to run the NRPE command on the client successfully but only shows the following status information:
GPU0 OK: degrees
For some reason the temperature is being stripped out of the status.

This is returned when the script is run manually and what I am hoping to see from the server:
./check_ati-gpu-temp --adapter 0 -w 70 -c 90
GPU0 OK: 67 degrees
Modified version of the script (check_ati-gpu-temp):

Code: Select all

#!/bin/bash

# Script to scrape text file for the GPU temperature
# Modified version of check_gputemp plugin Version 1.3 by Jack-Benny Persson ([email protected])

#/usr/bin/aticonfig --adapter=0 --od-gettemperature | grep "Temperature" | awk '{print $5}' | cut -c1-2

VERSION="Version 1.0"
AUTHOR="Original by Jack-Benny Persson ([email protected]) - Minor Changes by h3x"

# Exit codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

shopt -s extglob

#### Functions ####

# Print version information
print_version()
{
        printf "\n\n$0 - $VERSION\n"
}

#Print help information
print_help()
{
        print_version
        printf "$AUTHOR\n"
        printf "Monitor GPU temperature with the use of aticonfig (fglrx)\n"
/bin/cat <<EOT

Options:
-h
   Print detailed help screen
-V
   Print version information
-v
   Verbose output

--adapter NUM
   Set which GPU adapter to monitor, for example 0 or 1. Default is 0

-w INTEGER
   Exit with WARNING status if above INTEGER degrees
-c INTEGER
   Exit with CRITICAL status if above INTEGER degrees
EOT
}

###### MAIN ########

# Temperature File (Contains Temperature output from check_temp.sh)
temp_file=/home/h3x/gpu_temps.txt
# Warning threshold
thresh_warn=
# Critical threshold
thresh_crit=
# Hardware to monitor
adapter=

# Parse command line options
while [[ -n "$1" ]]; do
   case "$1" in

       -h | --help)
           print_help
           exit $STATE_OK
           ;;

       -V | --version)
           print_version
           exit $STATE_OK
           ;;

       -v | --verbose)
           : $(( verbosity++ ))
           shift
           ;;

       -w | --warning)
           if [[ -z "$2" ]]; then
               # Threshold not provided
               printf "\nOption $1 requires an argument"
               print_help
               exit $STATE_UNKNOWN
            elif [[ "$2" = +([0-9]) ]]; then
               # Threshold is an integer
               thresh=$2
            else
               # Threshold is not an integer
               printf "\nThreshold must be an integer"
               print_help
               exit $STATE_UNKNOWN
           fi
           thresh_warn=$thresh
           shift 2
           ;;

       -c | --critical)
           if [[ -z "$2" ]]; then
               # Threshold not provided
               printf "\nOption '$1' requires an argument"
               print_help
               exit $STATE_UNKNOWN
            elif [[ "$2" = +([0-9]) ]]; then
               # Threshold is an integer
               thresh=$2
            else
               # Threshold is not an integer
               printf "\nThreshold must be an integer"
               print_help
               exit $STATE_UNKNOWN
           fi
           thresh_crit=$thresh
           shift 2
           ;;

       -\?)
           print_help
           exit $STATE_OK
           ;;

       --adapter)
           if [[ -z "$2" ]]; then
                printf "\nOption $1 requires an argument"
                print_help
                exit $STATE_UNKNOWN
           fi
                adapter=$2
           shift 2
           ;;

       *)
           printf "\nInvalid option '$1'"
           print_help
           exit $STATE_UNKNOWN
           ;;
   esac
done

# Check that an adapter was specified
if [[ -z "$adapter" ]]; then
        # No adapter to monitor was specified
        printf "\nNo sensor specified"
        print_help
        exit $STATE_UNKNOWN
fi

# Get the temperature
TEMP=`cat /home/h3x/gpu_temps.txt | grep "Adapter ${adapter}" | awk '{print $3}' | cut -c1-2`

# Check if the thresholds have been set correctly
if [[ -z "$thresh_warn" || -z "$thresh_crit" ]]; then
        # One or both thresholds were not specified
        printf "\nThreshold not set"
        print_help
        exit $STATE_UNKNOWN
  elif [[ "$thresh_crit" -lt "$thresh_warn" ]]; then
        # The warning threshold must be lower than the critical threshold
        printf "\nWarning temperature should be lower than critical"
        print_help
        exit $STATE_UNKNOWN
fi

# Verbose output
if [[ "$verbosity" -ge 2 ]]; then
   /bin/cat <<__EOT
Debugging information:
  Warning threshold: $thresh_warn
  Critical threshold: $thresh_crit
  Verbosity level: $verbosity
  Current GPU $adapter temperature: $TEMP
__EOT
printf "\n\n"
fi

# Form data for nagiosgraphing
#PERFDATA="temperature=${TEMP}"

# And finally check the temperature against our thresholds
if [[ "$TEMP" -gt "$thresh_crit" ]]; then
        # Temperature is above critical threshold
        echo "GPU$adapter CRITICAL: $TEMP degrees"
        exit $STATE_CRITICAL

  elif [[ "$TEMP" -gt "$thresh_warn" ]]; then
        # Temperature is above warning threshold
        echo "GPU$adapter WARNING: $TEMP degrees"
        exit $STATE_WARNING

  else
        # Temperature is ok
        echo "GPU$adapter OK: $TEMP degrees"
        exit $STATE_OK
fi
exit 3
I have NRPE configured to use a separate script whose purpose is to format the output for use in nagiosgraph. (I found this script through Google while configuring the Total Processes plugin to return values that can be used by nagiosgraph. It works correctly for the Total Processes plugin.)

Separate formatting script (check_ati-gpu-temp-graph):

Code: Select all

#!/bin/bash
LINE=`/usr/lib/nagios/plugins/check_ati-gpu-temp $*`
RC=$?
COUNT=`echo $LINE | awk '{print $3}'`
DEGREES=`expr $COUNT - 1`
LINE=`echo $LINE | sed "s/: $COUNT /: $DEGREES /"`
echo $LINE \| degrees=$DEGREES
exit $RC
This is returned from the formatting script when run manually:
./check_ati-gpu-temp-graph --adapter 0 -w 70 -c 90
GPU0 OK: 65 degrees | degrees=65
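The wrapper above subtracts 1 from the reading before reporting it. That was probably meant for the Total Processes count and would under-report a temperature by one degree (this is an assumption; the thread does not discuss it). A minimal sketch of a wrapper step that leaves the reading untouched and only appends performance data could look like this. The function name append_perfdata is hypothetical, and it assumes the reading is the third whitespace-separated field of the status line, as in "GPU0 OK: 67 degrees".

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts): append Nagios
# performance data to a plugin status line without changing the value.
append_perfdata() {
    line="$1"
    # Assumption: the reading is the third field, e.g. "GPU0 OK: 67 degrees".
    value=$(printf '%s\n' "$line" | awk '{print $3}')
    case "$value" in
        ''|*[!0-9]*)
            # No integer reading found: pass the line through without
            # perfdata and signal UNKNOWN (3) to the caller.
            printf '%s\n' "$line"
            return 3
            ;;
    esac
    printf '%s | degrees=%s\n' "$line" "$value"
}
```

In the wrapper this would replace the expr and sed lines, for example: append_perfdata "$LINE" followed by exit $RC.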
When I try to view the graph for the GPU Temperature service:
No Data Available: <host> service=GPU 1 Temperature db=
This is the applicable part of the nrpe.cfg file on the client:

Code: Select all

command[check_atigputemp0]=/usr/lib/nagios/plugins/check_ati-gpu-temp-graph --adapter 0 -w 75 -c 90
command[check_atigputemp1]=/usr/lib/nagios/plugins/check_ati-gpu-temp-graph --adapter 1 -w 75 -c 90
On the server I have the following service defined:

Code: Select all

define service{
        use                             generic-service,graphed-service         ; Name of service template to use
        host_name                       <host>
        service_description             GPU 0 Temperature
        check_command                   check_nrpe!check_atigputemp0
        }
I believe I have nagiosgraph set up correctly because it is generating graphs correctly for about a dozen other services on various hosts. This probably isn't even remotely the best way to check the temperatures and enable reporting, but I think it should be working, since the scripts return good information when run manually. Any help would be greatly appreciated! Thanks!
gshergill
Posts: 231
Joined: Tue Aug 07, 2012 5:08 am

Re: Nagiosgraph - check_gputemp not returning status correct

Post by gshergill »

Hi h3x,

When you run the command below from the command line from the plugin directory on the Nagios server what is the output?

Code: Select all

./check_nrpe -H <remote server host> -c check_atigputemp0
Thank you.

Kind Regards,

Gary Shergill
h3x
Posts: 5
Joined: Thu Apr 25, 2013 1:55 pm

Re: Nagiosgraph - check_gputemp not returning status correct

Post by h3x »

Hi Gary!

Thanks for the reply! When I run the command on the server I get:
./check_nrpe -H <host_IP> -c check_atigputemp0
GPU0 OK: degrees | degrees=
Looks like that is the problem. Is there a way I need to configure nrpe and/or the server to send/receive the temperature information?
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagiosgraph - check_gputemp not returning status correct

Post by abrist »

Do me a favor: on the remote host, run the plugin as user "nagios" from the CLI.

Code: Select all

su nagios
cd /usr/lib/nagios/plugins/
./check_ati-gpu-temp-graph --adapter 0 -w 75 -c 90
Does this work? If not, it may be because the plugin requires root privileges.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
h3x
Posts: 5
Joined: Thu Apr 25, 2013 1:55 pm

Re: Nagiosgraph - check_gputemp not returning status correct

Post by h3x »

When I ran that command I received this message:
sudo -u nagios ./check_ati-gpu-temp-graph --adapter 0 -w 75 -c 90
cat: /home/h3x/gpu_temps.txt: Permission denied
expr: non-integer argument
GPU0 OK: degrees | degrees=
I moved the file to /tmp/ and changed my scripts to point to that location and the nagios site is now displaying the desired status!
GPU0 OK: 65 degrees
Graphing is now working too!

I feel pretty silly that my problem was a permissions error... Thank you very much for your help!!!
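The failure was easy to miss because the plugin still reported OK with an empty reading. A hedged sketch of a guard that reports UNKNOWN when the file cannot be read, or when no integer reading is found, is shown here. The function name read_gpu_temp is hypothetical, and the file format (lines such as "Adapter 0 67.00") is an assumption inferred from the grep/awk/cut pipeline in the plugin.

```shell
#!/bin/bash
# Hypothetical guard (not part of the original plugin): print the two-digit
# temperature for an adapter, or an UNKNOWN message with return code 3.
read_gpu_temp() {
    file="$1"
    adapter="$2"
    if [ ! -r "$file" ]; then
        # Covers both a missing file and one the current user cannot read.
        echo "GPU$adapter UNKNOWN: cannot read $file"
        return 3
    fi
    # Assumption: lines look like "Adapter 0 67.00"; use the last match.
    temp=$(grep "Adapter $adapter" "$file" | tail -n 1 | awk '{print $3}' | cut -c1-2)
    case "$temp" in
        ''|*[!0-9]*)
            echo "GPU$adapter UNKNOWN: no temperature found in $file"
            return 3
            ;;
    esac
    echo "$temp"
}
```

In the plugin this could be used as: TEMP=$(read_gpu_temp "$temp_file" "$adapter") || { echo "$TEMP"; exit $STATE_UNKNOWN; }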
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagiosgraph - check_gputemp not returning status correct

Post by abrist »

Big bad POSIX gets us all occasionally.

You could also try sticking your txt file in /home/nagios/ to bypass this problem as well.
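To find out which part of a path is stopping a given user, a rough sketch like the one here can help. The function name first_blocked_component is hypothetical. It walks an absolute path from the root and prints the first directory that cannot be traversed, or the file itself if it cannot be read, by whoever runs it. To get a meaningful answer it needs to run as the nagios user, for example through sudo -u nagios.

```shell
#!/bin/bash
# Hypothetical diagnostic (not from the thread): print the first path
# component the current user cannot traverse or read, and return 1.
# Returns 0 and prints nothing when the whole path is accessible.
# Assumes an absolute path without glob characters.
first_blocked_component() {
    target="$1"
    current=""
    old_ifs=$IFS
    IFS=/
    set -- $target          # split the path on "/"
    IFS=$old_ifs
    for part in "$@"; do
        if [ -z "$part" ]; then
            continue        # skip the empty field before the leading "/"
        fi
        current="$current/$part"
        if [ -d "$current" ]; then
            # Directories need search (execute) permission to be traversed.
            [ -x "$current" ] || { echo "$current"; return 1; }
        else
            # Anything else must be readable (a missing entry also fails).
            [ -r "$current" ] || { echo "$current"; return 1; }
        fi
    done
    return 0
}
```

Run as nagios against /home/h3x/gpu_temps.txt, it would be expected to print /home/h3x if that home directory is not open to other users, which is a common default but an assumption here.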

We are glad you got your issues resolved. You may find hints concerning future nrpe issues in the following document:
http://library.nagios.com/library/produ ... -solutions

Locking thread, have a great week.