
ATI Radeon 6990 Temperature Monitoring Issue

Posted: Thu Apr 25, 2013 2:37 pm
by h3x
Hi everyone!

How can I specify that a plugin script needs to be run on a client machine and not the server?

Details:

I am working on configuring my Nagios Core server (CentOS) to monitor the temperatures of an ATI graphics card in one of the client systems (Ubuntu), but I am having some problems.
I am using the script from http://exchange.nagios.org/directory/Pl ... mp/details

On the Nagios Core console I receive the following Status Information for the service: "(No output returned from plugin) It appears you don't have aticonfig installed in /usr/bin/aticonfig"
This is the same output I get when I attempt to manually run the script on the server. This makes sense because the server doesn't have an ATI graphics card and doesn't have aticonfig installed.

I am able to run the script manually on the client system without problems, and I have added it to the nrpe.cfg file (in case that is necessary):

Code: Select all

command[check_gputemp0]=/usr/lib/nagios/plugins/check_gputemp -w 90 -c 95
command[check_gputemp1]=/usr/lib/nagios/plugins/check_gputemp --adapter 1 -w 90 -c 95
The ATI graphics card is a 6990 so I am trying to monitor the temperatures of both GPUs. This is my commands.cfg file:

(As you can see from the commented-out 'command_line' lines in the file, I also tried using the full path to the check_gputemp script as it exists on the client machine. With that command_line setting I get the Status Information: "Return code of 127 is out of bounds - plugin may be missing")

Code: Select all

# 'check_gpu0' command definition
define command{
        command_name    check_gputemp0
        #command_line   /usr/lib/nagios/plugins/check_gputemp -w $ARG1$ -c $ARG2$
        command_line    $USER1$/check_gputemp -w $ARG1$ -c $ARG2$
        }

# 'check_gpu1' command definition
define command{
        command_name    check_gputemp1
        #command_line    /usr/lib/nagios/plugins/check_gputemp --adapter 1 -w $ARG1$ -c $ARG2$
        command_line    $USER1$/check_gputemp --adapter 1 -w $ARG1$ -c $ARG2$
        }

and my linux.cfg:

Code: Select all

# Define a service to check the ATI GPU 0 temperature.
# Critical if >= 95 degrees Celsius, warning if >= 90 degrees Celsius

define service{
        use                             generic-service         ; Name of service template to use
        host_name                       <host_name>
        service_description             GPU 0 Temperature
        check_command                   check_gputemp0!90!95
        }

# Define a service to check the ATI GPU 1 temperature.
# Critical if >= 95 degrees Celsius, warning if >= 90 degrees Celsius

define service{
        use                             generic-service         ; Name of service template to use
        host_name                       <host_name>
        service_description             GPU 1 Temperature
        check_command                   check_gputemp1!90!95
        }
When I check the server configuration I get "Things look okay - No serious problems were detected during the pre-flight check"
All other services (e.g. SSH, PING, Current Load) on this client are working correctly.
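
On the "Return code of 127" message mentioned above: 127 is the exit status a shell reports when the command it was asked to run cannot be found, which fits a plugin path that exists on the client but not on the server. A minimal sketch to reproduce that status locally; the path is a made-up placeholder chosen so that it should not exist, and run_and_report is a hypothetical helper, not part of Nagios:

```shell
#!/usr/bin/env bash
# Hypothetical path, picked so that it does not exist on the machine running this.
missing_plugin="/nonexistent/plugins/check_gputemp"

# Run the given command line, discard its output, and print only its exit status.
run_and_report() {
    local rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    echo "$rc"
}

status=$(run_and_report "$missing_plugin" -w 90 -c 95)
echo "exit status: $status"
```

Nagios only accepts plugin return codes 0 to 3, so anything else, including a shell's "command not found" status, is reported as out of bounds.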

----

Sorry for writing a book but if anyone is able to help out I would greatly appreciate it!

Re: ATI Radeon 6990 Temperature Monitoring Issue

Posted: Thu Apr 25, 2013 2:48 pm
by scottwilkerson
You want to change your service definitions to this:

Code: Select all

define service{
        use                             generic-service         ; Name of service template to use
        host_name                       <host_name>
        service_description             GPU 0 Temperature
        check_command                   check_nrpe!check_gputemp0
        }

define service{
        use                             generic-service         ; Name of service template to use
        host_name                       <host_name>
        service_description             GPU 1 Temperature
        check_command                   check_nrpe!check_gputemp1
        }
This assumes you have the following command defined:

Code: Select all

define command {
       command_name                  		check_nrpe
       command_line                  		$USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$ $ARG2$
}	
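
To make the macro expansion concrete, here is a rough sketch that prints the command line the server would end up running for each GPU service, so it can be pasted into a shell and tried by hand. The plugin directory and the 192.0.2.10 address are assumptions (placeholders for whatever $USER1$ and $HOSTADDRESS$ resolve to on your install), and nrpe_command_line is just a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Assumed location of check_nrpe on the Nagios server; adjust to match $USER1$.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/local/nagios/libexec}"

# Print the check_nrpe command line for a given host address and NRPE command name.
# $1 stands in for $HOSTADDRESS$, $2 for $ARG1$ (the name defined in the client's nrpe.cfg).
nrpe_command_line() {
    local host="$1" nrpe_cmd="$2"
    printf '%s/check_nrpe -H %s -t 30 -c %s\n' "$PLUGIN_DIR" "$host" "$nrpe_cmd"
}

# 192.0.2.10 is a documentation placeholder, not the real client address.
nrpe_command_line 192.0.2.10 check_gputemp0
nrpe_command_line 192.0.2.10 check_gputemp1
```

The thresholds no longer travel from the server in this setup: they come from the `-w 90 -c 95` already written into the command[...] lines on the client.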

Re: ATI Radeon 6990 Temperature Monitoring Issue

Posted: Thu Apr 25, 2013 3:55 pm
by h3x
Thank you scottwilkerson!

It looks like that fixed it! I am now getting "GPU 0 OK - Temperature is " as the Status Information.

Sorry to ask another question but are you familiar with this plugin? Any idea why the temperature is not being displayed in the Status Information?

The script displays the following when run manually:

Code: Select all

GPU 0 OK - Temperature is 66 |             Sensor 0: Temperature - 66.00 C
The code for the output section:

Code: Select all

# And finally check the temperature against our thresholds
if [[ "$TEMP" -gt "$thresh_crit" ]]; then
        # Temperature is above critical threshold
        echo "GPU $adapter CRITICAL - Temperature is $TEMP | $PERFDATA"
        exit $STATE_CRITICAL

elif [[ "$TEMP" -gt "$thresh_warn" ]]; then
        # Temperature is above warning threshold
        echo "GPU $adapter WARNING - Temperature is $TEMP | $PERFDATA"
        exit $STATE_WARNING

else
        # Temperature is ok
        echo "GPU $adapter OK - Temperature is $TEMP | $PERFDATA"
        exit $STATE_OK
fi
Thanks again for your help!
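
One observation on the output section quoted above: if $TEMP ends up empty, bash appears to treat the empty operand of -gt as 0, so both comparisons fail and the script falls through to the OK branch with a blank temperature. A minimal sketch of a guard that reports UNKNOWN instead; check_temp is a hypothetical function written for illustration, not the plugin's actual code, and it does not explain why the value is empty under NRPE:

```shell
#!/usr/bin/env bash
# Standard Nagios plugin return codes.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Hypothetical helper: print a status line and return the matching Nagios state.
# Arguments: adapter number, temperature reading, warning threshold, critical threshold.
check_temp() {
    local adapter="$1" temp="$2" warn="$3" crit="$4"

    # Refuse to report OK when the reading is empty or not a whole number.
    if ! [[ "$temp" =~ ^[0-9]+$ ]]; then
        echo "GPU $adapter UNKNOWN - could not read temperature"
        return "$STATE_UNKNOWN"
    fi

    if [[ "$temp" -gt "$crit" ]]; then
        echo "GPU $adapter CRITICAL - Temperature is $temp"
        return "$STATE_CRITICAL"
    elif [[ "$temp" -gt "$warn" ]]; then
        echo "GPU $adapter WARNING - Temperature is $temp"
        return "$STATE_WARNING"
    fi

    echo "GPU $adapter OK - Temperature is $temp"
    return "$STATE_OK"
}

check_temp 0 66 90 95
check_temp 0 "" 90 95 || echo "(exit status $?)"
```

With a guard like this, a missing reading shows up in Nagios as UNKNOWN rather than as a healthy service with no number in it.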