pnp4nagios hurts my head

Posted: Mon Jul 07, 2014 2:45 pm
by brandon.pal
I'm having some graph issues....

I've created a custom graph for my check_cpu plugin.

Template looks good. Perf data looks good. Graph on my screen looks to be the default graph and not my custom graph.

templates/check_cpu.php

Code: Select all

<?php

$opt[1] = "--vertical-label \"% Used\" -l0 --title \"CPU for $hostname / $servicedesc\" ";

# Load average
$def[1] = "DEF:load=$rrdfile:$DS[1]:AVERAGE " ;
$def[1] .= "AREA:load#00CC00:\"Load \" ";
$def[1] .= "GPRINT:load:LAST:\"%3.2lf %sB CUR \" ";
$def[1] .= "GPRINT:load:MAX:\"%3.2lf %sB MAX \" ";
$def[1] .= "GPRINT:load" . ':AVERAGE:"%3.2lf %sB AVG \j" ';

$opt[2] = "--vertical-label \"% Used\" -l0 --title \"CPU usage for $hostname / $servicedesc\" ";
# CPU
$def[2] = "DEF:cpuSys=$rrdfile:$DS[3]:AVERAGE " ;
$def[2] .= "AREA:cpuSys#FF0000:\"System \" ";
$def[2] .= "GPRINT:cpuSys:LAST:\"%3.2lf %sB CUR \" ";
$def[2] .= "GPRINT:cpuSys:MAX:\"%3.2lf %sB MAX \" ";
$def[2] .= "GPRINT:cpuSys" . ':AVERAGE:"%3.2lf %sB AVG \j" ';

$def[2] .= "DEF:cpuUser=$rrdfile:$DS[2]:AVERAGE " ;
$def[2] .= "AREA:cpuUser#FFFF00:\"User \" ";
$def[2] .= "GPRINT:cpuUser:LAST:\"%3.2lf %sB CUR \" ";
$def[2] .= "GPRINT:cpuUser:MAX:\"%3.2lf %sB MAX \" ";
$def[2] .= "GPRINT:cpuUser" . ':AVERAGE:"%3.2lf %sB AVG \j" ';

$def[2] .= "DEF:cpuIdle=$rrdfile:$DS[4]:AVERAGE " ;
$def[2] .= "AREA:cpuIdle#00CC00:\"Idle \" ";
$def[2] .= "GPRINT:cpuIdle:LAST:\"%3.2lf %sB CUR \" ";
$def[2] .= "GPRINT:cpuIdle:MAX:\"%3.2lf %sB MAX \" ";
$def[2] .= "GPRINT:cpuIdle" . ':AVERAGE:"%3.2lf %sB AVG \j" ';

$opt[3] = "--vertical-label \"% Used\" -l0 --title \"IO Wait for $hostname / $servicedesc\" ";
# CPU
$def[3] = "DEF:IoWait=$rrdfile:$DS[2]:AVERAGE " ;
$def[3] .= "AREA:IoWait#00CC00:\"IoWait \" ";
$def[3] .= "GPRINT:IoWait:LAST:\"%3.2lf %sB CUR \" ";
$def[3] .= "GPRINT:IoWait:MAX:\"%3.2lf %sB MAX \" ";
$def[3] .= "GPRINT:IoWait" . ':AVERAGE:"%3.2lf %sB AVG \j" ';
?>
Performance Data:
load=1.77;cpuUser=17.6;cpuSys=3.5;cpuIdle=;IoWait=0.1

Re: pnp4nagios hurts my head

Posted: Mon Jul 07, 2014 8:24 pm
by Box293
brandon.pal wrote: Graph on my screen looks to be the default graph and not my custom graph.
Yes, you are right.

brandon.pal wrote:templates/check_cpu.php
The pnp template check_cpu.php will be used when the name of the check_command directive is check_cpu.

Some light reading ;)
http://docs.pnp4nagios.org/pnp-0.6/tpl
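
To see which template file would apply to a given check_command, a small helper along these lines may be useful. This is only a sketch: the function name `find_pnp_template` is made up, the directories are passed in by the caller, and the assumed order (your own `templates` directory first, then the shipped `templates.dist`, then `default.php`) is my reading of the PNP 0.6 docs linked above, not something confirmed in this thread.

```shell
#!/bin/bash
# Hypothetical helper: report which template file PNP would likely pick
# for a check_command. Directories are supplied by the caller because
# install paths differ between systems.
find_pnp_template() {
    local command_name="$1"
    shift
    local dir
    for dir in "$@"; do
        if [ -f "$dir/$command_name.php" ]; then
            echo "$dir/$command_name.php"
            return 0
        fi
    done
    # Nothing matched: PNP is assumed to fall back to default.php
    echo "default.php"
    return 1
}
```

Call it with the command name followed by the directories to search, in priority order. If it prints `default.php`, the custom template is not where PNP would look for it.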

Re: pnp4nagios hurts my head

Posted: Mon Jul 07, 2014 10:26 pm
by brandon.pal
The check_command directive is check_cpu. I have multiple other plugins with custom graphs that I've created that work fine. This one for some reason is being completely ignored.

Re: pnp4nagios hurts my head

Posted: Mon Jul 07, 2014 10:34 pm
by Box293
Inside the perfdata/host_name/service_name.xml file, what is the value in the template field (there will be one for each datasource)?

Example:

Code: Select all

<TEMPLATE>check_http</TEMPLATE>
Also, do you have a link to the plugin, or can you attach it to your post please?
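
If it helps, the template value can be pulled out of the XML from the command line. A minimal sketch, assuming the `<TEMPLATE>` element sits on a single line as in the example above; `show_pnp_templates` is just an illustrative name.

```shell
#!/bin/bash
# Print the distinct <TEMPLATE> values found in a PNP service XML file.
# Pass the path to perfdata/host_name/service_name.xml on your system.
show_pnp_templates() {
    local xml_file="$1"
    sed -n 's:.*<TEMPLATE>\(.*\)</TEMPLATE>.*:\1:p' "$xml_file" | sort -u
}
```

More than one line of output would mean the datasources in that file do not all point at the same template.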

Re: pnp4nagios hurts my head

Posted: Tue Jul 08, 2014 8:26 am
by brandon.pal

Code: Select all

<TEMPLATE>check_cpu</TEMPLATE>

check_cpu.sh

Code: Select all

#!/bin/bash

#set Argument Vars
loadW="$1"
loadC="$2"
cpuW="$3"
cpuC="$4"
ioW="$5"
ioC="$6"
errorLevel=0

#Run Top & Set Vars
topResult=$(top -b -n 1)
loadAverageAll=$(echo "$topResult" | sed -n 1p | awk '{print $10 " " $11 " " $12 " " $13 " " $14 " " $15}')
loadAverage15d=$(echo $loadAverageAll | awk '{print $5}' | sed 's/,//g')
loadAverage15=$(printf "%.0f" $loadAverage15d)
cpuResult=$(echo "$topResult" | sed -n 3p)
cpuUserd=$(echo "$cpuResult" | awk '{print $2}' | cut -d% -f1)
cpuUser=$(printf "%.0f" $cpuUserd)
cpuSystemd=$(echo "$cpuResult" | awk '{print $3}' | cut -d% -f1)
cpuSystem=$(printf "%.0f" $cpuSystemd)
cpuIdler=$(echo "$cpuResult" | awk '{print $5}' | cut -d% -f1)
cpuIdle=$(printf "%.0f" $cpuIdler)
cpuUsaged=$(echo "$cpuUserd + $cpuSystemd" | bc)
cpuUsage=$(echo "$cpuUser + $cpuSystem" | bc)

cpuIoWaitd=$(echo "$cpuResult" | awk '{print $6}' | cut -d% -f1)
cpuIoWait=$(printf "%.0f" $cpuIoWaitd)

#Check for Command Line Arguments
if [ $# -lt 6 ]
then
        echo "Missing Command Line Arguments"
        echo "Arg 1: 15 Minute Load Warning Level"
        echo "Arg 2: 15 Minute Load Critical Warning"
        echo "Arg 3: CPU Usage % Warning Level"
        echo "Arg 4: CPU Usage % Critical Level"
        echo "Arg 5: IO Wait % Warning Level"
        echo "Arg 6: IO Wait % Critical Level"
        echo " "
        echo "* Please do not include the % sign in your input"
        exit 3
fi

#Check LOAD Status
if [ $loadAverage15 -ge $loadW ] && [ $loadAverage15 -lt $loadC ]
then
        exitLevel=1
        msg="Load Warning ($loadAverage15%)"
elif [ $loadAverage15 -ge $loadC ]
then
        exitLevel=2
        msg="Load Critical ($loadAverage15%)"
elif [ $loadAverage15 -lt $loadW ] || [ $loadAverage15 -lt $loadC ]
then
        exitLevel=0
        msg="Load OK ($loadAverage15d%)"

else
        exitLevel=3
        msg="Load Average Error"
fi
#Check CPU Status
if [ $cpuUsage -ge $cpuW ] && [ $cpuUsage -lt $cpuC ]
then
        msg="$msg CPU Warning ($cpuUsaged%)"
        case "$exitLevel" in
                0)
                exitLevel=1
                break
                ;;
                3)
                exitLevel=1
                ;;
        esac
elif [ $cpuUsage -gt $cpuC ]
then
        msg="$msg CPU Critical ($cpuUsaged%)"
        case "$exitLevel" in
                0)
                exitLevel=2
                break
                ;;
                1)
                exitLevel=2
                break
                ;;
                3)
                exitLevel=2
                break
                ;;
        esac
elif [ $cpuUsage -lt $cpuW ]
then
        msg="$msg CPU OK ($cpuUsaged%)"
else

        msg="$msg CPU Error"
        case "$exitLevel" in
                0)
                exitLevel=3
                break
                ;;
        esac
fi


#Check IO Wait Status
if [ $cpuIoWait -ge $ioW ] && [ $cpuIoWait -lt $ioC ]
then
        msg="$msg IO Wait Warning ($cpuIoWait%)"
        case "$exitLevel" in
                0)
                exitLevel=1
                break
                ;;
                3)
                exitLevel=1
                ;;
        esac
elif [ $cpuIoWait -gt $ioC ]
then
        msg="$msg IO Wait Critical ($cpuIoWait%)"
        case "$exitLevel" in
                0)
                exitLevel=2
                break
                ;;
                1)
                exitLevel=2
                break
                ;;
                3)
                exitLevel=2
                break
                ;;
        esac
elif [ $cpuIoWait -lt $ioW ]
then
        msg="$msg IO Wait OK ($cpuIoWaitd%)"
else

        msg="$msg IO Wait Error"
        case "$exitLevel" in
                0)
                exitLevel=3
                break
                ;;
        esac
fi

        echo "$msg | load=$loadAverage15d%;cpuUser=$cpuUserd%;cpuSys=$cpuSystemd%;cpuIdle=$cpuIdle%;IoWait=$cpuIoWaitd%"
        echo "<pre>"
        echo "$loadAverageAll"
        echo "$cpuResult"
        echo "</pre>"
        exit $exitLevel

Re: pnp4nagios hurts my head

Posted: Tue Jul 08, 2014 8:29 am
by brandon.pal
Ok very weird. It's now about 24 hours later and it's showing the correct graph. I promise it was the wrong one last night.

Re: pnp4nagios hurts my head

Posted: Tue Jul 08, 2014 10:25 am
by sreinhardt
Interesting that it would take so long to update properly. Is it still working now?

Re: pnp4nagios hurts my head

Posted: Tue Jul 08, 2014 12:14 pm
by brandon.pal
Yes & No.

It's still showing my custom template but

1) It will not show the 3rd graph for $DS[5].
The code below does not work:

Code: Select all

$opt[3] = "--vertical-label \"% Used\" -l0 --title \"IO Wait for $hostname / $servicedesc\" ";
# CPU
$def[3] = "DEF:IoWait=$rrdfile:$DS[5]:AVERAGE " ;
$def[3] .= "AREA:IoWait#00CC00:\"IoWait \" ";
$def[3] .= "GPRINT:IoWait:LAST:\"%3.2lf %sB CUR \" ";
$def[3] .= "GPRINT:IoWait:MAX:\"%3.2lf %sB MAX \" ";
$def[3] .= "GPRINT:IoWait" . ':AVERAGE:"%3.2lf %sB AVG \j" ';
If I change $DS[5] to $DS[4], it creates the graph, but of course with the wrong data.

Code: Select all

$opt[3] = "--vertical-label \"% Used\" -l0 --title \"IO Wait for $hostname / $servicedesc\" ";
# CPU
$def[3] = "DEF:IoWait=$rrdfile:$DS[4]:AVERAGE " ;
$def[3] .= "AREA:IoWait#00CC00:\"IoWait \" ";
$def[3] .= "GPRINT:IoWait:LAST:\"%3.2lf %sB CUR \" ";
$def[3] .= "GPRINT:IoWait:MAX:\"%3.2lf %sB MAX \" ";
$def[3] .= "GPRINT:IoWait" . ':AVERAGE:"%3.2lf %sB AVG \j" ';
2) I've tried to create a stacked graph. With LINE only, it works great, but with AREA it does not show all 3 stacks, only the biggest one. Is there no way to stack a graph using AREA?

Re: pnp4nagios hurts my head

Posted: Tue Jul 08, 2014 3:52 pm
by brandon.pal
I have fixed the stacked graph issue but am still unable to get my 3rd graph to work.

Re: pnp4nagios hurts my head

Posted: Tue Jul 08, 2014 6:50 pm
by Box293
Your perfdata string is not correctly formatted.
brandon.pal wrote:load=1.77;cpuUser=17.6;cpuSys=3.5;cpuIdle=;IoWait=0.1
It should look like this:

Code: Select all

load=1.77 cpuUser=17.6 cpuSys=3.5 cpuIdle= IoWait=0.1
The label/value pairs need to be separated by spaces, not semicolons.

Also, in the example above, cpuIdle= does not have a value, which will also cause issues.
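
A quick way to catch both problems before PNP sees them is to run the perfdata string through a rough format check. The function below is a sketch, not PNP's own parser: `validate_perfdata` is a made-up name, and it assumes labels without spaces or quotes, so every whitespace-separated token has to look like label=value[UOM] with up to four optional numeric fields after it.

```shell
#!/bin/bash
# Rough perfdata sanity check (a sketch, not PNP's own parser).
# Assumes labels contain no spaces or quotes, so each whitespace-separated
# token must look like label=value[UOM][;warn][;crit][;min][;max].
validate_perfdata() {
    local perfdata="$1" token
    local -a tokens
    local pattern='^[^=;]+=(-?[0-9]+(\.[0-9]+)?|U)[a-zA-Z%]*(;[-0-9.:~@]*){0,4}$'
    read -ra tokens <<< "$perfdata"
    # An empty perfdata string is treated as invalid
    [ "${#tokens[@]}" -gt 0 ] || return 1
    for token in "${tokens[@]}"; do
        if [[ ! $token =~ $pattern ]]; then
            echo "invalid perfdata token: $token" >&2
            return 1
        fi
    done
    return 0
}
```

With this check, the semicolon-separated string is rejected as one malformed token, and a string containing `cpuIdle=` is rejected because the value is missing.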

To explain how I determined this:
  • I edited /usr/local/nagios/etc/pnp/process_perfdata.cfg and set LOG_LEVEL = 1.
  • I then forced Nagios to execute the check. The command being executed is:

    Code: Select all

    check_cpu.sh 10 20 70 80 30 40
  • The following then appeared in the log file /usr/local/nagios/var/perfdata.log:

    Code: Select all

    Found Performance Data for localhost / Current_Load_-_Custom_Check (load=%;cpuUser=8.1%;cpuSys=1.8%;cpuIdle=90%;IoWait=0.2%)
    Invalid Perfdata detected
    2 lines processed
  • I also observed that no .rrd or .xml files had been created.
  • I then edited your script and changed line 152 (replacing the semicolons with spaces) to:

    Code: Select all

    echo "$msg | load=$loadAverage15d% cpuUser=$cpuUserd% cpuSys=$cpuSystemd% cpuIdle=$cpuIdle% IoWait=$cpuIoWaitd%"
  • I then forced Nagios to execute the check and the following appeared in the log file:

    Code: Select all

    Found Performance Data for localhost / Current_Load_-_Custom_Check (load=% cpuUser=8.3% cpuSys=1.8% cpuIdle=90% IoWait=0.2%)
    Invalid Perfdata detected
    2 lines processed
  • Again, no .rrd or .xml files had been created.
  • I noticed that load=% has no value, so I edited your script again and changed line 152 to:

    Code: Select all

    echo "$msg | load=0% cpuUser=$cpuUserd% cpuSys=$cpuSystemd% cpuIdle=$cpuIdle% IoWait=$cpuIoWaitd%"
  • I hard-coded load=0% just to confirm that the rest of the string is correctly formatted.
  • I then forced Nagios to execute the check and the following appeared in the log file:

    Code: Select all

    Found Performance Data for localhost / Current_Load_-_Custom_Check (load=0% cpuUser=8.3% cpuSys=1.8% cpuIdle=90% IoWait=0.2%)
    3 lines processed
  • The .rrd and .xml files now exist.
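
The log inspection above can be scripted. A small sketch, assuming the log messages read exactly as in the excerpts above (real log lines may carry extra prefixes such as timestamps, which this still tolerates); `show_invalid_perfdata` is an illustrative name.

```shell
#!/bin/bash
# Print the "Found Performance Data" line that precedes each
# "Invalid Perfdata detected" entry in a process_perfdata log.
# Matching is by substring, so any leading prefix on a line is ignored.
show_invalid_perfdata() {
    local log_file="$1"
    awk '/Found Performance Data/ { last = $0 }
         /Invalid Perfdata detected/ && last != "" { print last; last = "" }' "$log_file"
}
```

Run it against the perfdata.log path configured on your system. Each line printed is a perfdata string that PNP rejected.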
So what I would do now is:
  • Update the script as demonstrated.
  • Make sure the script checks that the variables have valid values.
  • Delete the existing .rrd and .xml files that already contain perfdata (they will cause issues in your pnp graphs).
  • Wait for enough data to be collected.
  • Revisit your custom graphs and see if they are working now.
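
For the "valid values" step, one option is to pass each variable through a small guard before it goes into the perfdata string. This is a sketch: `perf_value` is a made-up helper, and substituting "U" relies on the plugin guidelines, which allow a literal U when a value could not be determined. How your PNP version graphs U is an assumption you would need to test.

```shell
#!/bin/bash
# Return the value unchanged if it is a plain number, otherwise "U",
# which the plugin guidelines allow when a value could not be determined.
perf_value() {
    local value="$1"
    local numeric='^-?[0-9]+(\.[0-9]+)?$'
    if [[ $value =~ $numeric ]]; then
        echo "$value"
    else
        echo "U"
    fi
}

# Possible use in the final echo of check_cpu.sh (not run here):
#   echo "$msg | load=$(perf_value "$loadAverage15d") cpuUser=$(perf_value "$cpuUserd")%"
```

An empty variable, like the empty load value seen in the log, then becomes `load=U` rather than a bare `load=`.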
Some information that is helpful:
Nagios Plugin Development Guidelines
However, it is the responsibility of the plugin writer to ensure the performance data is in a "Nagios Plugins" format. This is the expected format:

'label'=value[UOM];[warn];[crit];[min];[max]

Notes:

space separated list of label/value pairs

label can contain any characters except the equals sign or single quote (')

the single quotes for the label are optional. Required if spaces are in the label

label length is arbitrary, but ideally the first 19 characters are unique (due to a limitation in RRD). Be aware of a limitation in the amount of data that NRPE returns to Nagios

to specify a quote character, use two single quotes

warn, crit, min or max may be null (for example, if the threshold is not defined or min and max do not apply). Trailing unfilled semicolons can be dropped

min and max are not required if UOM=%

value, min and max in class [-0-9.]. Must all be the same UOM. value may be a literal "U" instead, this would indicate that the actual value couldn't be determined

warn and crit are in the range format (see the Section called Threshold and ranges). Must be the same UOM

UOM (unit of measurement) is one of:
no unit specified - assume a number (int or float) of things (eg, users, processes, load averages)
s - seconds (also us, ms)
% - percentage
B - bytes (also KB, MB, TB)
c - a continuous counter (such as bytes transmitted on an interface)
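
To make the format above concrete, here is a hedged sketch of a helper that assembles one perfdata item from its parts. `perf_item` is a hypothetical name, and it is deliberately simplified: it does not double single quotes inside a label, and it does no validation of the value.

```shell
#!/bin/bash
# Build one perfdata item as 'label'=value[UOM];[warn];[crit];[min];[max],
# dropping trailing empty fields as the guidelines allow.
# Arguments: label value [uom] [warn] [crit] [min] [max]
perf_item() {
    local label="$1" value="$2" uom="$3" warn="$4" crit="$5" min="$6" max="$7"
    local rest="${value}${uom};${warn};${crit};${min};${max}"
    # Strip trailing semicolons left by empty threshold/min/max fields
    while [[ $rest == *";" ]]; do
        rest="${rest%;}"
    done
    # Quote the label only when it contains a space
    if [[ $label == *" "* ]]; then
        label="'${label}'"
    fi
    echo "${label}=${rest}"
}
```

Items built this way are then joined with spaces, for example `echo "$msg | $(perf_item cpuUser "$cpuUserd" % "$cpuW" "$cpuC") $(perf_item IoWait "$cpuIoWaitd" % "$ioW" "$ioC")"`, using the variable names from the script earlier in the thread.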