custom Nagios check script not returning value to Nagios

theace18 · Post by **theace18** » Fri May 31, 2013 1:28 pm

I'm having trouble with a custom check script that I wrote for Nagios. The purpose of the script is to check for a degraded state on our MegaRAID cards and report it back to Nagios.

Code:

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
else
        echo "OK: No Errors Found"
        exit 0
fi

On the server that this script runs on, I have given the nagios user sudo access (with no password) to ONLY this script and the MegaCli64 binary, since MegaCli64 will only report the data back if you run it as the root user. Now here is an interesting twist:

If I run this script locally on the server, it runs fine and returns the proper error code.

[root@nas1 nrpe_raid_monitor]# su nagios
sh-3.2$ ./raid_monitoring_megaraid.sh
CRITICAL: RAID Errors Present

However, if I run the script through NRPE/Nagios, it always returns as if the status is OK, even though it isn't.

Any thoughts on what I could be doing wrong? This is the first time I've ever written custom Nagios checks, so I may have missed something.

Any help would be greatly appreciated. Thanks!

abrist · Post by **abrist** » Fri May 31, 2013 2:36 pm

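First of all, the script will always return OK when something goes wrong, even if the syntax or the permissions are incorrect, because of the catch-all "else" statement. If the MegaCli64 call fails, grep finds nothing, the count is 0, and the check falls through to OK.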

theace18 wrote:
else
        echo "OK: No Errors Found"
        exit 0
fi

You may want to add a bit more validation/logic to the plugin.

Can you run this from the CLI on the Nagios server through NRPE and post the output?

EDIT: I suggest you also return the output from the actual check or write the output to a file in /tmp to facilitate troubleshooting.
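A minimal sketch of the extra validation suggested above, assuming the MegaCli64 path and arguments from the original script. The function name `classify_raid` and the UNKNOWN wording are illustrative only; the one convention relied on is the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The point is that a failed or empty MegaCli64 run is reported as UNKNOWN instead of falling through to OK.

```shell
#!/bin/sh
# Sketch only: classify_raid is a hypothetical helper name.
# Usage: classify_raid "<MegaCli64 output>" <MegaCli64 exit status>
# Prints one Nagios status line and returns the plugin exit code.
classify_raid() {
    output=$1
    rc=$2

    # A failed command must never be reported as OK.
    if [ "$rc" -ne 0 ]; then
        echo "UNKNOWN: MegaCli64 exited with status $rc"
        return 3
    fi
    # Assumption: healthy output always contains at least one "State" line.
    if ! printf '%s\n' "$output" | grep -q 'State'; then
        echo "UNKNOWN: no virtual drive state found in MegaCli64 output"
        return 3
    fi

    degraded=$(printf '%s\n' "$output" | grep -c 'Degraded')
    if [ "$degraded" -ge 1 ]; then
        echo "CRITICAL: $degraded degraded virtual drive(s)"
        return 2
    fi
    echo "OK: No Errors Found"
    return 0
}

# In the real plugin the function could be driven like this
# (left commented out so the sketch can be sourced and tried by hand):
#   output=$(/usr/bin/sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL 2>&1)
#   classify_raid "$output" "$?"
#   exit $?
```

With this shape, a sudo or permission failure shows up in Nagios as UNKNOWN with a reason, rather than as a misleading OK.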

theace18 · Post by **theace18** » Fri May 31, 2013 4:11 pm

So I updated the script a bit:

Code:

#!/bin/sh

# script to check the status of the RAID volume on a server with a Megaraid RAID card

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
elif [ $COUNT -eq 0 ]
then
        echo "OK: No Errors Found"
        exit 0
fi

The script is basically looking for a Degraded state in the output of the MegaCli64 utility. When I run the MegaCli64 command as the nagios user I get:

sh-3.2$ sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 233.312 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None

Virtual Drive: 1 (Target Id: 1)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 3.637 TB
State : Degraded
Strip Size : 64 KB
Number Of Drives per span:2
Span Depth : 2
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None

Exit Code: 0x00

Most of that info up above is not needed. I just want to know if I have a Virtual Drive that is degraded, and how many. So I run this:

sh-3.2$ sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l
1

The 1 means I have one degraded virtual drive on my array, hence the alert.
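A small variation, sketched under the assumption that the output always has the `Virtual Drive: N` / `State : ...` layout shown above: `grep -c` gives the same count without `wc -l`, and a short awk filter can also name which virtual drives are degraded. The function name `degraded_drives` is made up for this illustration.

```shell
#!/bin/sh
# degraded_drives: reads MegaCli64 -LDInfo output on stdin and prints the
# number of each virtual drive whose State line contains "Degraded".
degraded_drives() {
    awk '
        /^[[:space:]]*Virtual Drive:/ { vd = $3 }        # remember the current drive number
        /^[[:space:]]*State/ && /Degraded/ { print vd }  # report it if its state is degraded
    '
}

# Example with a trimmed copy of the output from this post:
sample='Virtual Drive: 0 (Target Id: 0)
State : Optimal
Virtual Drive: 1 (Target Id: 1)
State : Degraded'

printf '%s\n' "$sample" | degraded_drives     # prints: 1
printf '%s\n' "$sample" | grep -c Degraded    # prints: 1
```

The first `1` is the drive number and the second is the count; they only happen to be the same value for this sample.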

So then I run the script manually on my nas1 server as the nagios user:

sh-3.2$ ./raid_monitoring_megaraid.sh
CRITICAL: RAID Errors Present

It even exits with an exit status of 2, which is supposed to trigger the CRITICAL alarm:

sh-3.2$ echo $?
2

However, when I run the script through NRPE on the Nagios server, it says everything is OK.

[root@nagios libexec]# ./check_nrpe -H nas1 -c check_raid_status
OK: No Errors Found

I'm at a loss. Any suggestions? Thanks in advance.

abrist · Post by **abrist** » Mon Jun 03, 2013 11:29 am

Make the script output to a file as well:

Code:

sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL > /tmp/test

This way we can actually see what is returned from the nrpe call.
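One caveat with a plain `>` redirect is that it records standard output only; anything sudo or MegaCli64 prints to standard error is lost. A sketch of a slightly fuller debug capture follows. The helper name `run_logged` is a placeholder chosen for this example, and `/tmp/test` is just the file name already used in this thread.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, writes the calling user, the command line, the command's
# stdout and stderr, and its exit status to LOGFILE, then returns that status.
run_logged() {
    logfile=$1
    shift
    {
        echo "user: $(id -un)"
        echo "command: $*"
    } > "$logfile"
    "$@" >> "$logfile" 2>&1
    status=$?
    echo "exit status: $status" >> "$logfile"
    return "$status"
}

# Inside the check script this could be called as:
#   run_logged /tmp/test /usr/bin/sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL
```

Logging the user and the exit status alongside the output makes it easier to compare a manual run with a run triggered through NRPE.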

theace18 · Post by **theace18** » Mon Jun 03, 2013 12:20 pm

Ran as the Nagios user:

sh-3.2$ sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL >/tmp/test
sh-3.2$ cat /tmp/test

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 233.312 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None

Virtual Drive: 1 (Target Id: 1)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 3.637 TB
State : Degraded
Strip Size : 64 KB
Number Of Drives per span:2
Span Depth : 2
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None

Exit Code: 0x00

So I see there is that Exit Code: 0x00. Now I could see where that would be a problem, but in my custom script, I just tell it to grep for the word "Degraded" so I can look for degraded volumes. So that Exit Code: 0x00 wouldn't matter, would it?

abrist · Post by **abrist** » Mon Jun 03, 2013 12:44 pm

I don't think it would matter. Try sticking that bit of code into your script so you can check the output while running the actual check:

Code:

#!/bin/sh

# script to check the status of the RAID volume on a server with a Megaraid RAID card

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL > /tmp/test
if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
elif [ $COUNT -eq 0 ]
then
        echo "OK: No Errors Found"
        exit 0
fi

theace18 · Post by **theace18** » Mon Jun 03, 2013 4:32 pm

Ah thank you!

So entering a little bit of code to output the results helped.

I changed the script to write both the wc -l count and the full MegaCli64 command output to /tmp/test.

Code:

#!/bin/sh

# script to check the status of the RAID volume on a server with a Megaraid RAID card

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l >/tmp/test
sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL >>/tmp/test

if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
elif [ $COUNT -eq 0 ]
then
        echo "OK: No Errors Found"
        exit 0
fi

So again I can run this custom script as the nagios user. It writes the correct values to /tmp/test:

sh-3.2$ ./raid_monitoring_megaraid.sh
CRITICAL: RAID Errors Present
sh-3.2$ cat /tmp/test
1

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 233.312 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None

Virtual Drive: 1 (Target Id: 1)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 3.637 TB
State : Degraded
Strip Size : 64 KB
Number Of Drives per span:2
Span Depth : 2
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None

Exit Code: 0x00

Then I try to run it through the Nagios server manually using NRPE:

[root@nagios servers]# /usr/local/nagios/libexec/check_nrpe -H nas1 -c check_raid_status
OK: No Errors Found

And the output of /tmp/test:

sh-3.2$ cat /tmp/test
0

So this tells me that the script is running, but MegaCli64 produces no output when it is called through NRPE, presumably because it can't run successfully. That doesn't make sense to me. I gave the nagios user sudo rights on the NAS, set to ALL with NOPASSWD. Furthermore, I can run the script manually just fine as the nagios user. So I'm REALLY confused now. Any thoughts?

sreinhardt · Post by **sreinhardt** » Mon Jun 03, 2013 4:47 pm

Can you show us the output from the following commands please:

Code:

cat /etc/xinetd.d/nrpe
cat /etc/sudoers | grep nagios

Also I might suggest referencing your programs with the full path names, such as /usr/bin/sudo to avoid any path issues with nrpe.
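To reproduce path problems by hand, one option is to run the plugin with a stripped environment, which is closer to what a daemon such as NRPE provides than an interactive login shell is. This is only a sketch: `run_bare` is a hypothetical helper, and the minimal PATH is an assumption, not the value NRPE actually uses. It also does not remove the controlling terminal, so it will not reproduce failures that depend on one.

```shell
#!/bin/sh
# run_bare COMMAND [ARGS...]
# Runs COMMAND with every inherited environment variable removed and only a
# minimal PATH set, so anything that relied on the login environment fails.
run_bare() {
    env -i PATH=/bin:/usr/bin "$@"
}

# Example, as the nagios user on nas1:
#   run_bare /usr/local/nagios/libexec/raid_monitoring_megaraid.sh; echo "exit: $?"
```

If the check still reports CRITICAL under `run_bare`, a missing PATH entry or environment variable is unlikely to be the cause.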

theace18 · Post by **theace18** » Mon Jun 03, 2013 5:04 pm

I modified my script to utilize the full path for sudo: /usr/bin/sudo. Still doesn't work.

NRPE runs on all of our servers as a daemon, not through xinetd.

Below is the nagios line in the /etc/sudoers file:

[root@nas1 nagios]# cat /etc/sudoers|grep nagios
nagios ALL=(ALL) NOPASSWD: ALL
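Since the sudoers rule itself allows everything, one thing that may be worth ruling out is a `Defaults requiretty` line elsewhere in the file. Some distributions ship it by default, and with it sudo refuses to run for a process that has no terminal, which would match a check that works in a shell but not under the NRPE daemon. Whether that is the cause here is an assumption; nothing in this thread confirms it. The helper name `find_requiretty` is made up for this sketch.

```shell
#!/bin/sh
# find_requiretty FILE
# Prints every active (uncommented) Defaults line in a sudoers-style FILE that
# mentions requiretty. Returns 0 if any was found, 1 otherwise.
find_requiretty() {
    grep -E '^[[:space:]]*Defaults.*requiretty' "$1"
}

# As root on nas1:
#   find_requiretty /etc/sudoers
# If it prints a line such as "Defaults    requiretty", a per-user override
# is one way to exempt the nagios user (edit with visudo):
#   Defaults:nagios !requiretty
```

Capturing sudo's standard error from inside the check script would confirm or rule this out, because sudo prints the reason when it refuses to run.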

abrist · Post by **abrist** » Tue Jun 04, 2013 2:22 pm

I ran into similar issues when trying to use a 3ware RAID binary on one of my core installations. I had to set the execute bit on the 3ware binary.

Alternatively you could try adding 'sudo' to the nrpe.cfg command itself:

Code:

command[check_raid_status]=sudo /usr/local/nagios/libexec/raid_monitoring_megaraid.sh

Nagios Support Forum
