custom Nagios check script not returning value to Nagios

Posted: Fri May 31, 2013 1:28 pm
by theace18
I'm having trouble running a custom Nagios check script that I created. The purpose of the script is to check for a degraded state on our MegaRAID cards and report it back to Nagios.

Code: Select all

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
else
        echo "OK: No Errors Found"
        exit 0
fi
On the server that this script runs on, I have given the nagios user sudo access (with no password) to ONLY this script and the MegaCli64 binary, since MegaCli64 will only report the data back if you run it as the root user. Now here is an interesting twist:

If I run this script locally on the server, it runs fine and returns the proper error code.

[root@nas1 nrpe_raid_monitor]# su nagios
sh-3.2$ ./raid_monitoring_megaraid.sh
CRITICAL: RAID Errors Present


However, if I run the script through NRPE/Nagios, it always returns as if the status is OK, even though it isn't.

Any thoughts on what I could be doing wrong? This is the first time I've ever written custom Nagios checks, so I may be missing something.

Any help would be greatly appreciated. Thanks!

Re: custom Nagios check script not returning value to Nagios

Posted: Fri May 31, 2013 2:36 pm
by abrist
First of all, the script will return OK whenever the MegaCli64 call fails (for example, because of a syntax or permissions problem), due to the catch-all "else" branch:
theace18 wrote:
else
echo "OK: No Errors Found"
exit 0
fi
You may want to add a bit more validation/logic to the plugin.

Can you run this from the cli on the nagios server through nrpe and post the output?

EDIT: I suggest you also return the output from the actual check or write the output to a file in /tmp to facilitate troubleshooting.
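
To illustrate the kind of validation suggested above, here is a minimal sketch, not a tested plugin. It assumes the MegaCli output is passed in as text, and the function name `classify_raid_output` is made up for this example. The idea is to report UNKNOWN (plugin exit code 3) when no "State" lines are present at all, instead of falling through to OK.

```shell
#!/bin/sh
# Hypothetical helper: classify MegaCli -LDInfo text passed as $1.
# Prints one Nagios status line and returns the matching plugin exit code
# (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).
classify_raid_output() {
    output=$1
    # Count every "State :" line; zero means the command gave us nothing usable.
    total=$(printf '%s\n' "$output" | grep -c '^ *State *:')
    # Count only the "State :" lines that mention Degraded.
    degraded=$(printf '%s\n' "$output" | grep -c '^ *State *:.*Degraded')
    if [ "$total" -eq 0 ]; then
        echo "UNKNOWN: no virtual drive state found in MegaCli output"
        return 3
    elif [ "$degraded" -ge 1 ]; then
        echo "CRITICAL: $degraded degraded virtual drive(s)"
        return 2
    else
        echo "OK: $total virtual drive(s), none degraded"
        return 0
    fi
}
```

In the real check, the last two lines of the script could be `classify_raid_output "$(sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL 2>&1)"` followed by `exit $?`, so that a failed MegaCli64 call shows up as UNKNOWN rather than OK.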

Re: custom Nagios check script not returning value to Nagios

Posted: Fri May 31, 2013 4:11 pm
by theace18
So I updated the script a bit:

Code: Select all

#!/bin/sh

# script to check the status of the RAID volume on a server with a Megaraid RAID card

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
elif [ $COUNT -eq 0 ]
then
        echo "OK: No Errors Found"
        exit 0
fi
The script is basically looking for a Degraded state in the output of the MegaCli utility. When I run the command as the nagios user I get:

sh-3.2$ sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 233.312 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None


Virtual Drive: 1 (Target Id: 1)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 3.637 TB
State : Degraded
Strip Size : 64 KB
Number Of Drives per span:2
Span Depth : 2
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None



Exit Code: 0x00


Most of that info up above is not needed. I just want to know if I have a Virtual Drive that is degraded, and how many. So I run this:


sh-3.2$ sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l
1


The 1 means I have one degraded virtual drive on my array, hence the alert.

So then I run the script manually on my nas1 server as the nagios user:

sh-3.2$ ./raid_monitoring_megaraid.sh
CRITICAL: RAID Errors Present


It even exits with an exit status of 2, which is supposed to trigger the CRITICAL alarm:

sh-3.2$ echo $?
2


However, when I run the script through NRPE on the Nagios server, it says everything is OK.

[root@nagios libexec]# ./check_nrpe -H nas1 -c check_raid_status
OK: No Errors Found


I'm at a loss. Any suggestions? Thanks in advance.

Re: custom Nagios check script not returning value to Nagios

Posted: Mon Jun 03, 2013 11:29 am
by abrist
Make the script output to a file as well:

Code: Select all

sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL > /tmp/test
This way we can actually see what is returned from the nrpe call.

Re: custom Nagios check script not returning value to Nagios

Posted: Mon Jun 03, 2013 12:20 pm
by theace18
Ran as the Nagios user:

sh-3.2$ sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL >/tmp/test
sh-3.2$ cat /tmp/test



Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 233.312 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None


Virtual Drive: 1 (Target Id: 1)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 3.637 TB
State : Degraded
Strip Size : 64 KB
Number Of Drives per span:2
Span Depth : 2
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None



Exit Code: 0x00


So I see there is that Exit Code: 0x00. Now I could see where that would be a problem, but in my custom script, I just tell it to grep for the word "Degraded" so I can look for degraded volumes. So that Exit Code: 0x00 wouldn't matter, would it?

Re: custom Nagios check script not returning value to Nagios

Posted: Mon Jun 03, 2013 12:44 pm
by abrist
I don't think it would matter. Try sticking that bit of code into your script so you can check the output while running the actual check:

Code: Select all

#!/bin/sh

# script to check the status of the RAID volume on a server with a Megaraid RAID card

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL > /tmp/test
if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
elif [ $COUNT -eq 0 ]
then
        echo "OK: No Errors Found"
        exit 0
fi

Re: custom Nagios check script not returning value to Nagios

Posted: Mon Jun 03, 2013 4:32 pm
by theace18
Ah thank you!

So entering a little bit of code to output the results helped.

I changed the script to write both the wc -l count and the full MegaCli64 command output to /tmp/test.

Code: Select all

#!/bin/sh

# script to check the status of the RAID volume on a server with a Megaraid RAID card

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l >/tmp/test
sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL >>/tmp/test

if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
elif [ $COUNT -eq 0 ]
then
        echo "OK: No Errors Found"
        exit 0
fi
So again, I can run this custom script as the nagios user. It writes the correct values to /tmp/test:

sh-3.2$ ./raid_monitoring_megaraid.sh
CRITICAL: RAID Errors Present
sh-3.2$ cat /tmp/test
1


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 233.312 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None


Virtual Drive: 1 (Target Id: 1)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 3.637 TB
State : Degraded
Strip Size : 64 KB
Number Of Drives per span:2
Span Depth : 2
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None



Exit Code: 0x00

Then I try to run it through the Nagios server manually using NRPE:

[root@nagios servers]# /usr/local/nagios/libexec/check_nrpe -H nas1 -c check_raid_status
OK: No Errors Found

And the output of /tmp/test:

sh-3.2$ cat /tmp/test
0


So this tells me that the script is running, but the count is 0 and no MegaCli64 output is logged, so it apparently can't run the MegaCli64 program successfully under NRPE. That doesn't make sense to me. I gave sudo rights to the nagios user on the NAS, set to ALL with NOPASSWD. Furthermore, I can run the script manually just fine as the nagios user. So I'm REALLY confused now. Any thoughts?
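
Since the redirects used so far only capture stdout, any error message that sudo or MegaCli64 prints on stderr never reaches /tmp/test. A minimal sketch of a logging wrapper follows; the name `run_and_log` is hypothetical, and it simply appends stdout, stderr and the exit status of whatever command it is given to a log file. One possible cause worth looking for in such a log (an assumption, not something confirmed in this thread) is sudo refusing to run without a terminal when `requiretty` is set in sudoers.

```shell
#!/bin/sh
# Hypothetical helper: run a command, append its stdout, stderr and exit
# status to a log file, and return the command's own exit status.
# Usage: run_and_log LOGFILE COMMAND [ARGS...]
run_and_log() {
    logfile=$1
    shift
    # Record who ran what, so NRPE runs can be told apart from manual runs.
    echo "--- $(date) as $(id -un): $*" >>"$logfile"
    # 2>&1 sends error messages to the same log as normal output.
    "$@" >>"$logfile" 2>&1
    status=$?
    echo "--- exit status: $status" >>"$logfile"
    return "$status"
}
```

Inside the check script this could be called as `run_and_log /tmp/test /usr/bin/sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL`, then compared between a manual run and a run triggered through check_nrpe.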

Re: custom Nagios check script not returning value to Nagios

Posted: Mon Jun 03, 2013 4:47 pm
by sreinhardt
Can you show us the output from the following commands please:

Code: Select all

cat /etc/xinetd.d/nrpe
cat /etc/sudoers | grep nagios
Also, I suggest referencing your programs by their full path names, such as /usr/bin/sudo, to avoid any PATH issues with NRPE.
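
As a rough sketch of that idea, the script could verify up front that every binary it calls by absolute path exists and is executable, and return UNKNOWN otherwise. The function name `require_executables` is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: check that each absolute path given as an argument
# is an executable file. Prints an UNKNOWN line and returns 3 on the first
# path that fails the check, otherwise returns 0.
require_executables() {
    for bin in "$@"; do
        if [ ! -x "$bin" ]; then
            echo "UNKNOWN: $bin is missing or not executable"
            return 3
        fi
    done
    return 0
}
```

At the top of the check this might read `require_executables /usr/bin/sudo /usr/local/bin/MegaCli64 || exit 3`. Note that the test runs as the calling user, so it says nothing about what sudo itself will allow.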

Re: custom Nagios check script not returning value to Nagios

Posted: Mon Jun 03, 2013 5:04 pm
by theace18
I modified my script to utilize the full path for sudo: /usr/bin/sudo. Still doesn't work.

NRPE runs on all of our servers as a daemon, not through xinetd.

Below is the nagios line in the /etc/sudoers file:

[root@nas1 nagios]# cat /etc/sudoers|grep nagios
nagios ALL=(ALL) NOPASSWD: ALL
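
For comparison, a narrower sudoers setup along the lines described in the first post might look like the fragment below. This is only a sketch under assumptions: the `requiretty` override is a guess at a possible cause and is not confirmed anywhere in this thread, and the file should be edited with visudo.

```
# Assumption: let the nagios user run sudo without a terminal (NRPE has no TTY)
Defaults:nagios !requiretty
# Allow only the MegaCli64 binary, as root, without a password
nagios ALL=(root) NOPASSWD: /usr/local/bin/MegaCli64
```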

Re: custom Nagios check script not returning value to Nagios

Posted: Tue Jun 04, 2013 2:22 pm
by abrist
I ran into similar issues when trying to use a 3ware RAID binary on one of my core installations. I had to set the execute bit on the 3ware binary.

Alternatively you could try adding 'sudo' to the nrpe.cfg command itself:

Code: Select all

command[check_raid_status]=sudo /usr/local/nagios/libexec/raid_monitoring_megaraid.sh