custom Nagios check script not returning value to Nagios

theace18
Posts: 17
Joined: Fri May 31, 2013 11:25 am

Post by theace18 »

I'm having trouble with a custom Nagios check script that I wrote. The purpose of the script is to check for a degraded state on our MegaRAID cards and report it back to Nagios.

Code:

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
else
        echo "OK: No Errors Found"
        exit 0
fi

On the server where this script runs, I have given the nagios user sudo access (with no password) to ONLY this script and the MegaCli64 binary, since MegaCli64 only reports the data back if you run it as the root user. Now here is an interesting twist:

If I run this script locally on the server, it runs fine and returns the proper error code.

[root@nas1 nrpe_raid_monitor]# su nagios
sh-3.2$ ./raid_monitoring_megaraid.sh
CRITICAL: RAID Errors Present


However, if I run the script through NRPE/Nagios, it always returns as if the status is OK, even though it isn't.

Any thoughts on what I could be doing wrong? This is the first time I've written custom Nagios checks, so I may be missing something.

Any help would be greatly appreciated. Thanks!
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: custom Nagios check script not returning value to Nagios

Post by abrist »

First of all, it will always return OK, even if the syntax or the permissions are incorrect, because of the catch-all "else" statement.

theace18 wrote:
else
        echo "OK: No Errors Found"
        exit 0
fi

You may want to add a bit more validation/logic to the plugin.

Can you run this from the cli on the nagios server through nrpe and post the output?

EDIT: I suggest you also return the output from the actual check or write the output to a file in /tmp to facilitate troubleshooting.
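A minimal sketch of the extra validation suggested here, not the poster's final script. It reports OK only when the command succeeded and a State line was actually seen, and otherwise returns 3, the Nagios plugin code for UNKNOWN. The names `RAID_CMD` and `check_raid` are hypothetical; `RAID_CMD` exists only so the logic can be exercised without a RAID controller.

```shell
#!/bin/sh
# RAID_CMD is a hypothetical override; the default is the command from the thread.
RAID_CMD=${RAID_CMD:-"sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL"}

check_raid() {
    # Capture stdout and stderr together so a sudo or MegaCli64 failure stays visible.
    status=0
    output=$($RAID_CMD 2>&1) || status=$?

    if [ "$status" -ne 0 ] || [ -z "$output" ]; then
        echo "UNKNOWN: RAID query failed (exit $status)"
        return 3
    fi

    # Refuse to report OK unless at least one virtual drive State line was seen.
    if ! printf '%s\n' "$output" | grep -q '^State'; then
        echo "UNKNOWN: no virtual drive state found in output"
        return 3
    fi

    count=$(printf '%s\n' "$output" | grep -c 'Degraded' || true)
    if [ "$count" -ge 1 ]; then
        echo "CRITICAL: $count degraded virtual drive(s)"
        return 2
    fi

    echo "OK: No Errors Found"
    return 0
}
```

In a real plugin the last line of the script would be `check_raid; exit $?`, so that NRPE receives the function's return code.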
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
theace18
Posts: 17
Joined: Fri May 31, 2013 11:25 am

Re: custom Nagios check script not returning value to Nagios

Post by theace18 »

So I updated the script a bit:

Code:

#!/bin/sh

# script to check the status of the RAID volume on a server with a Megaraid RAID card

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
elif [ $COUNT -eq 0 ]
then
        echo "OK: No Errors Found"
        exit 0
fi

The script basically looks for a Degraded state in the output of the MegaCli utility. When I run the MegaCli64 command as the nagios user, I get:

sh-3.2$ sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 233.312 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None


Virtual Drive: 1 (Target Id: 1)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 3.637 TB
State : Degraded
Strip Size : 64 KB
Number Of Drives per span:2
Span Depth : 2
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None



Exit Code: 0x00


Most of that info up above is not needed. I just want to know if I have a Virtual Drive that is degraded, and how many. So I run this:


sh-3.2$ sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l
1


The 1 means I have one degraded virtual drive on my array, hence the alert.
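As a side note, the same count can be produced with `grep -c` alone, and anchoring the pattern to the State line avoids counting the word "Degraded" if it ever appears in another field. This is only a sketch; `count_degraded` is a hypothetical helper name, and the pattern assumes the State lines look like the ones in the output above.

```shell
#!/bin/sh
# count_degraded reads MegaCli-style text on stdin and prints how many
# lines begin with "State" and mention "Degraded".
count_degraded() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the
    # helper's own status at 0 so only the printed number matters.
    grep -c '^State.*Degraded' || true
}
```

It would be used as `sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL | count_degraded`.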

So then I run the script manually on my nas1 server as the nagios user:

sh-3.2$ ./raid_monitoring_megaraid.sh
CRITICAL: RAID Errors Present


It even exits with an exit status of 2, which is supposed to trigger the CRITICAL alarm:

sh-3.2$ echo $?
2


However, when I run the script through NRPE on the Nagios server, it says everything is OK.

[root@nagios libexec]# ./check_nrpe -H nas1 -c check_raid_status
OK: No Errors Found


I'm at a loss. Any suggestions? Thanks in advance.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: custom Nagios check script not returning value to Nagios

Post by abrist »

Make the script output to a file as well:

Code:

sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL > /tmp/test

This way we can actually see what is returned from the nrpe call.
theace18
Posts: 17
Joined: Fri May 31, 2013 11:25 am

Re: custom Nagios check script not returning value to Nagios

Post by theace18 »

Ran as the Nagios user:

sh-3.2$ sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL >/tmp/test
sh-3.2$ cat /tmp/test



Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 233.312 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None


Virtual Drive: 1 (Target Id: 1)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 3.637 TB
State : Degraded
Strip Size : 64 KB
Number Of Drives per span:2
Span Depth : 2
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None



Exit Code: 0x00


So I see there is that Exit Code: 0x00. Now I could see where that would be a problem, but in my custom script, I just tell it to grep for the word "Degraded" so I can look for degraded volumes. So that Exit Code: 0x00 wouldn't matter, would it?
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: custom Nagios check script not returning value to Nagios

Post by abrist »

I don't think it would matter. Try sticking that bit of code into your script so you can check the output while running the actual check:

Code:

#!/bin/sh

# script to check the status of the RAID volume on a server with a Megaraid RAID card

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL > /tmp/test
if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
elif [ $COUNT -eq 0 ]
then
        echo "OK: No Errors Found"
        exit 0
fi
theace18
Posts: 17
Joined: Fri May 31, 2013 11:25 am

Re: custom Nagios check script not returning value to Nagios

Post by theace18 »

Ah thank you!

So entering a little bit of code to output the results helped.

I changed the script to write both the wc -l count and the full MegaCli64 command output to /tmp/test.

Code:

#!/bin/sh

# script to check the status of the RAID volume on a server with a Megaraid RAID card

# Get status of RAID card

COUNT=`sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l`
sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL|grep Degraded|wc -l >/tmp/test
sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL >>/tmp/test

if [ $COUNT -ge 1 ]
then
        echo "CRITICAL: RAID Errors Present"
        exit 2
elif [ $COUNT -eq 0 ]
then
        echo "OK: No Errors Found"
        exit 0
fi

So again, I can run this custom script as the nagios user, and it writes the correct values to /tmp/test:

sh-3.2$ ./raid_monitoring_megaraid.sh
CRITICAL: RAID Errors Present
sh-3.2$ cat /tmp/test
1


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 233.312 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None


Virtual Drive: 1 (Target Id: 1)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 3.637 TB
State : Degraded
Strip Size : 64 KB
Number Of Drives per span:2
Span Depth : 2
Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None



Exit Code: 0x00

Then I try to run it through the Nagios server manually using NRPE:

[root@nagios servers]# /usr/local/nagios/libexec/check_nrpe -H nas1 -c check_raid_status
OK: No Errors Found

And the output of /tmp/test:

sh-3.2$ cat /tmp/test
0


So this tells me that the script is running under NRPE, but the count comes back as 0 because it can't run the MegaCli64 program successfully, which doesn't make sense. I gave sudo rights to the nagios user on the NAS, set to ALL with NOPASSWD. Furthermore, I can run the script manually just fine as the nagios user. So I'm REALLY confused now. Any thoughts?
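A count of 0 with no other output suggests the command failed and its error message went to stderr, which the `>/tmp/test` redirections in the script do not capture. A small sketch of a debugging helper that records the error text, the exit status, and the context the check runs in (user, terminal, PATH) is shown here. `debug_run` is a hypothetical name and the log path is whatever the caller passes in.

```shell
#!/bin/sh
# debug_run LOGFILE COMMAND [ARGS...]
# Appends the caller's identity, terminal status, PATH, the command's
# combined stdout/stderr, and its exit status to LOGFILE.
debug_run() {
    log=$1
    shift
    {
        echo "--- $(date) ---"
        echo "user: $(id -un)"
        # tty reports "not a tty" when stdin is not a terminal,
        # which is the normal situation for a daemon such as NRPE.
        echo "tty: $(tty 2>&1)"
        echo "PATH: $PATH"
        status=0
        "$@" 2>&1 || status=$?
        echo "exit status: $status"
    } >> "$log"
}
```

Inside the check script it could be called as `debug_run /tmp/test sudo /usr/local/bin/MegaCli64 -LDInfo -Lall -aALL`, and the log compared between a manual run and a run through check_nrpe.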
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: custom Nagios check script not returning value to Nagios

Post by sreinhardt »

Can you show us the output from the following commands please:

Code:

cat /etc/xinetd.d/nrpe
cat /etc/sudoers | grep nagios

Also, I suggest referencing your programs by their full path names, such as /usr/bin/sudo, to avoid any PATH issues with NRPE.
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
theace18
Posts: 17
Joined: Fri May 31, 2013 11:25 am

Re: custom Nagios check script not returning value to Nagios

Post by theace18 »

I modified my script to utilize the full path for sudo: /usr/bin/sudo. Still doesn't work.

NRPE runs on all of our servers as a daemon, not through xinetd.

Below is the nagios line in the /etc/sudoers file:

[root@nas1 nagios]# cat /etc/sudoers|grep nagios
nagios ALL=(ALL) NOPASSWD: ALL
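
One thing worth ruling out, which this thread does not confirm as the cause: on some distributions the default sudoers file sets `Defaults requiretty`, and sudo then refuses to run for a process with no terminal, such as the NRPE daemon, even though the same command works from an interactive `su nagios` shell. The `grep nagios` filter above would not show that line. A sketch of a check for it follows; `has_requiretty` is a hypothetical helper name.

```shell
#!/bin/sh
# has_requiretty FILE
# Succeeds if FILE contains an uncommented Defaults line that enables
# requiretty. A negated "!requiretty" entry does not count as a match.
has_requiretty() {
    grep -Eq '^[[:space:]]*Defaults[^#]*[[:space:]]requiretty' "$1"
}
```

Run as root, `has_requiretty /etc/sudoers && echo "requiretty is set"` would show whether the option is on. If it is, a per-user override such as `Defaults:nagios !requiretty`, added with visudo, is the usual way to exempt the nagios user.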
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: custom Nagios check script not returning value to Nagios

Post by abrist »

I ran into similar issues when trying to use a 3ware RAID binary on one of my Core installations. I had to set the execute bit on the 3ware binary.

Alternatively you could try adding 'sudo' to the nrpe.cfg command itself:

Code:

command[check_raid_status]=sudo /usr/local/nagios/libexec/raid_monitoring_megaraid.sh