SSH Proxy command is not working

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
isat-iad
Posts: 11
Joined: Fri Mar 07, 2014 6:15 am

SSH Proxy command is not working

Post by isat-iad »

Hi,
We have a slightly bizarre issue, where an SSH Proxy command doesn't work from Nagios XI but it works from the command line.
From the command line:

Code: Select all

$ /usr/local/nagios/libexec/check_by_ssh -H hostname -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
OK - /export/home space used=61% | '/export/home usage'=61%;80;90;
$ echo $?
0
$
From the Nagios XI interface, we get:
(Return code of 255 is out of bounds)

The permissions on the SSH keys are correct and key verification has been done for both name and IP of the monitored host.

When we try to run "Test Check Command" from CCM, we get a different error:

Code: Select all

OUTPUT: Remote command execution failed: Could not create directory '/var/www/.ssh'.
Or if we pass the '-E' flag to the check_by_ssh command, then we get:

Code: Select all

OUTPUT: UNKNOWN - check_by_ssh: Remote command '/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80' returned status 255
(Which it doesn't - it returns 0, 1 or 2)

We tried single quotes on the command instead of double quotes but no luck.

Any ideas?

Thanks,
Michael
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: SSH Proxy command is not working

Post by lmiltchev »

What is the "-d" flag? I don't see it as an option in the usage page. It also seems that if the thresholds are placed AFTER the path in the command, they are not applied. For example:

This should work correctly:

Code: Select all

[nagios@testbox libexec]$ ./check_by_ssh -H 192.168.x.x -C "/usr/local/nagios/libexec/check_disk -w 20% -c 10% /"
DISK OK - free space: / 6983 MB (41% inode=86%);| /=9796MB;14137;15904;0;17672
[nagios@testbox libexec]$ ./check_by_ssh -H 192.168.x.x -C "/usr/local/nagios/libexec/check_disk -w 45% -c 35% /"
DISK WARNING - free space: / 6983 MB (41% inode=86%);| /=9796MB;9719;11486;0;17672
[nagios@testbox libexec]$ ./check_by_ssh -H 192.168.x.x -C "/usr/local/nagios/libexec/check_disk -w 60% -c 50% /"
DISK CRITICAL - free space: / 6983 MB (41% inode=86%);| /=9796MB;7068;8836;0;17672
but this won't (it will always show "OK"):

Code: Select all

[nagios@testbox libexec]$ ./check_by_ssh -H 192.168.x.x -C "/usr/local/nagios/libexec/check_disk / -w 20% -c 10%"
DISK OK - free space: / 6983 MB (41% inode=86%);| /=9796MB;;;0;17672
[nagios@testbox libexec]$ ./check_by_ssh -H 192.168.x.x -C "/usr/local/nagios/libexec/check_disk / -w 45% -c 35%"
DISK OK - free space: / 6982 MB (41% inode=86%);| /=9796MB;;;0;17672
[nagios@testbox libexec]$ ./check_by_ssh -H 192.168.x.x -C "/usr/local/nagios/libexec/check_disk / -w 60% -c 50%"
DISK OK - free space: / 6982 MB (41% inode=86%);| /=9796MB;;;0;17672
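Many option parsers stop reading options at the first non-option argument, which is one way argument order can end up mattering. The sketch below uses the POSIX getopts builtin to illustrate that effect. parse_thresholds is a hypothetical function written for this example; it is not check_disk's actual parser, which may work differently.

```shell
# Hypothetical illustration of order-sensitive option parsing.
# This is NOT check_disk's real parser.
parse_thresholds() {
    warn=""
    crit=""
    OPTIND=1                      # reset so the function can be called repeatedly
    while getopts "w:c:" opt; do  # getopts stops at the first non-option argument
        case "$opt" in
            w) warn=$OPTARG ;;
            c) crit=$OPTARG ;;
            *) return 3 ;;
        esac
    done
    shift $((OPTIND - 1))         # drop the options that were consumed
    echo "path=${1:-} warn=${warn} crit=${crit}"
}

parse_thresholds -w 20% -c 10% /   # thresholds before the path
parse_thresholds / -w 20% -c 10%   # thresholds after the path
```

With this parser the first call prints "path=/ warn=20% crit=10%" and the second prints "path=/ warn= crit=", because parsing ends as soon as "/" is seen.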
Don't worry about this output:
OUTPUT: Remote command execution failed: Could not create directory '/var/www/.ssh'.
Testing from the CCM is not 100% reliable (it doesn't work with all checks) and it's not supposed to be a substitute for testing from the CLI.
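The '/var/www/.ssh' path in that message suggests the test ran under an account whose home directory is /var/www (often the web server user), so ssh looked for its keys and known_hosts there instead of under the nagios user's home. A minimal sketch for checking which .ssh directory an account would use follows. ssh_dir_for is a hypothetical helper, and the "apache" account name in the usage comment is an assumption about the XI server.

```shell
# Hypothetical helper: print the ~/.ssh directory for an account,
# based on the home directory recorded in a passwd-format file.
ssh_dir_for() {
    user=$1
    passwd_file=${2:-/etc/passwd}   # second argument is optional
    home=$(awk -F: -v u="$user" '$1 == u { print $6 }' "$passwd_file")
    if [ -z "$home" ]; then
        echo "unknown user: $user" >&2
        return 1
    fi
    echo "$home/.ssh"
}

# Usage (account names are assumptions):
#   ssh_dir_for nagios
#   ssh_dir_for apache
```

If the two accounts report different directories, a check that works as nagios from the CLI can still fail when it is launched by the other account.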

Go to CCM->Services and change the text in the $ARG1$ field to the following:

Code: Select all

-C "/opt/nagios/libexec/check_disk -w 20 -c 10 /export/home"


Save and Apply Configuration. Go to Home->Service Detail, click on the service and schedule an immediate check to see if you are going to get the correct output this time.

Note: You can use whatever thresholds you need to. This is just an example.
Be sure to check out our Knowledgebase for helpful articles and solutions!
isat-iad
Posts: 11
Joined: Fri Mar 07, 2014 6:15 am

Re: SSH Proxy command is not working

Post by isat-iad »

service_OK.png
Hi,

I should have mentioned that we are using custom scripts on this monitored host: the check_disk script is not the same as the check_disk script supplied by Nagios, which is why it has a '-d' flag. Having said that, we have exactly the same custom script running on other servers and it works without an issue. We also have other custom scripts that don't work for that specific host, which makes me think it may have something to do with the version of SSH (SunSSH for Solaris 9 on Sparc).

But what puzzles me is that the check works every time when I run it from the CLI of the Nagios XI box against that host. Furthermore, when I click "schedule an immediate check" from the Nagios XI service screen, I see it working (Green) the first time and subsequently it changes to CRITICAL with this message "(Return code of 255 is out of bounds)".
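Because the failure is intermittent, one way to try to catch it from the CLI is to repeat the same check many times and count the exit codes. This is a minimal sketch; tally_exit_codes is a hypothetical helper, and the check_by_ssh arguments in the usage comment are the ones from the first post.

```shell
# Hypothetical helper: run a command N times and count each exit code.
tally_exit_codes() {
    runs=$1
    shift
    i=0
    results=""
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1
        # Record the exit status of the command, one per line
        results="${results}$?
"
        i=$((i + 1))
    done
    printf '%s' "$results" | sort -n | uniq -c |
        awk '{ print "rc=" $2 " count=" $1 }'
}

# Usage (run as the nagios user on the XI server):
#   tally_exit_codes 50 /usr/local/nagios/libexec/check_by_ssh -H hostname \
#       -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
```

Any line other than rc=0, rc=1 or rc=2 in the output would show that the problem can be reproduced outside the XI interface.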

Thanks,
Michael
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: SSH Proxy command is not working

Post by lmiltchev »

Can you run the script locally on the remote machine as a nagios user, or only as root? Did you use the following document for setting up check_by_ssh?

http://assets.nagios.com/downloads/nagi ... ng_SSH.pdf

Run the following commands from the CLI on the Nagios XI server and show us the output:

Code: Select all

su nagios
/usr/local/nagios/libexec/check_by_ssh -H <remoteip> -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
isat-iad
Posts: 11
Joined: Fri Mar 07, 2014 6:15 am

Re: SSH Proxy command is not working

Post by isat-iad »

The scripts on the remote machine run as the nagios user and we did follow the document for setting them up - we have exactly the same scripts working on other machines.

When I run the command manually from the CLI of the Nagios XI box, it always works, using either the hostname or the IP address of the remote machine. It is only from the Nagios XI GUI that it doesn't work.

Code: Select all

[nagios@nagiosxi ~]$ /usr/local/nagios/libexec/check_by_ssh -H hostname -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
OK - /export/home space used=61% | '/export/home usage'=61%;80;90;
[nagios@nagiosxi ~]$ /usr/local/nagios/libexec/check_by_ssh -H remoteip -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
OK - /export/home space used=61% | '/export/home usage'=61%;80;90;
[nagios@nagiosxi ~]$ /usr/local/nagios/libexec/check_by_ssh -H remoteip -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
OK - /export/home space used=61% | '/export/home usage'=61%;80;90;
[nagios@nagiosxi ~]$ /usr/local/nagios/libexec/check_by_ssh -H remoteip -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
OK - /export/home space used=61% | '/export/home usage'=61%;80;90;
[nagios@nagiosxi ~]$ /usr/local/nagios/libexec/check_by_ssh -H remoteip -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
OK - /export/home space used=61% | '/export/home usage'=61%;80;90;
[nagios@nagiosxi ~]$ /usr/local/nagios/libexec/check_by_ssh -H remoteip -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
OK - /export/home space used=61% | '/export/home usage'=61%;80;90;
[nagios@nagiosxi ~]$ /usr/local/nagios/libexec/check_by_ssh -H remoteip -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
OK - /export/home space used=61% | '/export/home usage'=61%;80;90;
[nagios@nagiosxi ~]$ /usr/local/nagios/libexec/check_by_ssh -H remoteip -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
OK - /export/home space used=61% | '/export/home usage'=61%;80;90;
[nagios@nagiosxi ~]$ /usr/local/nagios/libexec/check_by_ssh -H remoteip -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
OK - /export/home space used=61% | '/export/home usage'=61%;80;90;
[nagios@nagiosxi ~]$ id
uid=501(nagios) gid=100(users) groups=100(users),501(nagios),502(nagcmd)
[nagios@nagiosxi ~]$
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: SSH Proxy command is not working

Post by slansing »

Can we take a look at your check_by_ssh command in XI, the one you are using for these services? Navigate to Configure > CCM > Commands > "name of the check_command you are using", copy its command_line, and post it here. Can we also see the service configuration for either this service or the Disk Health one? Are you using the exact same command in Nagios for other servers, or do you have multiple derivatives? Are the other services on that host also using the same check_command (check_by_ssh)?
isat-iad
Posts: 11
Joined: Fri Mar 07, 2014 6:15 am

Re: SSH Proxy command is not working

Post by isat-iad »

Hello,
The command line for disk_health.ksh is

Code: Select all

$USER1$/disk_health.ksh $ARG1$
The command line for check_by_ssh is

Code: Select all

$USER1$/check_by_ssh -H $HOSTADDRESS$ $ARG1$ $ARG2$
The service configuration file is below:

Code: Select all

[root@nagiosxi services]# cat solaris_disk_health.cfg
###############################################################################
#
# Service configuration file
#
# Created by: Nagios QL Version 3.0.3
# Date:       2014-06-20 11:01:43
# Version:    Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND ---
# Nagios QL will overwite all manual settings during the next update
#
###############################################################################

define service {
        host_name                       hostname
        service_description             Disk Health
        hostgroup_name                  aims-servers
        check_command                   check_nrpe!check_disk_health!-a 'noiostat'!!!!!!
        max_check_attempts              3
        check_interval                  60
        retry_interval                  1
        check_period                    xi_timeperiod_24x7
        notification_period             xi_timeperiod_24x7
        contacts                        operations
        register                        1
        }

###############################################################################
#
# Service configuration file
#
# END OF FILE
#
###############################################################################
[root@nagiosxi services]#

And the script that runs on the monitored host is:

Code: Select all

bash-3.2# cat disk_health.ksh
#!/bin/ksh
#
# #######################################################
# Script that checks for hard disk problems or errors
# #######################################################
#
# Define the meta commands
METASTAT=/usr/sbin/metastat
METADB=/usr/sbin/metadb
ZPOOL="/usr/sbin/zpool status -x"
#
# Work out Solaris release
RELEASE=$(uname -r | cut -d'.' -f2)
#
# Define iostat thresholds
# for errors and warnings
SOFTERR_WARN_THRES=10
HARDERR_WARN_THRES=2
TRNSERR_WARN_THRES=2
SOFTERR_CRIT_THRES=25
HARDERR_CRIT_THRES=10
TRNSERR_CRIT_THRES=10
#
# Return code
RET_CODE=0

# Quote $1: unquoted and empty, the test is always true
if [ -n "$1" ]
then
        ARG1=$1
else
        ARG1=false
fi

# Check for SVM errors
function meta_check {

        # Check for metadb trouble - on the basis that
        # upper case letter indicate problems.
        DBTROUBLE=`$METADB | tail +2 | /usr/bin/awk \
        '{ fl = substr($0,1,20); if (fl ~ /[A-Z]/) print $0 }'`

        # Check for metadevice trouble - on the basis that
        # a faulty disk will have status other than 'Okay'
        MDTROUBLE=`$METASTAT | /usr/bin/awk \
        '/State:/ { if ( $2 != "Okay" ) print $0 }'`

        if [ ! -z "$MDTROUBLE" ]
        then
                echo "CRITICAL - SVM (metastat) is reporting errors"
                RET_CODE=2
        fi
}

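function zfs_check {

        # $ZPOOL holds a command plus arguments, so test only the binary
        if [ -x /usr/sbin/zpool ]
        then
                # Run the command and capture its output
                ZPOOL_STATUS=$($ZPOOL)
                case $ZPOOL_STATUS in
                "no pools available")
                        ;;
                "all pools are healthy")
                        ;;
                *)
                        # Unquoted * matches any other status text
                        echo "CRITICAL - zpool is reporting errors"
                        RET_CODE=2
                        ;;
                esac
        fi
}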

if [[ $ARG1 != "noiostat" ]];
then
# Check iostat for device errors
/usr/bin/iostat -En | /usr/bin/egrep \
        'c[0-9]+t[0-9]+d[0-9]+' > /tmp/iostat.tmp

        if [ -s /tmp/iostat.tmp ]
        then
        while read line
        do

                DEV=`echo $line | awk '{print $1}'`
                SOFTERR=`echo $line | awk '{print $4}'`
                HARDERR=`echo $line | awk '{print $7}'`
                TRNSERR=`echo $line | awk '{print $10}'`
                if [ $SOFTERR -gt $SOFTERR_WARN_THRES -a $SOFTERR -lt $SOFTERR_CRIT_THRES ]
                then
                        /usr/ucb/echo -n "WARNING - "
                        echo "Device $DEV has $SOFTERR soft errors."
                        RET_CODE=1
                fi
                if [ $SOFTERR -ge $SOFTERR_CRIT_THRES ]
                then
                        /usr/ucb/echo -n "CRITICAL - "
                        echo "Device $DEV has $SOFTERR soft errors."
                        RET_CODE=2
                fi
                if [ $HARDERR -gt $HARDERR_WARN_THRES -a $HARDERR -lt $HARDERR_CRIT_THRES ]
                then
                        /usr/ucb/echo -n "WARNING - "
                        echo "Device $DEV has $HARDERR hard errors."
                        RET_CODE=1
                fi
                if [ $HARDERR -ge $HARDERR_CRIT_THRES ]
                then
                        /usr/ucb/echo -n "CRITICAL - "
                        echo "Device $DEV has $HARDERR hard errors."
                        RET_CODE=2
                fi
                if [ $TRNSERR -gt $TRNSERR_WARN_THRES -a $TRNSERR -lt $TRNSERR_CRIT_THRES ]
                then
                        /usr/ucb/echo -n "WARNING - "
                        echo "Device $DEV has $TRNSERR transport errors."
                        RET_CODE=1
                fi
                if [ $TRNSERR -ge $TRNSERR_CRIT_THRES ]
                then
                        /usr/ucb/echo -n "CRITICAL - "
                        echo "Device $DEV has $TRNSERR transport errors."
                        RET_CODE=2
                fi

        done < /tmp/iostat.tmp
        /bin/rm /tmp/iostat.tmp
        fi
fi

# Check if system is SVM-enabled and run SVM checks
$METASTAT > /dev/null 2>&1
if [ $? -eq 0 ]
then
        meta_check
fi

# If Solaris release is 10 or higher check for ZFS status
if [ $RELEASE -ge 10 ]
then
        zfs_check
fi

if [ $RET_CODE -eq 0 ]
then
        echo "OK - disk health is good"
fi
exit $RET_CODE
bash-3.2#

We are using the exact same command in Nagios XI for other servers and it works. We don't have any other services configured for that host using check_by_ssh - all the ones we have exhibit the same symptoms, they all work fine from the Nagios command line but from the XI Web page they "flap" between OK and (Return code of 255 is out of bounds).

Thanks
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: SSH Proxy command is not working

Post by lmiltchev »

How did you set up this check?
define service {
host_name hostname
service_description Disk Health
hostgroup_name aims-servers
check_command check_nrpe!check_disk_health!-a 'noiostat'!!!!!!
max_check_attempts 3
check_interval 60
retry_interval 1
check_period xi_timeperiod_24x7
notification_period xi_timeperiod_24x7
contacts operations
register 1
}
This doesn't seem at all like the check you've been running (testing) from the CLI...
/usr/local/nagios/libexec/check_by_ssh -H hostname -C "/opt/nagios/libexec/check_disk -d /export/home -c 90 -w 80"
If you ran the "SSH Proxy Wizard", you would have something like this:

Code: Select all

define service {
	host_name			CentOS-SSH
	service_description		Root Disk Space
	use				generic-service
	check_command			check_xi_by_ssh!-C "/usr/local/nagios/libexec/check_disk -w 80% -c 90% /"!!!!!!!
	max_check_attempts		5
	check_interval			5
	retry_interval			1
	check_period			xi_timeperiod_24x7
	notification_interval		60
	notification_period		xi_timeperiod_24x7
	notifications_enabled		1
	contacts			nagiosadmin
	_xiwizard			sshproxy
	register			1
	}
It seems like you are using check_nrpe, not check_by_ssh...
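In a check_command value, everything before the first '!' is the command name, and the remaining '!'-separated fields become $ARG1$, $ARG2$ and so on. The small sketch below makes that split visible for the two definitions in this thread. split_check_command is a hypothetical helper for illustration only and is not part of Nagios.

```shell
# Hypothetical helper: show how a check_command value splits on '!'.
# The body runs in a subshell so the IFS and globbing changes do not leak.
split_check_command() (
    set -f          # no filename globbing while splitting
    IFS='!'
    set -- $1       # unquoted on purpose: split the value on '!'
    printf 'command=%s\n' "${1:-}"
    printf 'ARG1=%s\n' "${2:-}"
    printf 'ARG2=%s\n' "${3:-}"
)

split_check_command "check_nrpe!check_disk_health!-a 'noiostat'!!!!!!"
split_check_command 'check_xi_by_ssh!-C "/opt/nagios/libexec/check_disk -d / -c 90 -w 80"'
```

The first call reports check_nrpe as the command, while the second reports check_xi_by_ssh with the whole -C "..." string as ARG1.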
isat-iad
Posts: 11
Joined: Fri Mar 07, 2014 6:15 am

Re: SSH Proxy command is not working

Post by isat-iad »

Perhaps it was a bad example, but exactly the same thing happens for the check_by_ssh service (which works on other servers):

Code: Select all

[root@nagiosxi services]# cat hostname.cfg
###############################################################################
#
# Service configuration file
#
# Created by: Nagios QL Version 3.0.3
# Date:       2014-06-24 10:38:18
# Version:    Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND ---
# Nagios QL will overwite all manual settings during the next update
#
###############################################################################

define service {
        host_name                       hostname
        service_description             /export/home Disk Space
        use                             generic-service
        check_command                   check_xi_by_ssh!-C "/opt/nagios/libexec/check_disk -w 80 -c 90 -d /export/home"!!!!!!!
        max_check_attempts              5
        check_interval                  5
        retry_interval                  1
        check_period                    xi_timeperiod_24x7
        notification_interval           60
        notification_period             xi_timeperiod_24x7
        contacts                        operations
        _xiwizard                       sshproxy
        register                        1
        }

define service {
        host_name                       hostname
        service_description             Disk Health
        use                             generic-service
        check_command                   check_xi_by_ssh!-C "/opt/nagios/libexec/disk_health.ksh noiostat"
        max_check_attempts              5
        check_interval                  5
        retry_interval                  1
        check_period                    xi_timeperiod_24x7
        notification_interval           60
        notification_period             xi_timeperiod_24x7
        contacts                        operations
        _xiwizard                       sshproxy
        register                        1
        }

define service {
        host_name                       hostname
        service_description             HTTP
        use                             xiwizard_website_http_service
        check_command                   check_xi_service_http! -f ok -I IP_ADDR -u "/" -p 80
        max_check_attempts              5
        check_interval                  5
        retry_interval                  1
        check_period                    xi_timeperiod_24x7
        notification_interval           60
        notification_period             xi_timeperiod_24x7
        contacts                        operations
        _xiwizard                       website
        register                        1
        }

define service {
        host_name                       hostname
        service_description             Ping
        use                             xiwizard_linuxserver_ping_service
        max_check_attempts              5
        check_interval                  5
        retry_interval                  1
        check_period                    xi_timeperiod_24x7
        notification_interval           60
        notification_period             xi_timeperiod_24x7
        contacts                        operations
        _xiwizard                       sshproxy
        register                        1
        }

define service {
        host_name                       hostname
        service_description             Root Disk Space
        use                             generic-service
        check_command                   check_xi_by_ssh!-C "/opt/nagios/libexec/check_disk -d / -c 90 -w 80"
        max_check_attempts              5
        check_interval                  5
        retry_interval                  1
        check_period                    xi_timeperiod_24x7
        notification_interval           60
        notification_period             xi_timeperiod_24x7
        contacts                        operations
        _xiwizard                       sshproxy
        register                        1
        }

###############################################################################
#
# Service configuration file
#
# END OF FILE
#
###############################################################################
[root@nagiosxi services]#
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: SSH Proxy command is not working

Post by sreinhardt »

Could you post your check_disk script or a link to where we can find it? A return code of 255 is outside the 0-3 range that Nagios accepts from a plugin, which is why it is reported as out of bounds. I find it interesting that the check seems to complete some of the time; possibly it is erroring out before reaching a proper return the rest of the time.
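For reference, Nagios plugins are expected to exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The OpenSSH client exits with the remote command's status, or with 255 if ssh itself hit an error, so a 255 may come from the SSH layer rather than from the remote script. Whether SunSSH behaves the same way is an assumption here. A minimal sketch of that mapping follows; describe_exit_code is a hypothetical helper.

```shell
# Hypothetical helper: translate a plugin exit code into a Nagios state.
describe_exit_code() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        # 255 is what OpenSSH's ssh uses for its own failures
        255) echo "out of bounds (possibly an ssh-level failure)" ;;
        *) echo "out of bounds" ;;
    esac
}

# Usage: run a check, then describe its exit status
#   /usr/local/nagios/libexec/check_by_ssh -H hostname -C "..."
#   describe_exit_code $?
```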
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.