N Server shows CRITICAL alert, but check_disk is DISK OK

Post by **coactmwp** » Wed Jul 03, 2019 2:37 pm

Environment: AIX LPAR, running AIX 7.1_TL04_SP03
Nagios version for AIX: 2.0.1.0

25 days ago, our Nagios Server issued a CRITICAL 'Check Disk' alert on the /opt (/dev/hd10opt) filesystem, indicating "DISK CRITICAL - free space: /opt 26 MB (1% inode=32%)"

'df' on the LPAR showed that the /opt filesystem appeared to be fine:

hrmsdbp > / # df -g
Filesystem      GB blocks   Free  %Used   Iused  %Iused  Mounted on
/dev/hd4             0.50   0.31    39%    3963      6%  /
/dev/hd2             4.34   1.81    59%   45930     10%  /usr
/dev/hd9var          2.00   1.80    10%    3530      1%  /var
/dev/hd3             1.00   1.00     1%      59      1%  /tmp
/dev/hd1             0.50   0.50     1%     121      1%  /home
/dev/hd11admin       0.12   0.12     1%       7      1%  /admin
/proc                   -      -      -       -       -  /proc
/dev/hd10opt         2.00   0.78    62%   19963     10%  /opt
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/dev/datalv          0.12   0.12     1%      64      1%  /data

I thought the issue was over-consumed and unreleased filesystem inodes, so I scheduled a maintenance window to reboot the LPAR.
The reboot of the LPAR had no effect on the alert on the Nagios Server.

To diagnose the problem, I ran the 'check_disk' executable local to the system, and got this:

hrmsdbp > / # /usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/hd10opt
DISK OK - free space: /opt 795 MB (38% inode=90%);| /opt=1252MB;1843;1945;0;2048

The local 'check_disk' command returns "DISK OK" on the filesystem that the Nagios Server thinks has a problem.
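
For readers following along: the text after the `|` in the plugin output is performance data in the form `label=value;warn;crit;min;max`. The sketch below, which assumes exactly the output line shown above, splits that string so the thresholds can be read directly. `perf_field` is a hypothetical helper name used for illustration; it is not part of Nagios or check_disk.

```shell
#!/bin/sh
# Hypothetical helper: print one field of a single perfdata item.
# Usage: perf_field 'label=value;warn;crit;min;max' <value|warn|crit|min|max>
perf_field() {
    data=${1#*=}            # strip the "label=" prefix
    want=$2
    old_ifs=$IFS
    IFS=';'
    set -- $data            # split on ';' into $1..$5
    IFS=$old_ifs
    case $want in
        value) echo "$1" ;;
        warn)  echo "$2" ;;
        crit)  echo "$3" ;;
        min)   echo "$4" ;;
        max)   echo "$5" ;;
    esac
}

# Output line copied from the local check_disk run above.
line='DISK OK - free space: /opt 795 MB (38% inode=90%);| /opt=1252MB;1843;1945;0;2048'
perf=${line#*'| '}          # keep only the perfdata after "| "

used=$(perf_field "$perf" value)
used=${used%MB}             # drop the unit suffix
echo "used ${used} of $(perf_field "$perf" max) MB; warn at $(perf_field "$perf" warn), crit at $(perf_field "$perf" crit)"
```

The 1843 and 1945 figures are 90% and 95% of the 2048 MB total, rounded down, which appears consistent with `-w 10% -c 5%` being reported as used-space limits in the performance data.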

I engaged our Nagios administrator, and she double-checked that the correct command was referenced in the Nagios Server config for the LPAR:
Nagios Server:

service_description Check /opt
check_command check_nrpe_aix!check_disk4

Local nrpe.cfg entry:

command[check_disk4]=/usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/hd10opt

I even went so far as to deinstall and reinstall the Nagios.rte fileset on the system, preserving the nrpe.cfg file.
Again, no effect on the alert on the Nagios Server side.

Both I and our Nagios Administrator are stymied as to what could be causing this issue...

Any help?

Post by **benjaminsmith** » Wed Jul 03, 2019 4:44 pm

Hello @coactmwp,

Thanks for uploading the screenshots and other relevant data. It looks like you are running the check command on the remote host as the root user. Let's try running the command as the nagios user and compare the check results.

For example:

su nagios
/usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/hd10opt

Then, still as the nagios user on the remote host, run the same command through check_nrpe:

./check_nrpe -H 127.0.0.1 -c check_disk4

Are the results the same as reported by the Nagios Server?

Post by **coactmwp** » Wed Jul 03, 2019 8:22 pm

@benjaminsmith

Thank you kindly for your prompt reply.

Here are the output results that you requested from our remote host.

First, running the check_disk command on the remote host as the nagios user:

hrmsdbp > /usr/local/nagios/etc # su nagios
hrmsdbp > /usr/local/nagios/etc # whoami
nagios
hrmsdbp > /usr/local/nagios/etc # /usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/hd10opt
DISK OK - free space: /opt 795 MB (38% inode=90%);| /opt=1252MB;1843;1945;0;2048

Next, running the check_nrpe command on the remote host as the nagios user:

hrmsdbp > /usr/local/nagios/etc # /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -c check_disk4
DISK OK - free space: /opt 795 MB (38% inode=90%);| /opt=1252MB;1843;1945;0;2048
hrmsdbp > /usr/local/nagios/etc # whoami
nagios
hrmsdbp > /usr/local/nagios/etc #

Unfortunately, no joy.
Both commands report DISK OK, even executed by the nagios user, so it is still a mystery to me and the Nagios Administrator here why the Nagios Server persists in maintaining the CRITICAL alert...
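
When comparing runs like these, it can help to look at the plugin's exit code as well as its text, since Nagios derives the service state from the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A small sketch, assuming a POSIX shell; `state_name` and `run_check` are hypothetical helper names, and the `sh -c` line is only a stand-in for a real plugin call.

```shell
#!/bin/sh
# Map a plugin exit code to the state name Nagios associates with it.
# Anything outside 0-2 is reported here as UNKNOWN.
state_name() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Run any plugin command line and print its state followed by its output.
run_check() {
    output=$("$@")
    rc=$?
    echo "$(state_name "$rc"): $output"
}

# Stand-in for a real plugin call. On the host this could instead be:
#   run_check /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -c check_disk4
run_check sh -c 'echo "DISK OK - demo output"; exit 0'
```

Running the same wrapper on the Nagios Server against the LPAR's address, rather than 127.0.0.1, would show whether the server-side call returns the same exit code and text as the local one.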

Post by **ssax** » Mon Jul 08, 2019 4:51 pm

What does this output?

/usr/local/nagios/libexec/check_disk -w 90% -c 95% -p /dev/hd10opt

Post by **lmiltchev** » Mon Jul 08, 2019 5:03 pm

Does the state change if you forcefully re-schedule the next check of this service from the GUI? Can you post the service config, along with any relevant objects, i.e. a command and/or a template that this service is using?

Post by **coactmwp** » Mon Jul 08, 2019 6:03 pm

@ssax

Here is the output of the command you requested, executed as the nagios user on the affected system:

hrmsdbp > / # su nagios
hrmsdbp > / # whoami
nagios
hrmsdbp > / # /usr/local/nagios/libexec/check_disk -w 90% -c 95% -p /dev/hd10opt
DISK CRITICAL - free space: /opt 794 MB (38% inode=90%);| /opt=1253MB;204;102;0;2048
hrmsdbp > / #

... So that would explain why the server is listing the filesystem with a CRITICAL alert, but the above command is not the one that is referenced in the nrpe.cfg file, and the filesystem itself doesn't reflect what the above command indicates.
Here is the current 'df -g' output from the system:

hrmsdbp > / # df -g
Filesystem      GB blocks   Free  %Used   Iused  %Iused  Mounted on
/dev/hd4             0.50   0.30    40%    3965      6%  /
/dev/hd2             4.34   1.81    59%   45930     10%  /usr
/dev/hd9var          2.00   1.80    11%    3534      1%  /var
/dev/hd3             1.00   1.00     1%      56      1%  /tmp
/dev/hd1             0.50   0.50     1%     121      1%  /home
/dev/hd11admin       0.12   0.12     1%       7      1%  /admin
/proc                   -      -      -       -       -  /proc
/dev/hd10opt         2.00   0.78    62%   19973     10%  /opt
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/dev/datalv          0.12   0.12     1%      64      1%  /data
/dev/u01lv          68.00  34.92    49%  209097      3%  /u01

The filesystem isn't indicating that it is 95% full, nor that there are only 5% inodes left for the filesystem...

What gives?
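
One possible reading of this result: according to the check_disk help text, the `-w` and `-c` percentages are the minimum amount of free space, not the maximum amount of used space. Under that reading, 38% free is below a 95% free requirement, so CRITICAL is expected for the test command even though the filesystem is healthy. A minimal sketch of that comparison follows. `disk_state` is a hypothetical helper, and it ignores the inode and unit-based thresholds that the real plugin also supports.

```shell
#!/bin/sh
# Hypothetical helper mirroring the documented meaning of check_disk -w/-c:
# both thresholds are minimum FREE percentages.
# Usage: disk_state <free_pct> <warn_pct> <crit_pct>
disk_state() {
    free_pct=$1
    warn_pct=$2
    crit_pct=$3
    if [ "$free_pct" -lt "$crit_pct" ]; then
        echo CRITICAL       # less free space than the critical minimum
    elif [ "$free_pct" -lt "$warn_pct" ]; then
        echo WARNING        # less free space than the warning minimum
    else
        echo OK
    fi
}

disk_state 38 10 5      # thresholds from nrpe.cfg (-w 10% -c 5%)
disk_state 38 90 95     # thresholds from the test command (-w 90% -c 95%)
```

With 38% free, the nrpe.cfg thresholds give OK and the test thresholds give CRITICAL, which matches both plugin outputs in this thread. It does not by itself explain the original "26 MB (1%)" alert text on the server.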

Post by **coactmwp** » Mon Jul 08, 2019 6:34 pm

lmiltchev wrote: Does the state change if you forcefully re-schedule the next check of this service from the GUI? Can you post the service config, along with any relevant objects, i.e. a command and/or a template that this service is using?

@lmiltchev

This CRITICAL alert has now persisted for 28 days straight. After the maintenance-window reboot of the LPAR that I arranged last week, I forced a re-check of the service from the Nagios Server GUI, and the CRITICAL alert did not change.

Please let me know what you mean by "service config" and "relevant objects", and I will post them in my next reply. I thought I had provided all the salient information about the Nagios configuration on the client LPAR in my original posting to this forum...

Post by **lmiltchev** » Tue Jul 09, 2019 4:59 pm

I was hoping to see the service definition of "Check /opt" service on "hrmsdbp" host. It's probably located in the /usr/local/nagios/etc/objects/ directory. What I meant by "relevant objects" was a service template and a check command that is used by this service, for example:

define service {
    host_name                   hrmsdbp
    service_description         Check /opt
    use                         <some template>
    ...
    register                    1
}

Post by **coactmwp** » Mon Jul 15, 2019 3:42 pm

lmiltchev wrote: I was hoping to see the service definition of "Check /opt" service on "hrmsdbp" host. It's probably located in the /usr/local/nagios/etc/objects/ directory. What I meant by "relevant objects" was a service template and a check command that is used by this service, for example:
define service {
    host_name                   hrmsdbp
    service_description         Check /opt
    use                         <some template>
    ...
    register                    1
}

@lmiltchev

My Nagios Administrator saw your note and provided this information from our Nagios Server:

define service {
       use hrmsdbp-host-service
       host_name hrmsdbp
       service_description Check /opt
       check_command check_nrpe_aix!check_disk4
       max_check_attempts      3
}
define service {
       name hrmsdbp-host-service
       use aix-service
       register 0
}

Does this help any? Is there any other information that you need?

Post by **coactmwp** » Mon Jul 15, 2019 3:46 pm

@lmiltchev

Also, I went looking on the local host, but since it is an AIX LPAR, there is no "objects" directory in the /usr/local/nagios/etc directory on the local system:

hrmsdbp > /usr/local/nagios/etc # ls -l
total 48
-rw-r--r-- 1 nagios nagios 8615 Jul 03 10:39 nrpe.cfg
-rw-r--r-- 1 nagios nagios 8615 Jul 03 10:29 nrpe.cfg.orig
hrmsdbp > /usr/local/nagios/etc #

I hope this helps.
