N Server shows CRITICAL alert, but check_disk is DISK OK


Post by coactmwp »

Environment: AIX LPAR, running AIX 7.1_TL04_SP03
Nagios version for AIX: 2.0.1.0

25 days ago, our Nagios Server issued a CRITICAL 'Check Disk' alert on the /opt (/dev/hd10opt) filesystem, indicating "DISK CRITICAL - free space: /opt 26 MB (1% inode=32%)"

'df' on the LPAR showed that the /opt filesystem appeared to be fine:
hrmsdbp > / # df -g
Filesystem      GB blocks   Free  %Used   Iused  %Iused  Mounted on
/dev/hd4             0.50   0.31    39%    3963      6%  /
/dev/hd2             4.34   1.81    59%   45930     10%  /usr
/dev/hd9var          2.00   1.80    10%    3530      1%  /var
/dev/hd3             1.00   1.00     1%      59      1%  /tmp
/dev/hd1             0.50   0.50     1%     121      1%  /home
/dev/hd11admin       0.12   0.12     1%       7      1%  /admin
/proc                   -      -      -       -       -  /proc
/dev/hd10opt         2.00   0.78    62%   19963     10%  /opt
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/dev/datalv          0.12   0.12     1%      64      1%  /data
I thought the issue was over-consumed and unreleased filesystem inodes, so I scheduled a maintenance window to reboot the LPAR.
The reboot of the LPAR had no effect on the alert on the Nagios Server.

To diagnose the problem, I ran the 'check_disk' executable local to the system, and got this:
hrmsdbp > / # /usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/hd10opt
DISK OK - free space: /opt 795 MB (38% inode=90%);| /opt=1252MB;1843;1945;0;2048
The local 'check_disk' command returns "DISK OK" on the filesystem that the Nagios Server thinks has a problem.
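For reference, the text after the `|` in the plugin output is performance data. Under the Nagios plugin guidelines, each item has the form label=value[UOM];warn;crit;min;max. The sketch below splits the item shown above into its fields. Note that parse_perfdata is a hypothetical helper written for illustration, not something shipped with the plugins.

```shell
#!/bin/sh
# parse_perfdata ITEM
# Hypothetical helper: split one perfdata item of the form
# label=value[UOM];warn;crit;min;max into named fields.
parse_perfdata() {
    item=$1
    label=${item%%=*}      # text before the first '='
    rest=${item#*=}        # text after the first '='
    old_ifs=$IFS
    IFS=';'
    set -- $rest           # split the remaining fields on ';'
    IFS=$old_ifs
    echo "label=$label value=$1 warn=$2 crit=$3 min=$4 max=$5"
}

parse_perfdata "/opt=1252MB;1843;1945;0;2048"
```

Going by the numbers in this thread, the value appears to be used space (2048 MB total minus 795 MB free is roughly 1252 MB), and 1843 and 1945 appear to be the 10% and 5% free-space thresholds expressed as used megabytes (about 90% and 95% of 2048).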

I engaged our Nagios administrator, and she double-checked that the correct command was referenced properly in the Nagios Server config for the LPAR:
Nagios Server:
service_description Check /opt
check_command check_nrpe_aix!check_disk4
Local nrpe.cfg entry:
command[check_disk4]=/usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/hd10opt
I even went so far as to deinstall and reinstall the Nagios.rte fileset on the system, preserving the nrpe.cfg file.
Again, no effect on the alert on the Nagios Server side.

Both I and our Nagios Administrator are stymied as to what could be causing this issue...

Any help?
Attachments
Capture of the return from local 'check_disk' execution (Nagios_local_check_disk.PNG)
'df' capture from LPAR
Capture graphic of Nagios Server alert

Re: N Server shows CRITICAL alert, but check_disk is DISK OK

Post by benjaminsmith »

Hello @coactmwp,

Thanks for uploading the screenshots and other relevant data. It looks like you are running the check command on the remote host as the root user. Let's try running the command as the nagios user and compare the check results.

For example:

Code: Select all

su nagios
/usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/hd10opt
Then run the command on the remote host, as the nagios user, through check_nrpe:

Code: Select all

./check_nrpe -H 127.0.0.1 -c check_disk4
Are the results the same as reported by the Nagios Server?
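
When comparing the two runs, it may help to look at the exit status as well as the text, since Nagios derives the service state from the plugin's return code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch, where state_name is a hypothetical helper and not part of NRPE or the plugins:

```shell
#!/bin/sh
# state_name STATUS
# Hypothetical helper: map a plugin exit status to the
# conventional Nagios state name.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "UNKNOWN (unexpected exit status $1)" ;;
    esac
}

# Usage sketch, run as the nagios user (path taken from the posts above):
#   /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -c check_disk4
#   state_name $?
```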

Re: N Server shows CRITICAL alert, but check_disk is DISK OK

Post by coactmwp »

@benjaminsmith

Thank you kindly for your prompt reply.

Here are the output results that you requested from our remote host.

First, running the check_disk command on the remote host as the nagios user:
hrmsdbp > /usr/local/nagios/etc # su nagios
hrmsdbp > /usr/local/nagios/etc # whoami
nagios
hrmsdbp > /usr/local/nagios/etc # /usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/hd10opt
DISK OK - free space: /opt 795 MB (38% inode=90%);| /opt=1252MB;1843;1945;0;2048
Next, running the check_nrpe command on the remote host as the nagios user:
hrmsdbp > /usr/local/nagios/etc # /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -c check_disk4
DISK OK - free space: /opt 795 MB (38% inode=90%);| /opt=1252MB;1843;1945;0;2048
hrmsdbp > /usr/local/nagios/etc # whoami
nagios
hrmsdbp > /usr/local/nagios/etc #
Unfortunately, no joy.
Both commands report DISK OK, even when executed by the nagios user, so it is still a mystery to me and our Nagios Administrator why the Nagios Server persists in maintaining the CRITICAL alert...

Re: N Server shows CRITICAL alert, but check_disk is DISK OK

Post by ssax »

What does this output?

Code: Select all

/usr/local/nagios/libexec/check_disk -w 90% -c 95% -p /dev/hd10opt

Re: N Server shows CRITICAL alert, but check_disk is DISK OK

Post by lmiltchev »

Does the state change if you forcefully re-schedule the next check of this service from the GUI? Can you post the service config, along with any relevant objects, i.e. a command and/or a template that this service is using?

Re: N Server shows CRITICAL alert, but check_disk is DISK OK

Post by coactmwp »

@ssax

Here is the output of the command you requested, executed as the nagios user on the affected system:
hrmsdbp > / # su nagios
hrmsdbp > / # whoami
nagios
hrmsdbp > / # /usr/local/nagios/libexec/check_disk -w 90% -c 95% -p /dev/hd10opt
DISK CRITICAL - free space: /opt 794 MB (38% inode=90%);| /opt=1253MB;204;102;0;2048
hrmsdbp > / #
So that would explain why the server is listing the filesystem with a CRITICAL alert. However, the above command is not the one referenced in the nrpe.cfg file, and the filesystem itself doesn't reflect what the above command indicates.
Here is the current 'df -g' output from the system:
hrmsdbp > / # df -g
Filesystem      GB blocks   Free  %Used   Iused  %Iused  Mounted on
/dev/hd4             0.50   0.30    40%    3965      6%  /
/dev/hd2             4.34   1.81    59%   45930     10%  /usr
/dev/hd9var          2.00   1.80    11%    3534      1%  /var
/dev/hd3             1.00   1.00     1%      56      1%  /tmp
/dev/hd1             0.50   0.50     1%     121      1%  /home
/dev/hd11admin       0.12   0.12     1%       7      1%  /admin
/proc                   -      -      -       -       -  /proc
/dev/hd10opt         2.00   0.78    62%   19973     10%  /opt
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/dev/datalv          0.12   0.12     1%      64      1%  /data
/dev/u01lv          68.00  34.92    49%  209097      3%  /u01
The filesystem isn't indicating that it is 95% full, nor that there are only 5% inodes left for the filesystem...

What gives?
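
One point that may explain the output above: per the check_disk help text, the percentages given to -w and -c are minimum free space, so the check goes WARNING or CRITICAL when free space falls below them. With 38% free, -w 90% -c 95% is expected to report CRITICAL, while the -w 10% -c 5% from nrpe.cfg reports OK. A rough sketch of that comparison, with disk_state as a hypothetical helper that handles whole-number percentages only and ignores inode thresholds:

```shell
#!/bin/sh
# disk_state FREE_PCT WARN_PCT CRIT_PCT
# Hypothetical helper, not part of the Nagios plugins. It mirrors the
# free-space semantics of -w/-c: the state degrades when the free
# percentage is BELOW the given threshold.
disk_state() {
    free=$1
    warn=$2
    crit=$3
    if [ "$free" -lt "$crit" ]; then
        echo "CRITICAL"
    elif [ "$free" -lt "$warn" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

disk_state 38 10 5     # thresholds from nrpe.cfg (-w 10% -c 5%)
disk_state 38 90 95    # thresholds from the test command (-w 90% -c 95%)
```

Under this reading, the CRITICAL result from the -w 90% -c 95% run does not by itself say the filesystem is 95% full; it says less than 95% of it is free.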

Re: N Server shows CRITICAL alert, but check_disk is DISK OK

Post by coactmwp »

@lmiltchev

This CRITICAL alert has been going on now for 28 days straight, and after the maintenance window reboot of the LPAR that I arranged for last week, I forced a re-check of the service from the Nagios Server GUI, and the CRITICAL alert never changed.

Please let me know what you mean by "service config" and "relevant objects", and I will post them in my next reply. I thought I had provided all the salient information about the Nagios configuration on the client LPAR in my original posting to this forum...

Re: N Server shows CRITICAL alert, but check_disk is DISK OK

Post by lmiltchev »

I was hoping to see the service definition of the "Check /opt" service on the "hrmsdbp" host. It's probably located in the /usr/local/nagios/etc/objects/ directory on the Nagios Server. By "relevant objects" I meant the service template and the check command that this service uses, for example:

Code: Select all

define service {
    host_name                   hrmsdbp
    service_description         Check /opt
    use                         <some template>
    ...
    register                    1
}

Re: N Server shows CRITICAL alert, but check_disk is DISK OK

Post by coactmwp »

@lmiltchev

My Nagios Administrator saw your note and provided this information from our Nagios Server:

Code: Select all

define service {
       use hrmsdbp-host-service
       host_name hrmsdbp
       service_description Check /opt
       check_command check_nrpe_aix!check_disk4
       max_check_attempts      3
}
define service {
       name hrmsdbp-host-service
       use aix-service
       register 0
}
Does this help any? Is there any other information that you need?

Re: N Server shows CRITICAL alert, but check_disk is DISK OK

Post by coactmwp »

@lmiltchev

Also, I went looking on the local host, but since it is an AIX LPAR, there is no "objects" directory in the /usr/local/nagios/etc directory on the local system:
hrmsdbp > /usr/local/nagios/etc # ls -l
total 48
-rw-r--r-- 1 nagios nagios 8615 Jul 03 10:39 nrpe.cfg
-rw-r--r-- 1 nagios nagios 8615 Jul 03 10:29 nrpe.cfg.orig
hrmsdbp > /usr/local/nagios/etc #
I hope this helps.