Nagios version for AIX: 2.0.1.0
25 days ago, our Nagios Server issued a CRITICAL 'Check Disk' alert on the /opt (/dev/hd10opt) filesystem, reporting "DISK CRITICAL - free space: /opt 26 MB (1% inode=32%)".
'df' on the LPAR showed that the /opt filesystem appeared to be fine:
hrmsdbp > / # df -g
Filesystem      GB blocks   Free  %Used  Iused  %Iused  Mounted on
/dev/hd4             0.50   0.31    39%   3963      6%  /
/dev/hd2             4.34   1.81    59%  45930     10%  /usr
/dev/hd9var          2.00   1.80    10%   3530      1%  /var
/dev/hd3             1.00   1.00     1%     59      1%  /tmp
/dev/hd1             0.50   0.50     1%    121      1%  /home
/dev/hd11admin       0.12   0.12     1%      7      1%  /admin
/proc                   -      -      -      -       -  /proc
/dev/hd10opt         2.00   0.78    62%  19963     10%  /opt
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/dev/datalv          0.12   0.12     1%     64      1%  /data
I thought the issue was over-consumed and unreleased filesystem inodes, so I scheduled a maintenance window to reboot the LPAR.
The reboot of the LPAR had no effect on the alert on the Nagios Server.
To diagnose the problem, I ran the 'check_disk' executable locally on the system and got this:
hrmsdbp > / # /usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/hd10opt
DISK OK - free space: /opt 795 MB (38% inode=90%);| /opt=1252MB;1843;1945;0;2048
The local 'check_disk' command returns "DISK OK" on the filesystem that the Nagios Server thinks has a problem.
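As a sanity check on that output, here is a rough sketch that re-derives the state from the performance data field. It assumes the second and third values in the field are the warning and critical thresholds expressed as used space in MB (1843 is about 90% of 2048, 1945 about 95%), and that reaching a threshold trips it. The helper name perfdata_status is made up for illustration and is not part of the Nagios plugins.

```shell
# Hypothetical helper (not part of Nagios): classify one perfdata field of the
# form label=<used>MB;<warn>;<crit>;<min>;<max>.
# Assumption: warn and crit are "used space" thresholds in the same unit.
perfdata_status() {
    echo "$1" | awk -F'[=;]' '{
        used = $2 + 0          # "1252MB" + 0 evaluates to 1252 in awk
        warn = $3 + 0
        crit = $4 + 0
        if (used >= crit)      print "CRITICAL"
        else if (used >= warn) print "WARNING"
        else                   print "OK"
    }'
}

# Perfdata field copied from the local check_disk run above.
perfdata_status "/opt=1252MB;1843;1945;0;2048"
```

With 1252 MB used against thresholds of 1843 and 1945, this agrees with the local "DISK OK" result.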
I engaged our Nagios administrator, and she double-checked that the correct command was referenced in the Nagios Server config for the LPAR:
Nagios Server:
service_description Check /opt
check_command check_nrpe_aix!check_disk4
Local nrpe.cfg entry:
command[check_disk4]=/usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /dev/hd10opt
I even went so far as to deinstall/reinstall the Nagios.rte fileset from the system, preserving the nrpe.cfg file.
Again, no effect on the alert on the Nagios Server side.
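To make the mismatch explicit, here is a small sketch that compares the state word in the two status lines quoted in this post. The helper name plugin_state is hypothetical. The comment about fetching the server-side line with check_nrpe assumes that plugin is installed on the Nagios Server, and its path and the host name are site-specific.

```shell
# Hypothetical helper: pull the state word (OK, WARNING, CRITICAL) out of a
# check_disk status line such as "DISK OK - free space: ...".
plugin_state() {
    echo "$1" | awk '{ print $2 }'
}

# Status lines copied from this post. In practice the first one could be
# fetched on the Nagios Server with something like:
#   check_nrpe -H <lpar> -c check_disk4
# (assumption: check_nrpe is installed there; path and host are site-specific)
via_server='DISK CRITICAL - free space: /opt 26 MB (1% inode=32%)'
local_run='DISK OK - free space: /opt 795 MB (38% inode=90%);| /opt=1252MB;1843;1945;0;2048'

if [ "$(plugin_state "$via_server")" = "$(plugin_state "$local_run")" ]; then
    echo "states match"
else
    echo "states differ: server=$(plugin_state "$via_server") local=$(plugin_state "$local_run")"
fi
```

If the line returned through NRPE differs from the local run, as it does here, the two are probably not evaluating the same filesystem or the same command definition.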
Both our Nagios administrator and I are stymied as to what could be causing this issue.
Any help?