I'm encountering a new problem in a specific environment after years of flawless NRPE operation.
A customer has several CentOS 6 servers and a small handful of CentOS 5 servers. Each runs NRPE installed from the EPEL repository, where the latest available version is 2.15:
nrpe-2.15-7.el5.x86_64 : Host/service/network monitoring agent for Nagios
Repo : installed
Normal:
Jul 1 00:33:19 morpheus07 nrpe[17844]: Connection from (removed) port 9934
Jul 1 00:33:19 morpheus07 nrpe[17844]: Host address is in allowed_hosts
Jul 1 00:33:19 morpheus07 nrpe[17844]: Handling the connection...
Jul 1 00:33:19 morpheus07 nrpe[17844]: Host is asking for command 'check_procs' to be run...
Jul 1 00:33:19 morpheus07 nrpe[17844]: Running command: /usr/lib64/nagios/plugins/check_procs -w 750 -c 1000
Jul 1 00:33:19 morpheus07 nrpe[17844]: Command completed with return code 0 and output: PROCS OK: 221 processes
Jul 1 00:33:19 morpheus07 nrpe[17844]: Return Code: 0, Output: PROCS OK: 221 processes
Jul 1 00:33:19 morpheus07 nrpe[17844]: Connection from p▒▒5▒#177 closed.
Jun 30 17:13:20 morpheus07 nrpe[3433]: Connection from (removed) port 32398
Jun 30 17:13:20 morpheus07 nrpe[3433]: Host address is in allowed_hosts
Jun 30 17:13:20 morpheus07 nrpe[3433]: Handling the connection...
Jun 30 17:13:20 morpheus07 nrpe[3433]: Host is asking for command 'check_procs' to be run...
Jun 30 17:13:20 morpheus07 nrpe[3433]: Running command: /usr/lib64/nagios/plugins/check_procs -w 750 -c 1000
Jun 30 17:13:20 morpheus07 nrpe[3433]: Command completed with return code 2 and output: PROCS OK: 219 processes
Jun 30 17:13:20 morpheus07 nrpe[3433]: Return Code: 2, Output: PROCS OK: 219 processes
Jun 30 17:13:20 morpheus07 nrpe[3433]: Connection from ▒&▒,▒#177 closed.
[root@morpheus07 ~]# cat /var/log/daemon | grep 'Jun 30 17:[01]' | grep return
Jun 30 17:01:26 morpheus07 nrpe[3148]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:02:44 morpheus07 nrpe[3172]: Command completed with return code 2 and output: OK - load average: 1.96, 1.97, 1.99|load1=1.960;30.000;40.000;0; load5=1.970;30.000;40.000;0; load15=1.990;30.000;40.000;0;
Jun 30 17:03:08 morpheus07 nrpe[3186]: Command completed with return code 2 and output: DISK OK - free space: / 151072 MB (83% inode=99%);| /=30486MB;191195;191180;0;191275
Jun 30 17:03:20 morpheus07 nrpe[3192]: Command completed with return code 2 and output: PROCS OK: 219 processes
Jun 30 17:03:39 morpheus07 nrpe[3200]: Command completed with return code 2 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:04:42 morpheus07 nrpe[3220]: Command completed with return code 0 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:06:26 morpheus07 nrpe[3260]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:07:44 morpheus07 nrpe[3297]: Command completed with return code 2 and output: OK - load average: 1.99, 1.97, 1.99|load1=1.990;30.000;40.000;0; load5=1.970;30.000;40.000;0; load15=1.990;30.000;40.000;0;
Jun 30 17:08:08 morpheus07 nrpe[3309]: Command completed with return code 2 and output: DISK OK - free space: / 151069 MB (83% inode=99%);| /=30489MB;191195;191180;0;191275
Jun 30 17:08:20 morpheus07 nrpe[3314]: Command completed with return code 2 and output: PROCS OK: 227 processes
Jun 30 17:09:41 morpheus07 nrpe[3340]: Command completed with return code 2 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:10:41 morpheus07 nrpe[3372]: Command completed with return code 0 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:11:26 morpheus07 nrpe[3388]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:12:44 morpheus07 nrpe[3415]: Command completed with return code 2 and output: OK - load average: 2.03, 2.03, 2.00|load1=2.030;30.000;40.000;0; load5=2.030;30.000;40.000;0; load15=2.000;30.000;40.000;0;
Jun 30 17:13:08 morpheus07 nrpe[3428]: Command completed with return code 2 and output: DISK OK - free space: / 151067 MB (83% inode=99%);| /=30491MB;191195;191180;0;191275
Jun 30 17:13:20 morpheus07 nrpe[3433]: Command completed with return code 2 and output: PROCS OK: 219 processes
Jun 30 17:15:41 morpheus07 nrpe[3485]: Command completed with return code 2 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:16:26 morpheus07 nrpe[3500]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:16:41 morpheus07 nrpe[3508]: Command completed with return code 0 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:17:44 morpheus07 nrpe[3580]: Command completed with return code 0 and output: OK - load average: 2.00, 2.00, 2.00|load1=2.000;30.000;40.000;0; load5=2.000;30.000;40.000;0; load15=2.000;30.000;40.000;0;
Jun 30 17:18:08 morpheus07 nrpe[3593]: Command completed with return code 0 and output: DISK OK - free space: / 151064 MB (83% inode=99%);| /=30494MB;191195;191180;0;191275
Jun 30 17:18:20 morpheus07 nrpe[3598]: Command completed with return code 0 and output: PROCS OK: 225 processes
[root@morpheus07 ~]#
So far, this is what we've determined:
1) All compiled checks, and occasionally non-compiled checks, appear to suddenly start returning exit code 2, despite finding no issues.
2) Restarting the NRPE daemon has returned us to normal operation, but is not a permanent fix.
3) This appears to happen randomly; we haven't been able to tie it to any specific time or triggering event.
4) It impacts multiple instances of NRPE Daemon 2.15 on CentOS 5 and CentOS 6, but only for this one customer. Our internal systems and other customers have no such issues at all.
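To make it easier to see when the bad return codes start and stop, the debug log can be scanned for lines where the return code disagrees with the status word in the plugin output. The following is only a rough sketch, not anything that ships with NRPE: the function name find_mismatches and the "first status word in the output" heuristic are my own assumptions, and the regular expression is based solely on the "Command completed" line format shown in the excerpts above.

```python
import re

# Matches the "Command completed" debug lines in the format shown in the
# log excerpts above (assumption: other NRPE builds may word this differently).
LINE_RE = re.compile(
    r"nrpe\[(?P<pid>\d+)\]: Command completed with return code "
    r"(?P<code>\d+) and output: (?P<output>.*)$"
)

# Standard Nagios plugin exit codes and the status word each one implies.
STATUS_BY_CODE = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
STATUS_WORD_RE = re.compile(r"\b(OK|WARNING|CRITICAL|UNKNOWN)\b")


def find_mismatches(lines):
    """Return (pid, code, status_word, line) tuples where the return code
    and the first status word in the plugin output disagree."""
    mismatches = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a "Command completed" line
        code = int(match.group("code"))
        word = STATUS_WORD_RE.search(match.group("output"))
        if word is None:
            continue  # output carries no recognizable status word
        if STATUS_BY_CODE.get(code) != word.group(1):
            mismatches.append(
                (int(match.group("pid")), code, word.group(1), line.rstrip("\n"))
            )
    return mismatches


# Two lines copied from the log excerpts above, used as sample input.
SAMPLE = [
    "Jun 30 17:13:20 morpheus07 nrpe[3433]: Command completed with return code 2 and output: PROCS OK: 219 processes",
    "Jun 30 17:18:20 morpheus07 nrpe[3598]: Command completed with return code 0 and output: PROCS OK: 225 processes",
]

if __name__ == "__main__":
    for pid, code, word, line in find_mismatches(SAMPLE):
        print(f"pid {pid}: return code {code} but output says {word}: {line}")
```

To run it against the real log, pass the open file instead of SAMPLE, for example find_mismatches(open("/var/log/daemon", errors="replace")); errors="replace" keeps the garbled bytes in the "Connection from" lines from aborting the read.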
Has anyone seen anything like this before? I've searched all over, but if it is documented it is either hiding somewhere, or I'm not using the right search terms to uncover it.
On a side note, and possibly related, the last line in each log excerpt looks odd. Shouldn't "Connection from ... closed" name an IP address or port number rather than those garbled characters?
Thanks!