Encountering problems with NRPE
Posted: Thu Jun 30, 2016 11:09 am
Mods, please move if this is in the wrong location; it seemed the best place.
I'm encountering a new problem in a specific environment after years of flawless NRPE operation.
A customer has several CentOS 6 servers and a small handful of CentOS 5 servers. Each runs NRPE installed from the EPEL repository, where the latest available version is 2.15:
Intermittently, we see an unusual issue where the exit code for plugins switches, causing most/all of the NRPE checks associated with the host to go critical, despite no actual problems:
Normal:
Abnormal:
A more complete log snippet shows that most checks flip over to an exit code of 2, though the check_3ware command (a Perl script, not a compiled executable) flips back and forth:
Note that at 17:17:17 a user restarted the NRPE daemon, after which the checks correctly returned 0 again.
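In case it helps anyone compare against their own logs, here is a rough sketch of how the mismatching results could be pulled out of the daemon log. It assumes the "Command completed with return code" line format shown in the excerpts in this post and the usual Nagios plugin convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN). The name `find_mismatches` is made up for illustration and is not part of NRPE.

```python
import re

# Line format copied from the debug log excerpts in this post (an assumption
# for any other NRPE version or syslog configuration).
LINE_RE = re.compile(
    r"nrpe\[(?P<pid>\d+)\]: Command completed with return code "
    r"(?P<code>\d+) and output: (?P<output>.*)"
)

# Usual Nagios plugin convention for exit codes.
STATUS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}
STATUS_RE = re.compile(r"\b(OK|WARNING|CRITICAL|UNKNOWN)\b")


def find_mismatches(lines):
    """Yield (pid, code, output) when the code disagrees with the status word.

    Hypothetical helper: only the first status word in the output is used.
    """
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a "Command completed" line
        status = STATUS_RE.search(m.group("output"))
        if status is None:
            continue  # no recognizable status word, so nothing to compare
        code = int(m.group("code"))
        if STATUS_CODES[status.group(1)] != code:
            yield int(m.group("pid")), code, m.group("output")


# Two lines taken from the excerpts in this post: one abnormal, one normal.
sample = [
    "Jun 30 17:13:20 morpheus07 nrpe[3433]: Command completed with return code 2 and output: PROCS OK: 219 processes",
    "Jul 1 00:33:19 morpheus07 nrpe[17844]: Command completed with return code 0 and output: PROCS OK: 221 processes",
]
for hit in find_mismatches(sample):
    print(hit)
```

Feeding it an open log file instead of `sample` should work the same way, since it only needs an iterable of lines.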
So far, this is what we've determined:
1) All compiled checks, and occasionally non-compiled checks, appear to suddenly start returning exit code 2, despite finding no issues.
2) Restarting the NRPE daemon has returned us to normal operation, but is not a permanent fix.
3) This appears to happen randomly; we haven't been able to tie it to any specific time or triggering event.
4) It impacts multiple instances of NRPE Daemon 2.15 on CentOS 5 and CentOS 6, but only for this one customer. Our internal systems and other customers have no such issues at all.
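Since we have not found a trigger yet (point 3), one thing that might help is listing the moments at which a host flips between normal and abnormal results, so those timestamps can be lined up against cron jobs, package updates, and so on. This is only a sketch under the same log-format assumption as the excerpts in this post. `state_changes` and `skip_prefixes` are hypothetical names, and check_3ware.pl is skipped by default only because it flips back and forth on its own in our logs.

```python
import re

# Syslog prefix and message format as seen in this post's excerpts (assumed).
RESULT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) \S+ nrpe\[\d+\]: "
    r"Command completed with return code (?P<code>\d+) and output: (?P<output>.*)"
)


def state_changes(lines, skip_prefixes=("check_3ware.pl",)):
    """Return [(timestamp, 'bad' or 'good')] for every flip in the log.

    A result counts as 'bad' when the output contains "OK" but the return
    code is non-zero. Outputs starting with any of skip_prefixes are ignored.
    """
    changes = []
    previous = None
    for line in lines:
        m = RESULT_RE.match(line)
        if not m:
            continue
        output = m.group("output")
        if output.startswith(tuple(skip_prefixes)) or "OK" not in output:
            continue
        state = "good" if m.group("code") == "0" else "bad"
        if state != previous:
            # Record only the first result of each run of good or bad results.
            changes.append((m.group("ts"), state))
            previous = state
    return changes
```

On the excerpt around the 17:17:17 restart, this would be expected to report the switch back to 'good' at the first check that ran after the daemon came back up.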
Has anyone seen anything like this before? I've searched all over, but if it is documented it is either hiding somewhere, or I'm not using the right search terms to uncover it.
On a side note, and possibly related: the last line in each log segment looks odd, with garbled characters after "Connection from". Shouldn't that show the IP address or port of the connection being closed?
Thanks!
Code:
nrpe-2.15-7.el5.x86_64 : Host/service/network monitoring agent for Nagios
Repo : installed
Normal:
Code:
Jul 1 00:33:19 morpheus07 nrpe[17844]: Connection from (removed) port 9934
Jul 1 00:33:19 morpheus07 nrpe[17844]: Host address is in allowed_hosts
Jul 1 00:33:19 morpheus07 nrpe[17844]: Handling the connection...
Jul 1 00:33:19 morpheus07 nrpe[17844]: Host is asking for command 'check_procs' to be run...
Jul 1 00:33:19 morpheus07 nrpe[17844]: Running command: /usr/lib64/nagios/plugins/check_procs -w 750 -c 1000
Jul 1 00:33:19 morpheus07 nrpe[17844]: Command completed with return code 0 and output: PROCS OK: 221 processes
Jul 1 00:33:19 morpheus07 nrpe[17844]: Return Code: 0, Output: PROCS OK: 221 processes
Jul 1 00:33:19 morpheus07 nrpe[17844]: Connection from p▒▒5▒#177 closed.
Abnormal:
Code:
Jun 30 17:13:20 morpheus07 nrpe[3433]: Connection from (removed) port 32398
Jun 30 17:13:20 morpheus07 nrpe[3433]: Host address is in allowed_hosts
Jun 30 17:13:20 morpheus07 nrpe[3433]: Handling the connection...
Jun 30 17:13:20 morpheus07 nrpe[3433]: Host is asking for command 'check_procs' to be run...
Jun 30 17:13:20 morpheus07 nrpe[3433]: Running command: /usr/lib64/nagios/plugins/check_procs -w 750 -c 1000
Jun 30 17:13:20 morpheus07 nrpe[3433]: Command completed with return code 2 and output: PROCS OK: 219 processes
Jun 30 17:13:20 morpheus07 nrpe[3433]: Return Code: 2, Output: PROCS OK: 219 processes
Jun 30 17:13:20 morpheus07 nrpe[3433]: Connection from ▒&▒,▒#177 closed.
More complete log snippet:
Code:
[root@morpheus07 ~]# cat /var/log/daemon | grep 'Jun 30 17:[01]' | grep return
Jun 30 17:01:26 morpheus07 nrpe[3148]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:02:44 morpheus07 nrpe[3172]: Command completed with return code 2 and output: OK - load average: 1.96, 1.97, 1.99|load1=1.960;30.000;40.000;0; load5=1.970;30.000;40.000;0; load15=1.990;30.000;40.000;0;
Jun 30 17:03:08 morpheus07 nrpe[3186]: Command completed with return code 2 and output: DISK OK - free space: / 151072 MB (83% inode=99%);| /=30486MB;191195;191180;0;191275
Jun 30 17:03:20 morpheus07 nrpe[3192]: Command completed with return code 2 and output: PROCS OK: 219 processes
Jun 30 17:03:39 morpheus07 nrpe[3200]: Command completed with return code 2 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:04:42 morpheus07 nrpe[3220]: Command completed with return code 0 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:06:26 morpheus07 nrpe[3260]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:07:44 morpheus07 nrpe[3297]: Command completed with return code 2 and output: OK - load average: 1.99, 1.97, 1.99|load1=1.990;30.000;40.000;0; load5=1.970;30.000;40.000;0; load15=1.990;30.000;40.000;0;
Jun 30 17:08:08 morpheus07 nrpe[3309]: Command completed with return code 2 and output: DISK OK - free space: / 151069 MB (83% inode=99%);| /=30489MB;191195;191180;0;191275
Jun 30 17:08:20 morpheus07 nrpe[3314]: Command completed with return code 2 and output: PROCS OK: 227 processes
Jun 30 17:09:41 morpheus07 nrpe[3340]: Command completed with return code 2 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:10:41 morpheus07 nrpe[3372]: Command completed with return code 0 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:11:26 morpheus07 nrpe[3388]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:12:44 morpheus07 nrpe[3415]: Command completed with return code 2 and output: OK - load average: 2.03, 2.03, 2.00|load1=2.030;30.000;40.000;0; load5=2.030;30.000;40.000;0; load15=2.000;30.000;40.000;0;
Jun 30 17:13:08 morpheus07 nrpe[3428]: Command completed with return code 2 and output: DISK OK - free space: / 151067 MB (83% inode=99%);| /=30491MB;191195;191180;0;191275
Jun 30 17:13:20 morpheus07 nrpe[3433]: Command completed with return code 2 and output: PROCS OK: 219 processes
Jun 30 17:15:41 morpheus07 nrpe[3485]: Command completed with return code 2 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:16:26 morpheus07 nrpe[3500]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:16:41 morpheus07 nrpe[3508]: Command completed with return code 0 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:17:44 morpheus07 nrpe[3580]: Command completed with return code 0 and output: OK - load average: 2.00, 2.00, 2.00|load1=2.000;30.000;40.000;0; load5=2.000;30.000;40.000;0; load15=2.000;30.000;40.000;0;
Jun 30 17:18:08 morpheus07 nrpe[3593]: Command completed with return code 0 and output: DISK OK - free space: / 151064 MB (83% inode=99%);| /=30494MB;191195;191180;0;191275
Jun 30 17:18:20 morpheus07 nrpe[3598]: Command completed with return code 0 and output: PROCS OK: 225 processes
[root@morpheus07 ~]#