Encountering problems with NRPE

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
Locked
ircgilbert
Posts: 4
Joined: Thu Jun 30, 2016 10:19 am

Encountering problems with NRPE

Post by ircgilbert »

Mods, please move if this is in the wrong location; it seemed the best place.

I'm encountering a new problem in a specific environment after years of flawless NRPE operation.

A customer has several CentOS 6 servers, and a small handful of CentOS 5 servers. Each of these is running with NRPE installed from the EPEL repository, the latest version available being 2.15:

Code: Select all

nrpe-2.15-7.el5.x86_64 : Host/service/network monitoring agent for Nagios
Repo        : installed

Intermittently, we see an unusual issue where the exit code reported for plugins changes from 0 to 2, causing most or all of the NRPE checks associated with the host to go critical despite there being no actual problem:

Normal:

Code: Select all

Jul  1 00:33:19 morpheus07 nrpe[17844]: Connection from (removed) port 9934
Jul  1 00:33:19 morpheus07 nrpe[17844]: Host address is in allowed_hosts
Jul  1 00:33:19 morpheus07 nrpe[17844]: Handling the connection...
Jul  1 00:33:19 morpheus07 nrpe[17844]: Host is asking for command 'check_procs' to be run...
Jul  1 00:33:19 morpheus07 nrpe[17844]: Running command: /usr/lib64/nagios/plugins/check_procs -w 750 -c 1000
Jul  1 00:33:19 morpheus07 nrpe[17844]: Command completed with return code 0 and output: PROCS OK: 221 processes
Jul  1 00:33:19 morpheus07 nrpe[17844]: Return Code: 0, Output: PROCS OK: 221 processes
Jul  1 00:33:19 morpheus07 nrpe[17844]: Connection from p▒▒5▒#177 closed.

Abnormal:

Code: Select all

Jun 30 17:13:20 morpheus07 nrpe[3433]: Connection from (removed) port 32398
Jun 30 17:13:20 morpheus07 nrpe[3433]: Host address is in allowed_hosts
Jun 30 17:13:20 morpheus07 nrpe[3433]: Handling the connection...
Jun 30 17:13:20 morpheus07 nrpe[3433]: Host is asking for command 'check_procs' to be run...
Jun 30 17:13:20 morpheus07 nrpe[3433]: Running command: /usr/lib64/nagios/plugins/check_procs -w 750 -c 1000
Jun 30 17:13:20 morpheus07 nrpe[3433]: Command completed with return code 2 and output: PROCS OK: 219 processes
Jun 30 17:13:20 morpheus07 nrpe[3433]: Return Code: 2, Output: PROCS OK: 219 processes
Jun 30 17:13:20 morpheus07 nrpe[3433]: Connection from ▒&▒,▒#177 closed.

A more complete log snippet shows that most checks flip over to an exit code of 2, though the check_3ware command (which is a Perl script, not a compiled executable) flips back and forth:

Code: Select all

[root@morpheus07 ~]# cat /var/log/daemon | grep 'Jun 30 17:[01]' | grep return
Jun 30 17:01:26 morpheus07 nrpe[3148]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:02:44 morpheus07 nrpe[3172]: Command completed with return code 2 and output: OK - load average: 1.96, 1.97, 1.99|load1=1.960;30.000;40.000;0; load5=1.970;30.000;40.000;0; load15=1.990;30.000;40.000;0;
Jun 30 17:03:08 morpheus07 nrpe[3186]: Command completed with return code 2 and output: DISK OK - free space: / 151072 MB (83% inode=99%);| /=30486MB;191195;191180;0;191275
Jun 30 17:03:20 morpheus07 nrpe[3192]: Command completed with return code 2 and output: PROCS OK: 219 processes
Jun 30 17:03:39 morpheus07 nrpe[3200]: Command completed with return code 2 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:04:42 morpheus07 nrpe[3220]: Command completed with return code 0 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:06:26 morpheus07 nrpe[3260]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:07:44 morpheus07 nrpe[3297]: Command completed with return code 2 and output: OK - load average: 1.99, 1.97, 1.99|load1=1.990;30.000;40.000;0; load5=1.970;30.000;40.000;0; load15=1.990;30.000;40.000;0;
Jun 30 17:08:08 morpheus07 nrpe[3309]: Command completed with return code 2 and output: DISK OK - free space: / 151069 MB (83% inode=99%);| /=30489MB;191195;191180;0;191275
Jun 30 17:08:20 morpheus07 nrpe[3314]: Command completed with return code 2 and output: PROCS OK: 227 processes
Jun 30 17:09:41 morpheus07 nrpe[3340]: Command completed with return code 2 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:10:41 morpheus07 nrpe[3372]: Command completed with return code 0 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:11:26 morpheus07 nrpe[3388]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:12:44 morpheus07 nrpe[3415]: Command completed with return code 2 and output: OK - load average: 2.03, 2.03, 2.00|load1=2.030;30.000;40.000;0; load5=2.030;30.000;40.000;0; load15=2.000;30.000;40.000;0;
Jun 30 17:13:08 morpheus07 nrpe[3428]: Command completed with return code 2 and output: DISK OK - free space: / 151067 MB (83% inode=99%);| /=30491MB;191195;191180;0;191275
Jun 30 17:13:20 morpheus07 nrpe[3433]: Command completed with return code 2 and output: PROCS OK: 219 processes
Jun 30 17:15:41 morpheus07 nrpe[3485]: Command completed with return code 2 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:16:26 morpheus07 nrpe[3500]: Command completed with return code 2 and output: USERS OK - 0 users currently logged in |users=0;2;5;0
Jun 30 17:16:41 morpheus07 nrpe[3508]: Command completed with return code 0 and output: check_3ware.pl: OK (Unit 0 at Controller 0 is OK)
Jun 30 17:17:44 morpheus07 nrpe[3580]: Command completed with return code 0 and output: OK - load average: 2.00, 2.00, 2.00|load1=2.000;30.000;40.000;0; load5=2.000;30.000;40.000;0; load15=2.000;30.000;40.000;0;
Jun 30 17:18:08 morpheus07 nrpe[3593]: Command completed with return code 0 and output: DISK OK - free space: / 151064 MB (83% inode=99%);| /=30494MB;191195;191180;0;191275
Jun 30 17:18:20 morpheus07 nrpe[3598]: Command completed with return code 0 and output: PROCS OK: 225 processes
[root@morpheus07 ~]#

Note that at 17:17:17 the NRPE daemon was restarted by a user, and this restored checks to properly returning 0.
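
The mismatch visible in the log above (status text says OK, return code says 2) can be flagged mechanically. What follows is a minimal Python sketch, not something taken from the original troubleshooting. It assumes Python 3.6+, the debug line format shown in the excerpts above, and the standard plugin convention that OK, WARNING, CRITICAL and UNKNOWN correspond to exit codes 0, 1, 2 and 3. The names `find_mismatches`, `LINE_RE`, `STATUS_RE` and `sample` are hypothetical.

```python
import re

# Matches the NRPE debug line format shown in the log excerpts above.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) nrpe\[(?P<pid>\d+)\]: "
    r"Command completed with return code (?P<code>\d+) and output: (?P<output>.*)$"
)

# Status words printed by the plugins, mapped to the exit code they imply
# (assumption: the usual 0/1/2/3 plugin convention applies to every check).
STATUS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}
STATUS_RE = re.compile(r"\b(OK|WARNING|CRITICAL|UNKNOWN)\b")


def find_mismatches(lines):
    """Yield (timestamp, pid, return_code, output) where text and code disagree."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a "Command completed" line
        status = STATUS_RE.search(match["output"])
        if status is None:
            continue  # no recognizable status word; cannot judge this line
        code = int(match["code"])
        if STATUS_CODES[status.group(1)] != code:
            yield match["stamp"], int(match["pid"]), code, match["output"]


# Two lines copied from the log above: one abnormal, one normal.
sample = [
    "Jun 30 17:13:20 morpheus07 nrpe[3433]: Command completed with return code 2 "
    "and output: PROCS OK: 219 processes",
    "Jun 30 17:18:20 morpheus07 nrpe[3598]: Command completed with return code 0 "
    "and output: PROCS OK: 225 processes",
]

for stamp, pid, code, output in find_mismatches(sample):
    print(f"{stamp} pid={pid} rc={code} text={output!r}")
```

Passing an open handle on the full log (for example `/var/log/daemon`, the file used in the grep above) instead of `sample` should list every affected check, which may help pin down when the first bad response appeared on each host.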

So far, this is what we've determined:
1) All compiled checks, and occasionally non-compiled checks, appear to suddenly start returning exit code 2, despite finding no issues.
2) Restarting the NRPE daemon has returned us to normal operation, but is not a permanent fix.
3) This appears to happen randomly; we haven't been able to associate any specific time or other event that triggers this occurring.
4) It impacts multiple instances of NRPE Daemon 2.15 on CentOS 5 and CentOS 6, but only for this one customer. Our internal systems and other customers have no such issues at all.
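
One way to narrow this down might be to run a plugin directly, outside NRPE, while the daemon is in the bad state, and compare the exit code the plugin really returns with the one NRPE logs. The Python sketch below is only an illustration under stated assumptions: it needs Python 3.6+ (which may not be the system default on CentOS 5 or 6), `run_check` is a hypothetical helper, and the command it runs is a stand-in that imitates a plugin printing OK while exiting with 2, so that the example is runnable anywhere.

```python
import subprocess
import sys


def run_check(argv, timeout=30):
    """Run a command directly and return (exit_code, first_line_of_output)."""
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # decode output as text
        timeout=timeout,
    )
    lines = completed.stdout.splitlines()
    return completed.returncode, (lines[0] if lines else "")


# Stand-in for a real plugin (assumption, for illustration only): it prints an
# OK status line but exits with 2, showing that the status text and the exit
# code are separate channels that the caller reads independently.
stand_in = [
    sys.executable,
    "-c",
    "import sys; print('PROCS OK: 1 processes'); sys.exit(2)",
]

code, text = run_check(stand_in)
print(f"exit code={code} output={text!r}")

# On an affected host the real command from the log could be used instead:
# run_check(["/usr/lib64/nagios/plugins/check_procs", "-w", "750", "-c", "1000"])
```

If the plugin returns 0 when run this way while NRPE logs 2 for the same command at the same time, that would point at the daemon's handling of the child process rather than at the plugin; if the plugin itself returns 2, the problem lies elsewhere.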

Has anyone seen anything like this before? I've searched all over, but if it is documented it is either hiding somewhere, or I'm not using the right search terms to uncover it.

On a side note, and possibly related, the last line in each log segment looks odd: the "Connection from ... closed" entry contains garbage characters. Shouldn't it show the IP address or port of the connection being closed?

Thanks!
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Encountering problems with NRPE

Post by ssax »

Is this happening on multiple servers or just one? It looks like a memory issue to me, but if it's happening on multiple servers that's very strange.

Can you look in your /var/log/messages when this first started occurring and see if you can see anything related? It might give an indication.


Thank you
ircgilbert
Posts: 4
Joined: Thu Jun 30, 2016 10:19 am

Re: Encountering problems with NRPE

Post by ircgilbert »

ssax wrote: Is this happening on multiple servers or just one? It looks like a memory issue to me, but if it's happening on multiple servers that's very strange.
I thought I covered this in the original post, but to be clear:

We've confirmed this happening on five separate systems to date. We're fairly certain there are more; we just haven't yet confirmed this as the root cause on them.

Four of them are CentOS 6 servers, one is CentOS 5.

Four of them are SuperMicro motherboards, one is a Dell.

One of them is running cPanel/WHM, the other four are not.

When you say it looks like memory issues, are you suggesting possibly a physical memory fault? I could see that making sense.
ssax wrote: Can you look in your /var/log/messages when this first started occurring and see if you can see anything related? It might give an indication.
I've checked a couple of servers, and all I can see around the time of the error is the NRPE restart and the SNMP service that runs for some other tasks. The NRPE error itself dates back to the time of install; the SNMP service was introduced afterward.

To be clear, we've seen these issues on these specific servers since we first rolled out the monitoring setup to them, about a year ago. That would make a little more sense in terms of physical memory issues if they were otherwise undetected before this. I haven't noticed any other evidence of memory issues, and if that's the case then it would be the largest single instance of otherwise undetected physical memory failure in a single environment that I've ever seen. When we're talking about 5 confirmed servers with this error state (in an environment of 28) and possibly more unconfirmed, it starts to become less believable. Not impossible, just less believable.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Encountering problems with NRPE

Post by tgriep »

I don't think he meant a physical memory fault, but rather that something in NRPE's memory was corrupted, which would explain why restarting NRPE fixes it.
Can you edit the nrpe.cfg file on one of the servers and enable debugging?
Maybe that will give us better clues on what is happening.
Be sure to check out our Knowledgebase for helpful articles and solutions!
ircgilbert
Posts: 4
Joined: Thu Jun 30, 2016 10:19 am

Re: Encountering problems with NRPE

Post by ircgilbert »

Attached is the debug log for the hour the issue was last detected.

16:51:20 is the time of the first buggy response.

Nothing obvious in /var/log/messages for the time.

/var/log/secure shows a couple of failed SSH login attempts about 8 minutes prior, but seems unrelated.
Attachments
morpheus07.txt
Debug log
(51.57 KiB) Downloaded 233 times
ircgilbert
Posts: 4
Joined: Thu Jun 30, 2016 10:19 am

Re: Encountering problems with NRPE

Post by ircgilbert »

Just bumping this; any ideas on what could be causing this, or other things we can attempt to resolve this?

Alternatively, is there any further detail we can provide that would help move this issue forward?
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Encountering problems with NRPE

Post by tgriep »

Thanks for the log file, but there is nothing in it that helps with the issue.
One thing you can try is to compile NRPE from source on your server and see if that helps. The way the RPM was compiled could be causing the issue.

Code: Select all

https://assets.nagios.com/downloads/nagiosxi/docs/How-To-Configure-NRPE-and-Install-From-Source-with-Nagios-XI.pdf