nrpe tests suddenly failing

Posted: Tue Nov 26, 2013 4:55 pm
by jssingh
I have a machine that runs nagios for a cluster of machines, and they are all monitored via nrpe. It has been working fine, and then all of a sudden today, for only one machine, it is returning "Connection refused by host". I thought that there must be something wrong with the daemons on that machine, but everything looks fine, and if I run the check from the command line it gives the expected output:

Code:

-bash-3.2$ ./check_nrpe -H host-name -c check_load
OK - load average: 11.09, 11.81, 14.19|load1=11.090;200.000;200.000;0; load5=11.810;200.000;200.000;0; load15=14.190;200.000;200.000;0; 
-bash-3.2$
I checked /var/log/syslog and it is not logging anything for these failures.

Has anyone else seen nrpe fail suddenly like that?

thanks,
-janice

Re: nrpe tests suddenly failing

Posted: Wed Nov 27, 2013 12:41 pm
by slansing
If you navigate to Services > "Service Name of one of your checks" > re-schedule check, does it return a normal value? Try running this and your manual command from the command line at the same time, and see whether they both return valid info or both show connection refused.

Re: nrpe tests suddenly failing

Posted: Wed Nov 27, 2013 2:56 pm
by jssingh
What turned out to be happening is that there were a bunch of lines in the nagios log that said:
Warning: The check of service 'service-name' on host 'host-name' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
So I set max_concurrent_checks to 84 (that was pretty arbitrary) and the orphan messages stopped (instead there are now a bunch of messages about hitting the max number of checks), and that made the nrpe commands suddenly start working again.
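
For anyone wanting to see how often the orphan warning shows up, a minimal shell sketch follows. The count_orphans helper is a hypothetical name, and the default log path is an assumption about a source install; pass your own log file as the first argument if yours lives elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: count "orphaned" warnings in a Nagios log file.
# grep -c prints the number of matching lines (0 if there are none).
count_orphans() {
    grep -c "looks like it was orphaned" "$1"
}

# ASSUMPTION: this default path is only a guess at a typical source install.
log_file="${1:-/usr/local/nagios/var/nagios.log}"

if [ -r "$log_file" ]; then
    echo "orphaned check warnings: $(count_orphans "$log_file")"
else
    echo "cannot read $log_file" >&2
fi
```

Running it before and after changing max_concurrent_checks gives a rough idea of whether the warnings have really stopped.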

I'm not entirely sure I understand all the cause and effect going on here, but I'm at least happy that the tests are working again!!

thanks,
-janice

Re: nrpe tests suddenly failing

Posted: Wed Nov 27, 2013 3:48 pm
by slansing
Great! It looks like you were reaching a default max check forks limit. Do you have checks that take more than 10 seconds on average to complete?

Re: nrpe tests suddenly failing

Posted: Wed Nov 27, 2013 4:00 pm
by jssingh
In general, no. However, there is one test that comes back in less than a second when everything is ok, but if there is a problem with the filesystem it is checking, it will sometimes hit the timeout limit, which is 60 seconds. That test is run on 20 filesystems that are mounted on 20 hosts. There didn't seem to be any filesystem errors when this problem started, but do you think that could be related?

Re: nrpe tests suddenly failing

Posted: Sat Nov 30, 2013 11:23 am
by sreinhardt
So if I am understanding correctly:

20 hosts x 20 services = 400 checks with a possibility of 60 seconds each

If each of those hosts had one shared file system that is taking a long time, that's a minimum of 20 checks tied up for 60 seconds each. Then, depending on your check intervals, and especially if this causes a retry, the follow-up checks might or might not take as long. I would say this absolutely could be a large part of the cause. I should also note that when one of those fails, it will trigger an on-demand check of the host for that service, creating another 20 potential checks, although they should be quick in most cases.
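
The arithmetic above can be sketched in shell as a rough worst case. The value 84 is the max_concurrent_checks setting mentioned earlier in the thread. Treating the checks as running in full "waves" that each last the whole timeout is a simplifying assumption for illustration, not a description of how the Nagios scheduler actually behaves, and the variable names are made up for this example.

```shell
#!/bin/sh
# Numbers taken from the thread; variable names are illustrative only.
hosts=20            # hosts mounting the filesystems
filesystems=20      # filesystems checked on each host
timeout=60          # seconds a stuck check waits before timing out
max_concurrent=84   # the max_concurrent_checks value set earlier

# Total filesystem checks across the cluster.
checks=$((hosts * filesystems))

# ASSUMPTION: checks run in full waves of max_concurrent (ceiling division).
waves=$(( (checks + max_concurrent - 1) / max_concurrent ))

# Worst case: every check in every wave runs until the timeout.
worst=$((waves * timeout))

echo "checks=$checks waves=$waves worst_case_seconds=$worst"
# prints: checks=400 waves=5 worst_case_seconds=300
```

Under these assumptions, a filesystem outage affecting every check could occupy the check slots for about five minutes, which is consistent with other checks being starved in the meantime.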

Re: nrpe tests suddenly failing

Posted: Tue Dec 03, 2013 6:57 pm
by jssingh
Thanks for the feedback! I talked to the systems people and they agreed to a shorter timeout on those checks, so they shouldn't take more than 10 seconds now. We'll see if that makes things better.

One more question. Is the fork limit something that is only in nagios 3? I wasn't sure if that was something that changed with the worker processes in nagios 4.

Re: nrpe tests suddenly failing

Posted: Wed Dec 04, 2013 11:07 am
by slansing
This is the technical definition of what is going on now with Core Workers:
Core Workers: The process of performing checks is now handled by a lightweight core worker process.
- There are standard worker processes that are created when Core starts that stay running as long as Core is running. This eliminates at least one fork of Nagios Core when a check is performed and in many cases two forks, thus speeding up the checks.