I have a machine that runs Nagios for a cluster of machines, all of which are monitored via NRPE. It had been working fine, and then all of a sudden today one machine started returning "Connection refused by host". I thought there must be something wrong with the daemons on that machine, but everything looks fine, and if I run the check from the command line it gives the expected output:
If you navigate to Services > "Service Name of one of your checks" > re-schedule check does it return a normal value? Try running this and your manual command from the command line at the same time and see if they both return valid info, or if they both show connection refused.
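To help tell a real TCP-level refusal apart from a problem on the Nagios side, one option is to probe the NRPE port directly from the Nagios server while the check is failing. This is only a minimal sketch: `probe_tcp_port` is a hypothetical helper, and it assumes NRPE is listening on its default port 5666 (change it if `server_port` is set differently on the monitored host).

```python
import socket

# Assumption: NRPE listens on its default port; adjust if your setup differs.
NRPE_DEFAULT_PORT = 5666


def probe_tcp_port(host, port=NRPE_DEFAULT_PORT, timeout=5.0):
    """Try a plain TCP connect and report 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote side actively rejected the connection (nothing listening,
        # or a firewall sending a reset).
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and so on.
        return f"error: {exc}"
```

Calling `probe_tcp_port("host-name")` from the Nagios server (with your real host name in place of the placeholder) while the service shows "Connection refused by host" gives a quick second opinion. If the port reports "open" at the same moment, the problem is more likely on the Nagios side than in the NRPE daemon.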
What turned out to be happening is that there were a bunch of lines in the nagios log that said:
Warning: The check of service 'service-name' on host 'host-name' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
So I set max_concurrent_checks to 84 (that was pretty arbitrary), and then the orphan messages stopped (instead there are now a bunch of messages about hitting the max number of checks), and the NRPE commands suddenly started working again.
I'm not entirely sure I understand all the cause and effect going on here, but I'm at least happy that the tests are working again!!
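For anyone who lands here with the same symptom, a quick way to see which checks are being orphaned is to tally those warnings per host and service. The sketch below matches only the message text quoted above; `count_orphaned_checks` is a hypothetical helper, and the log path in the comment is an assumption that depends on how Nagios was installed.

```python
import re
from collections import Counter

# Matches the orphan warning quoted above and captures the service and host names.
ORPHAN_RE = re.compile(
    r"Warning: The check of service '(?P<service>[^']+)' "
    r"on host '(?P<host>[^']+)' looks like it was orphaned"
)


def count_orphaned_checks(lines):
    """Return a Counter keyed by (host, service) with the number of orphan warnings."""
    counts = Counter()
    for line in lines:
        match = ORPHAN_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("service"))] += 1
    return counts


# Example usage (the path is an assumption; use the log_file value from your nagios.cfg):
# with open("/usr/local/nagios/var/nagios.log", errors="replace") as log:
#     for (host, service), n in count_orphaned_checks(log).most_common(10):
#         print(f"{n:5d}  {host}  {service}")
```

If the same few services dominate the tally, those are probably the slow checks worth looking at first.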
In general, no. However, there is one test that comes back in less than a second when everything is OK. But if there is a problem with the filesystem it is checking, it will sometimes hit the timeout limit, which is 60 seconds. That test is run on 20 filesystems that are mounted on 20 hosts. There didn't seem to be any filesystem errors when this problem started, but do you think that could be related?
20 hosts x 20 services = 400 checks with a possibility of 60 seconds each
If each of those hosts has one shared file system that is taking a long time, that's a minimum of 20 checks tied up for 60 seconds each. Depending on your check intervals, and especially if this triggers a retry that may or may not take as long, I would say this absolutely could be a large part of the cause. I should also note that when one of those checks fails, it triggers an on-demand check of the host for that service, creating another 20 potential checks, although those should be quick in most cases.
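To put rough numbers on that, here is a back-of-the-envelope estimate using Little's law (average checks in flight = checks started per second × seconds each check runs). The 300-second check interval is purely an assumption, since the thread doesn't state it, and the estimate ignores retries and on-demand host checks, so treat it as a sketch rather than a prediction.

```python
def average_checks_in_flight(num_checks, run_seconds, interval_seconds):
    """Little's law: arrival rate (num_checks / interval) times how long each check runs."""
    return num_checks * run_seconds / interval_seconds


hosts = 20
services_per_host = 20
total_checks = hosts * services_per_host  # 400 checks, as in the post above

# Assumption: every check is scheduled once every 5 minutes.
CHECK_INTERVAL_SECONDS = 300

# Healthy case: the check returns in about a second (the poster says "less than a second").
healthy = average_checks_in_flight(total_checks, 1, CHECK_INTERVAL_SECONDS)

# Worst case: every check runs into the 60-second timeout.
stalled = average_checks_in_flight(total_checks, 60, CHECK_INTERVAL_SECONDS)

print(f"healthy: about {healthy:.1f} checks in flight")
print(f"stalled: about {stalled:.1f} checks in flight")
```

Under these assumptions the healthy case keeps one or two checks running at a time, while the all-timeouts case holds about 80 slots, which is in the same range as a max_concurrent_checks limit of 84.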
Thanks for the feedback! I talked to the systems people and they agreed to a shorter timeout on those checks, so they shouldn't take more than 10 seconds now. We'll see if that makes things better.
One more question. Is the fork limit something that is only in nagios 3? I wasn't sure if that was something that changed with the worker processes in nagios 4.
This is the technical definition of what is going on now with Core Workers:
Core Workers: The process of performing checks is now handled by a lightweight core worker process.
- There are standard worker processes that are created when Core starts that stay running as long as Core is running. This eliminates at least one fork of Nagios Core when a check is performed and in many cases two forks, thus speeding up the checks.
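The general idea of long-lived workers can be illustrated with a small pool that is created once and reused for every check, instead of setting up a new handler per check. This is only a conceptual sketch and not how Nagios Core implements its workers: it uses Python threads as stand-ins for worker processes, and `run_check`, `run_checks` and the `workers` parameter are hypothetical names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_check(command, timeout_seconds=10):
    """Run one plugin-style command and return (exit_code, first line of output).

    exit_code is None when the command exceeds the timeout.
    """
    try:
        done = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return None, "check timed out"
    lines = done.stdout.strip().splitlines()
    first_line = lines[0] if lines else ""
    return done.returncode, first_line


def run_checks(commands, workers=4, timeout_seconds=10):
    """Run all commands on a fixed-size pool; results come back in input order."""
    # The pool size caps how many checks run at once, loosely similar in spirit
    # to a concurrency limit. The same workers are reused for every command.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cmd: run_check(cmd, timeout_seconds), commands))
```

Each command is a list of arguments, for example a plugin path followed by its options. A slow check occupies one worker until it finishes or times out, which is why a handful of 60-second checks can starve everything queued behind them.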