Problems with check_nrpe and timeouts

gimmer
Posts: 4
Joined: Fri Aug 19, 2011 12:21 pm

Post by gimmer »

Hello,

I've seen a lot of documentation about various timeout messages from check_nrpe and check_http. However, I can't find anything relating to my problem.

I have Nagios Core installed on a CentOS 5.6 64-bit VM running on an ESX server.

I have 27 hosts configured. Each host averages about five checks: two HTTP and three NRPE.

Two of the hosts (call them A and B) keep timing out on all of their checks.

check_nrpe returns the error "Timeout while attempting connection". Running the check manually from the Nagios server's command line also times out. I have the timeout for check_nrpe set to 30 seconds, but I've raised it as high as ten minutes (raising the global timeout to match) and the check still fails once the ten minutes are up. Running the same check manually from another server (a CentOS 5.6 64-bit VM on the exact same ESX server) works flawlessly.

check_http returns the error "Connection timed out: HTTP CRITICAL - Unable to open TCP socket". Running it manually from the command line returns the same error. I've also tried extending this check's timeout, with no change. Running it manually from my other VM on the same ESX server works fine.
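Since both plugins fail while opening the connection, a bare TCP connect from each machine can help show whether the problem sits below the plugins. The following is a minimal sketch, not part of the original setup: the function name probe_tcp is made up for illustration, and it assumes NRPE is listening on its default port 5666 and HTTP on port 80.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt one TCP connection and classify the outcome.

    Returns 'open', 'timeout', 'refused', or 'error: <details>'.
    probe_tcp is a hypothetical helper name, not a Nagios tool.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within `timeout` seconds: packets are likely dropped
        # somewhere between this machine and the target.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

Running probe_tcp(address_of_host_A, 5666) and probe_tcp(address_of_host_A, 80) from both the Nagios server and the working VM would be one way to compare them. A "timeout" from one machine and "open" from the other, at the same moment, would suggest filtering or routing between the Nagios server and the host rather than a plugin or NRPE daemon problem.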

This does not happen 100% of the time for these hosts! Every few minutes the checks manage to get through, and Nagios reports all the services as recovered. It's worth noting that the servers are not going offline: my Nagios installation currently runs alongside the monitoring our hosting provider supplies, and that has not alerted in months.
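Because the failures are intermittent, it may help to record the outcome of each attempt over time and summarize the pattern before taking it to the hosting provider. This is only an illustrative sketch: the summarize function and the (timestamp, outcome) record format are assumptions, with "open" standing for a successful connection.

```python
from collections import Counter


def summarize(results):
    """Summarize a list of (timestamp_seconds, outcome) tuples.

    Any outcome other than 'open' is counted as a failure.
    The record format is an assumption made for this sketch.
    """
    counts = Counter(outcome for _, outcome in results)
    total = len(results)

    # Track the longest run of consecutive failed attempts.
    longest = current = 0
    for _, outcome in results:
        current = current + 1 if outcome != "open" else 0
        longest = max(longest, current)

    return {
        "total": total,
        "success_rate": counts["open"] / total if total else 0.0,
        "longest_failure_streak": longest,
        "counts": dict(counts),
    }
```

A long failure streak broken by short bursts of success, seen only for hosts A and B, would match the behavior described above and gives the provider something concrete to look at.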

The remaining hosts (C, D, etc.) are fine. Their checks are spaced out. They occasionally time out, but recover immediately. The commands are the same for all the servers (I've just put multiple hosts into the host name so each one can run, and used macros to populate the IP addresses). Our hosting provider shows nothing out of the ordinary for these hosts.

Any information you can provide would be fantastic.

Re: Problems with check_nrpe and timeouts

Post by gimmer »

Additional update:

I migrated the Nagios install off the ESX server and onto a physical box, and the same problem is occurring. Strangely enough, it's the same two hosts.

I'm starting to think the hosting company is at fault.