For example:
1. A server starts sending 'socket timeout after 30 seconds' alerts (we'll assume there is a valid issue causing this) and continues to send one every 10 minutes.
2. Other servers (usually ones with lots of checks, and usually at night) send the same message, but when I log in to the console and run 'schedule check immediately', the alert clears right away.
Scenario 1 indicates a valid issue, but the error message only conveys that something is wrong with the NSClient++ service.
- This is currently happening to a server, and since other issues are appearing alongside it (e.g. no RDP), it is probably legitimate and not just a service being overloaded.
- The NRPE socket timeout is happening to all services on this host.
Scenario 2 appears to indicate the same issue, but when investigated, nothing is wrong and the alert clears as soon as I force a check.
- This happened to 3 servers (ones that have lots of checks) over a weekend, and only by increasing the vCPU count on the Nagios XI server was I able to clear up the problem.
- The NRPE socket timeout is happening to only some services on these hosts, and the affected services change from one occurrence to the next; it is not the same ones every time.
- It has also happened to a few URL checks (only one check per host), but those usually clear up quickly.
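One clue that may separate scenario 2 from a real outage is the Nagios server's own check latency: when the scheduler falls behind, latency climbs across many services at once, which matches the "lots of checks, cleared by adding vCPUs" pattern above. A minimal sketch using the `nagiostats` utility that ships with Nagios Core (and underlies XI); the exact MRTG variable name is an assumption from the `nagiostats` documentation, so verify it with `nagiostats --help` on your install:

```
# Average latency (seconds) of active service checks.
# High or rising values across the board suggest the Nagios
# server itself is overloaded, not the monitored hosts.
nagiostats --mrtg --data=AVGACTSVCLAT
```

Graphing that value (or alerting on it) would at least tell you whether a burst of socket-timeout alerts coincides with the scheduler running behind.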
So, in both scenarios the error message is the same, but the root problem is not. Does that make sense?
I need to know how to get Nagios to differentiate between itself getting overloaded and a server legitimately being down/broken, if for no other reason than to reduce the amount of spam Nagios sends our NOC team.
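For what it's worth, one mitigation I'd consider while waiting for a proper answer is giving slow NRPE responses more headroom, so that a briefly overloaded poller doesn't fire timeouts at the NOC. This is only a sketch; the directive values are assumptions to be tuned, not recommendations, and paths/macros follow a stock Core-style layout:

```
# nagios.cfg -- global knobs (values here are illustrative)
service_check_timeout=60      ; allow checks longer than the default before Nagios kills them
max_concurrent_checks=0       ; 0 = unlimited; a bounded value sized to CPU count can smooth load spikes

# commands.cfg -- pass an explicit timeout to check_nrpe itself
define command {
    command_name  check_nrpe_long
    command_line  $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c $ARG1$
}
```

Raising `-t` on `check_nrpe` above the 30 seconds in the alert text won't fix a genuinely dead host (scenario 1 will still time out and alert), but it should suppress the transient scenario 2 noise caused by the Nagios server being momentarily starved.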