Sporadic Timeouts on Windows
Posted: Thu Jul 03, 2014 9:24 am
by cs_nagcc
Over the past week or so, I've noticed sporadic "Socket Timeout after 10 seconds" errors on three of my Windows servers. All three run the same kind of software, but a fourth server running that software hasn't shown any issues, and neither have the roughly 20 other Windows servers I monitor. When I run a test check from the Nagios server against one of the affected servers, I get a response about one time in three; the rest of the time I get the socket timeout error. I've tried increasing the timeout, but the check simply times out at the new limit. It almost seems as if a connection to the server is being opened and left open, so the next check attempt fails. When I run "netstat -a" on one of the Windows servers, I see a large number of connections from my Nagios server in the "TIME_WAIT" state. Is there a reason for that? Could it be causing the problem?
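To put a number on "a ton" of TIME_WAIT entries, the netstat output can be tallied per remote address. The following is only a rough sketch: it assumes the four-column Windows layout of `netstat -an` (Proto, Local Address, Foreign Address, State), and the function name `count_time_wait` is made up for illustration.

```python
from collections import Counter


def count_time_wait(netstat_output: str) -> Counter:
    """Count TIME_WAIT TCP connections per remote IP in `netstat -an` text.

    Assumes Windows-style rows: Proto, Local Address, Foreign Address, State.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # header, blank line, or a UDP row without a state
        proto, _local, remote, state = fields[:4]
        if proto.upper() == "TCP" and state == "TIME_WAIT":
            # Drop the port: split on the last colon only.
            remote_ip = remote.rsplit(":", 1)[0]
            counts[remote_ip] += 1
    return counts
```

Saving the output of `netstat -an` to a file on the Windows server and passing its text to this function would show whether the Nagios server's address really dominates the TIME_WAIT entries.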
Re: Sporadic Timeouts on Windows
Posted: Thu Jul 03, 2014 12:03 pm
by slansing
Hmm, sounds like a network-based issue. How many active checks are you running against these Windows servers, and how often? It's possible that the checks are stacking up, since they all hit the same port. Do you know of any other services on those systems that may be trying to use port 12489 or 5666, or whatever ports you changed the check_nt and check_nrpe defaults to?
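One way to separate a network-level problem from an agent-level one is to see how often a plain TCP connection to the agent port succeeds at all. This is a minimal sketch, not a replacement for check_nt or check_nrpe: it only opens and closes a socket and does not speak either protocol. The helper name `probe` and its parameters are assumptions for illustration.

```python
import socket
import time


def probe(host: str, port: int, attempts: int = 10,
          timeout: float = 10.0, pause: float = 0.0) -> int:
    """Return how many of `attempts` TCP connects to host:port succeed."""
    successes = 0
    for _ in range(attempts):
        try:
            # The context manager closes the socket right after connecting.
            with socket.create_connection((host, port), timeout=timeout):
                successes += 1
        except OSError:
            # Covers timeouts, refused connections, and unreachable hosts.
            pass
        time.sleep(pause)
    return successes
```

Running something like `probe("windows-host", 12489, attempts=30)` from the Nagios server (the hostname here is a placeholder) and comparing the result against an unaffected server would show whether the bare connection already fails about two times in three.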
Re: Sporadic Timeouts on Windows
Posted: Thu Jul 03, 2014 12:27 pm
by cs_nagcc
There are currently 7 checks every 3 minutes. I think your theory of checks "stacking up" is exactly what is happening, but I'm not sure why it started so suddenly: the system ran fine for over a year before these socket timeouts appeared. No other processes are using the standard Nagios ports, so I'm wondering if it is blocking itself.
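As a rough sanity check on the "stacking up" theory, the number of sockets expected to sit in TIME_WAIT at any moment is the connection rate multiplied by how long each socket lingers. The sketch below assumes one TCP connection per check and a 240-second TIME_WAIT period; the real value depends on the Windows version and its TcpTimedWaitDelay setting, so treat that figure as an assumption, not a measurement.

```python
def steady_state_time_wait(checks: int, interval_s: float,
                           time_wait_s: float) -> float:
    """Expected sockets in TIME_WAIT: arrival rate times linger time."""
    rate_per_second = checks / interval_s
    return rate_per_second * time_wait_s


# 7 checks every 3 minutes, with an ASSUMED 240 s TIME_WAIT period.
expected = steady_state_time_wait(checks=7, interval_s=180, time_wait_s=240)
print(f"about {expected:.1f} sockets in TIME_WAIT")
```

Under those assumptions the scheduled checks alone would account for only about 9 lingering sockets, so a much larger count in netstat may point to retries or some other source of connections from the Nagios server.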
Re: Sporadic Timeouts on Windows
Posted: Thu Jul 03, 2014 2:28 pm
by lmiltchev
Do you see anything in the nsclient.log that can shed some light on the cause of the problem?
Re: Sporadic Timeouts on Windows
Posted: Mon Jul 07, 2014 5:59 am
by cs_nagcc
Unfortunately not. I can see when the service returns to the "OK" state, but I don't see anything when it times out. Is there a way to see more verbose logs?
Re: Sporadic Timeouts on Windows
Posted: Mon Jul 07, 2014 4:50 pm
by lmiltchev
Under the [/settings/log] section, you can change this line:
to this:
then restart the nsclient++ service.
Re: Sporadic Timeouts on Windows
Posted: Tue Jul 08, 2014 7:25 am
by cs_nagcc
Thanks lmiltchev. I didn't see a "level" option in the log section of the ini file, so I added it. Unfortunately, it didn't seem to change what gets logged. As a side note, I'm running an older version of the nsclient++ executable (0.3.9.327). I would like to blame the older version for this issue, but the same version runs on my other Windows servers without issues.
Re: Sporadic Timeouts on Windows
Posted: Tue Jul 08, 2014 7:30 am
by eloyd
Are you running the Windows firewall service on these machines? If so, it may be rate-limiting the number of connections allowed.
Re: Sporadic Timeouts on Windows
Posted: Tue Jul 08, 2014 7:42 am
by cs_nagcc
Nope, no firewall running on the servers.
Re: Sporadic Timeouts on Windows
Posted: Tue Jul 08, 2014 7:48 am
by eloyd
I can't say what it is, but I'm pretty sure it's not Nagios. I'm guessing something on the Windows server(s) is blocking after too many open connections, too many TIME_WAIT sockets, or something similar. Are these four boxes exactly the same, used the same amount, and responding to approximately the same number of requests, or are the three affected ones used more than the fourth?