more frequent false alarms in 2012R2.2
Posted: Sun Oct 20, 2013 1:54 pm
by KiwiBloke
Hi,
We migrated to new 2012R2.2-based VMs approximately 5 months ago, and we get far more false alarms with this version than with the previous one.
We predominantly monitor Windows servers using NSClient++, along with some ESXi servers and Cisco switches via SNMP polling. We only seem to get false alarms with NSClient++-configured services.
An example of a false alarm would be:
Code:
Nagios has detected a problem with this service.
Notification Type: PROBLEM
Service: Uptime
Host: psm4syslog1.fnz.com
Address: 192.168.227.47
State: CRITICAL
Info:
CRITICAL - Socket timeout after 10 seconds
Date/Time: 2013-10-20 21:35:05
The version of NSClient++ we are running is 0.3.9.328 (x64)
Perhaps we need to upgrade the client?
Re: more frequent false alarms in 2012R2.2
Posted: Mon Oct 21, 2013 10:17 am
by slansing
You should not have to update the client; for some reason the check is timing out. Does this service time out constantly now, or does it sometimes return a valid result?
If it always times out, I'd recommend starting with a longer timeout for the check and running it manually from the command line, like so:
Code:
/usr/local/nagios/libexec/check_nt -H windows.ip.addr -p 12489 -t 30 -v UPTIME
Note: "-t 30" sets a timeout of 30 seconds.
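To see whether the timeouts are isolated or across the board, the same check can be looped over several hosts. This is only a rough sketch: the `CHECK_NT` variable and the `check_uptime` and `sweep_hosts` helper names are made up for illustration, and it assumes no NSClient++ password is set (otherwise `-s` would need adding).

```shell
# Hypothetical helper: path to the plugin, overridable for testing.
CHECK_NT="${CHECK_NT:-/usr/local/nagios/libexec/check_nt}"

check_uptime() {
    # $1 = host address, $2 = timeout in seconds (defaults to 30)
    "$CHECK_NT" -H "$1" -p 12489 -t "${2:-30}" -v UPTIME
}

sweep_hosts() {
    # Print one tab-separated line per host: address, plugin exit code, plugin output.
    for host in "$@"; do
        if output=$(check_uptime "$host" 30); then
            status=0
        else
            status=$?
        fi
        printf '%s\t%s\t%s\n' "$host" "$status" "$output"
    done
}

# Example (placeholder addresses):
#   sweep_hosts 192.0.2.10 192.0.2.11
```

The exit code column follows the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), so hosts that time out should stand out from those that answer.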
Re: more frequent false alarms in 2012R2.2
Posted: Tue Oct 22, 2013 5:22 pm
by KiwiBloke
Hi,
It seems to flap, but it is across the board.
I will try the command as you say and let you know how we get on.
Cheers,
C.
Re: more frequent false alarms in 2012R2.2
Posted: Tue Oct 22, 2013 5:41 pm
by KiwiBloke
Hi,
So that was interesting. I checked our recent emails for a server that has been flapping with this error, targeted it, and got the following response:
Code:
[root@psu4nagiosxi libexec]# ./check_nt -H 10.139.1.25 -p 12489 -t 30 -v UPTIME
CRITICAL - Socket timeout after 30 seconds
Yet when I tried another server, I got an answer:
Code:
[root@psu4nagiosxi libexec]# ./check_nt -H 10.137.1.21 -s <REDACTED> -p 12489 -t 30 -v UPTIME
System Uptime - 183 day(s) 21 hour(s) 23 minute(s)
[root@psu4nagiosxi libexec]#
When I tried the same syntax with the original server, I still got the same timeout error. I think I will need to take this to our network guys, as it looks like Nagios is doing everything it should.
Cheers,
C.
Re: more frequent false alarms in 2012R2.2
Posted: Wed Oct 23, 2013 10:01 am
by abrist
KiwiBloke wrote: CRITICAL - Socket timeout after 30 seconds
I would guess it is one of the following issues:
1. Firewall issues
2. NSClient service not running
3. Incorrect password
4. Nagios server IP not declared in allowed hosts
Best of luck!
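A quick way to narrow that list down is to check whether a TCP connection to the NSClient++ port opens at all. The sketch below is only an illustration: `probe_port` is a made-up helper name, and it assumes `bash` (for its `/dev/tcp` redirection) and the coreutils `timeout` command are available on the Nagios server.

```shell
probe_port() {
    # $1 = host, $2 = port (defaults to 12489), $3 = seconds to wait (defaults to 5)
    # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
    # timeout kills the attempt if nothing answers in time.
    if timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "${2:-12489}" 2>/dev/null; then
        echo "$1:${2:-12489} accepts TCP connections"
    else
        echo "$1:${2:-12489} refused or timed out"
        return 1
    fi
}

# Example (placeholder address):
#   probe_port 192.0.2.10 12489 5
```

A failure here would point towards the firewall or the service not running. A success only shows that the port is reachable; it says nothing about the password or the allowed hosts setting.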
Re: more frequent false alarms in 2012R2.2
Posted: Mon Oct 28, 2013 4:18 pm
by KiwiBloke
Hi,
Minor breakthrough.
We think it's the version of NSClient++ we are running.
A colleague had reason to run netstat on one of our monitored hosts (netstat -anb) and discovered over 20k TIME_WAIT connections on TCP 12489 to our Nagios server.
I have repeated this on several other servers and found the same thing.
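For anyone wanting to repeat the count without scrolling through the output, here is a rough sketch. It assumes the Windows `netstat -an` output has been saved to a text file and copied somewhere with `awk` available; `count_time_wait` and the file name in the usage line are made-up names for illustration.

```shell
count_time_wait() {
    # $1 = local port to look for; reads Windows-style `netstat -an` output on stdin.
    # Expected columns: Proto, Local Address, Foreign Address, State.
    awk -v port="$1" '
        { sub(/\r$/, "") }   # strip the carriage return from Windows line endings
        $1 == "TCP" && $2 ~ (":" port "$") && $NF == "TIME_WAIT" { n++ }
        END { print n + 0 }
    '
}

# Example (hypothetical file name):
#   count_time_wait 12489 < netstat_output.txt
```

Only lines whose local address ends in the given port are counted, so TIME_WAIT entries for other services on the same host are ignored.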
A Google search found this:
http://www.nsclient.org/nscp/discussion/topic/1142 which then led to this:
http://support.microsoft.com/kb/2553549
Most of our monitored hosts are Windows 2008, and we have been unable to get a change window for security patching for a long time (easily over 300 days), so it seems all the servers we have been having issues with haven't been rebooted as part of any other work and are hitting this issue.
I will need to take this up with my colleagues and press for that window!
Cheers,
C.
Keywords for other users with the same problem:
socket timeout nsclient time_wait connections uptime
Re: more frequent false alarms in 2012R2.2
Posted: Mon Oct 28, 2013 4:28 pm
by slansing
Sounds good, let us know how it goes. This sounds like it very well may be the resolution to this particular problem.
Re: more frequent false alarms in 2012R2.2
Posted: Wed Nov 06, 2013 10:40 pm
by KiwiBloke
Hi,
I had a bunch of low-risk, non-platform servers that were showing the same issue and got approval to apply the hotfix and reboot.
I'm not able to say whether this has fixed the issue yet, as we will need to wait another 300 days, but so far at least I have not seen any connections with status = TIME_WAIT backing up from requests from the Nagios XI server, and the log files are under control.
Cheers
Re: more frequent false alarms in 2012R2.2
Posted: Thu Nov 07, 2013 10:23 am
by slansing
Well.....alrighty! Let us know! You can take a look at NCPA in the interim!
http://assets.nagios.com/downloads/ncpa ... g_NCPA.pdf