more frequent false alarms in 2012R2.2

KiwiBloke
Posts: 81
Joined: Fri Apr 27, 2012 7:23 pm

more frequent false alarms in 2012R2.2

Post by KiwiBloke »

Hi,

We migrated to new 2012R2.2-based VMs approximately 5 months ago, and we get far more false alarms with this version than with the previous one.

We predominantly monitor Windows servers using NSClient++, along with some ESXi servers and Cisco switches via SNMP polling. We only seem to get false alarms with NSClient++-configured services.

An example of a false alarm would be:

Code: Select all

Nagios has detected a problem with this service.

Notification Type: PROBLEM

Service: Uptime
Host: psm4syslog1.fnz.com
Address: 192.168.227.47
State: CRITICAL
Info:
CRITICAL - Socket timeout after 10 seconds
Date/Time: 2013-10-20 21:35:05

The version of NSClient++ we are running is 0.3.9.328 (x64).

Perhaps we need to upgrade the client?
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: more frequent false alarms in 2012R2.2

Post by slansing »

You should not have to update the client; for some reason the check is timing out. Does this service constantly time out now, or does it sometimes return a valid check?

If it always times out, I'd recommend starting with a longer timeout for the check and running it manually from the command line, like so:

Code: Select all

/usr/local/nagios/libexec/check_nt -H windows.ip.addr -p 12489 -t 30 -v UPTIME

Note: "-t 30" sets a timeout of 30 seconds.
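
To tell "always times out" apart from "sometimes times out", the same check can be run several times in a row and the results tallied. Below is a minimal Python sketch of that idea. It relies only on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the helper name `tally_check` and the run count are made up for illustration, and the host in the usage comment is the placeholder from the command above.

```python
import subprocess
import time
from collections import Counter

# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def tally_check(cmd, runs=10, pause=0.0):
    """Run a plugin command `runs` times and count the resulting states.

    `tally_check` is a hypothetical helper, not part of Nagios.
    """
    counts = Counter()
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        # Any exit code outside 0-3 is treated as UNKNOWN.
        counts[STATES.get(result.returncode, "UNKNOWN")] += 1
        time.sleep(pause)
    return counts


# Example usage (placeholder host, same options as the command above):
# cmd = ["/usr/local/nagios/libexec/check_nt", "-H", "windows.ip.addr",
#        "-p", "12489", "-t", "30", "-v", "UPTIME"]
# print(tally_check(cmd, runs=20, pause=5))
```

A result that is all CRITICAL suggests a hard failure, while a mix of OK and CRITICAL matches the flapping pattern.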

Re: more frequent false alarms in 2012R2.2

Post by KiwiBloke »

Hi,

It seems to flap, but it is across the board.

I will try the command as you say and let you know how we get on.

Cheers,

C.

Re: more frequent false alarms in 2012R2.2

Post by KiwiBloke »

Hi,

So that was interesting. I checked our recent emails for a server that has been flapping with this error, targeted it, and got the following response:

Code: Select all

[root@psu4nagiosxi libexec]# ./check_nt -H 10.139.1.25 -p 12489 -t 30 -v UPTIME
CRITICAL - Socket timeout after 30 seconds

Yet when I tried another server, I got an answer:

Code: Select all

[root@psu4nagiosxi libexec]# ./check_nt -H 10.137.1.21 -s <REDACTED> -p 12489 -t 30 -v UPTIME
System Uptime - 183 day(s) 21 hour(s) 23 minute(s)
[root@psu4nagiosxi libexec]#

When I tried the same syntax against the original server, I still got the same timeout error. I think I will need to take this to our network guys, as it looks like Nagios is doing everything it should.

Cheers,

C.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: more frequent false alarms in 2012R2.2

Post by abrist »

KiwiBloke wrote: CRITICAL - Socket timeout after 30 seconds
I would guess it is one of the following issues:
1. Firewall issues
2. NSClient service not running
3. Incorrect password
4. Nagios server IP not declared in allowed hosts
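
One rough way to narrow down the list above is to test the raw TCP connection to the NSClient++ port from the Nagios server. The sketch below assumes Python 3 is available there; `probe_port` is a hypothetical helper, not part of Nagios or NSClient++. As a rule of thumb only: a timeout usually points to a firewall silently dropping packets or an unreachable host, a refusal usually means nothing is listening on the port, and a successful connect leaves the password and allowed hosts settings as the likelier candidates.

```python
import socket


def probe_port(host, port=12489, timeout=5.0):
    """Try a plain TCP connect and report how it ended.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection.
        return "refused"
    except socket.timeout:
        # No answer at all within `timeout` seconds.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. no route to host or a name lookup failure.
        return f"error: {exc}"


# Example usage, with the address from the failing check in this thread:
# print(probe_port("10.139.1.25"))
```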

Best of luck!
Former Nagios employee

Re: more frequent false alarms in 2012R2.2

Post by KiwiBloke »

Hi,

Minor breakthrough.

We think it's the version of NSClient++ we are running.

A colleague had reason to run netstat on one of our monitored hosts (netstat -anb) and discovered over 20k TIME_WAIT connections on TCP 12489 to our Nagios server.

I have repeated this on several other servers and found the same thing.
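
For anyone wanting to repeat that count without eyeballing the output, here is a rough Python sketch that tallies TIME_WAIT entries per remote address from saved `netstat -an` output. It assumes the usual Windows column layout (Proto, Local Address, Foreign Address, State); the function name `count_time_wait` and the sample addresses are made up for illustration.

```python
from collections import Counter


def count_time_wait(netstat_output, port=12489):
    """Count TIME_WAIT connections on local `port`, grouped by remote IP."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto, Local Address, Foreign Address, State.
        if len(fields) < 4 or fields[0] != "TCP" or fields[3] != "TIME_WAIT":
            continue
        local_port = fields[1].rsplit(":", 1)[-1]
        if local_port == str(port):
            remote_ip = fields[2].rsplit(":", 1)[0]
            counts[remote_ip] += 1
    return counts


# Hypothetical sample in the assumed layout (addresses are illustrative).
SAMPLE = """
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:12489          0.0.0.0:0              LISTENING
  TCP    10.139.1.25:12489      192.0.2.10:51000       TIME_WAIT
  TCP    10.139.1.25:12489      192.0.2.10:51001       TIME_WAIT
  TCP    10.139.1.25:3389       192.0.2.20:40000       ESTABLISHED
"""

print(count_time_wait(SAMPLE))
```

On a monitored host, the output of `netstat -an` can be saved to a file and passed in as a string. A count in the thousands for the Nagios server's address would match what we saw.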

A Google search found this: http://www.nsclient.org/nscp/discussion/topic/1142 which then led to this: http://support.microsoft.com/kb/2553549

Most of our monitored hosts are Windows 2008, and we have been unable to get a change window for security patching for a long time (easily over 300 days), so it seems all the servers that we have been having issues with haven't been rebooted as part of any other work and so are hitting this issue.

I will need to take this up with my colleagues and press for that window!

Cheers,

C.

Keywords for other users with the same problem:
socket timeout nsclient time_wait connections uptime

Re: more frequent false alarms in 2012R2.2

Post by slansing »

Sounds good, let us know how it goes. This sounds like it very well may be the resolution to this particular problem.

Re: more frequent false alarms in 2012R2.2

Post by KiwiBloke »

Hi,

I had a bunch of low-risk, non-platform servers that were showing the same issue and got approval to apply the hotfix and reboot.

I'm not able to say whether this has fixed the issue yet, as we will need to wait another 300 days :) but so far at least I have not seen any connections with status = TIME_WAIT backing up from requests from the nagiosxi server, and the log files are under control.

Cheers

Re: more frequent false alarms in 2012R2.2

Post by slansing »

Well.....alrighty! Let us know! You can take a look at NCPA in the interim! http://assets.nagios.com/downloads/ncpa ... g_NCPA.pdf