NagiosXI Socket Timeout Issue

sievers · Post by **sievers** » Mon Jan 13, 2014 12:30 pm

Hello,

We are having a very annoying problem. On random servers with random services we are receiving socket timeout errors. Here is what we have already tried to remedy it:

- Update the NSClient on the server ... no help
- Increase the timeout value to 40 seconds ... no help (we even increased it to 100 seconds as a test, and it still timed out)
- Restart the Nagios server ... no help

There is no lag or latency between the Nagios server and the monitored servers, since they are all in the same ESX farm, in the same location and subnet. I triple-checked the firewall settings, and everything is OK.

See Screenshot

nagioseeror.PNG

The problem is that the service recovers for a few minutes and then it starts over again, as you can see from my mailbox below:

nagioseeror2.PNG

Can you please help me? Thanks.

Our version is Nagios XI 2012R1.6 on CentOS 6.2 64-bit.

abrist · Post by **abrist** » Mon Jan 13, 2014 12:35 pm

Are you sure this server is working correctly? Is the nsclient service hanging/restarting? (check the event logs)?

sievers · Post by **sievers** » Mon Jan 13, 2014 12:48 pm

I checked both the server and NSClient multiple times. There are no errors, and I am using the very latest NSClient++ version. In addition, the whole setup was working fine for many months. It started with a single server and service. Now the error is widespread and affects most servers at random.

Do you need me to run any checks?

Also, it seems to be stuck on the same services per server, although each server has different services.

Also, can you tell me where to find the NSClient++ logs on Windows? With 4.1 and 64-bit they are not in the NSClient++ directory.

abrist · Post by **abrist** » Mon Jan 13, 2014 12:59 pm

Let's check for a segfault with check_nt:

Code:

grep seg /var/log/messages

sievers · Post by **sievers** » Mon Jan 13, 2014 1:04 pm

Code:

[root@pxdc001apnag1 ~]# grep seg /var/log/messages
Jan 13 18:07:31 pxdc001apnag1 kernel: PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
Jan 13 18:27:28 pxdc001apnag1 kernel: PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
[root@pxdc001apnag1 ~]#
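
Side note: the only hits for `seg` here are PCI "segment" lines, not segmentation faults, because the pattern is quite broad. A narrower filter could look roughly like the Python sketch below. It assumes that kernel crash reports contain the word "segfault", which is the usual wording on Linux. `segfault_lines` and `scan_log` are hypothetical helper names, not part of Nagios.

```python
import re

# Assumption: kernel crash reports contain the literal word "segfault".
# Lines that merely contain "segment" (like the PCI lines above) are skipped.
SEGFAULT = re.compile(r"\bsegfault\b", re.IGNORECASE)


def segfault_lines(lines):
    """Return only the log lines that mention a segfault."""
    return [line.rstrip("\n") for line in lines if SEGFAULT.search(line)]


def scan_log(path="/var/log/messages"):
    """Read a syslog-style file and return its segfault lines."""
    with open(path, errors="replace") as handle:
        return segfault_lines(handle)
```

Calling `scan_log()` as root on the Nagios server returns a list of matching lines; an empty list would suggest the plugin is not crashing.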

Here is the Nmap output for one of the target hosts:

Code:

Starting Nmap 5.51 ( http://nmap.org ) at 2014-01-13 18:53 CET                  
Nmap scan report for 10.180.2.78                                                
Host is up (0.00025s latency).                                                  
Not shown: 990 closed ports                                                     
PORT      STATE SERVICE                                                         
21/tcp    open  ftp                                                             
80/tcp    open  http                                                            
135/tcp   open  msrpc                                                           
139/tcp   open  netbios-ssn                                                     
445/tcp   open  microsoft-ds                                                    
3389/tcp  open  ms-term-serv                                                    
5666/tcp  open  nrpe                                                            
49152/tcp open  unknown                                                         
49153/tcp open  unknown                                                         
49154/tcp open  unknown                                                         
MAC Address: 00:50:56:8C:00:19 (VMware)

Currently this error is present on approx. 15 hosts. The service seems to flap between socket timeout and status OK every 60 - 300 seconds, back and forth.

Below you can see a live snapshot of three (out of 9) servers with the error. It seems to affect services at random. Sometimes it stays on the same service for hours before the error wanders to another service. Also, it never affects all services, usually only 1-3 services per host.

error1.PNG

error2.PNG

error3.PNG
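
Since the errors come and go every few minutes, a single manual test can easily miss them. One possible way to catch the pattern is to sample the connection repeatedly and summarize the results, as in the rough Python sketch below. It only times the TCP handshake, not the NSClient++ reply, so it is an approximation of what check_nt does. The function names, the sampling interval and the default port 12489 (the check_nt port mentioned in this thread) are illustrative assumptions.

```python
import socket
import time


def timed_connect(host, port, timeout):
    """Attempt one TCP connection; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        ok = False
    return ok, time.monotonic() - start


def sample_connects(host, port=12489, count=10, interval=30.0,
                    timeout=10.0, sleep=time.sleep):
    """Run `count` connection attempts, pausing `interval` seconds between them."""
    results = []
    for i in range(count):
        results.append(timed_connect(host, port, timeout))
        if i < count - 1:
            sleep(interval)
    return results


def summarize(results):
    """Condense the samples into a small report."""
    failures = sum(1 for ok, _ in results if not ok)
    slowest = max((elapsed for _, elapsed in results), default=0.0)
    return {"samples": len(results), "failures": failures,
            "slowest_seconds": slowest}
```

For example, `summarize(sample_connects("10.180.2.78"))` run from the Nagios server would show whether any handshakes fail or slow down during the sampling window.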

abrist · Post by **abrist** » Mon Jan 13, 2014 1:21 pm

Are these checks using nrpe or check_nt?

sievers · Post by **sievers** » Mon Jan 13, 2014 1:22 pm

It's 99% $USER1$/check_nt.

abrist · Post by **abrist** » Mon Jan 13, 2014 1:28 pm

check_nt uses port 12489. I notice that the Nmap scan you ran does not include this port...
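
A default Nmap scan only covers the 1000 most common ports, so a port missing from that output is not necessarily closed. To test port 12489 directly from the Nagios server, something like the following Python sketch could be used. `probe_tcp` is a hypothetical helper name and the 5 second timeout is an arbitrary choice.

```python
import socket


def probe_tcp(host, port=12489, timeout=5.0):
    """Try one TCP connection and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within `timeout` seconds (e.g. filtered or overloaded).
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "closed"
    except OSError as exc:
        # Anything else, such as an unreachable network.
        return "error: %s" % exc
```

For example, `probe_tcp("10.180.2.78")` checks the host from the scan above. An "open" result only shows that something accepts connections on that port, not that NSClient++ answers the check in time.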

sievers · Post by **sievers** » Mon Jan 13, 2014 1:33 pm

True, but all my other 99+ servers work. I just cross-checked a few of them with Nmap, and that port does not show up on any of those servers either. Also, if the port were closed, it would affect all check_nt services, right? Not just 1-3 of them.

Also, all Windows firewalls are completely disabled across the board.

slansing · Post by **slansing** » Mon Jan 13, 2014 1:36 pm

Hmm, it almost looks like you have a mix of NRPE and check_nt checks running. Can you show us the configurations for some of the checks that are timing out? Specifically the drive checks, or any checks that may be related to networked drives/hardware. To view the actual flat service configurations, navigate to the following:

Configure > CCM > Services > click the diskette icon next to the 'service name' of one of the services, and copy the output of that specific service configuration.

Nagios Support Forum
