NagiosXI Socket Timeout Issue

sievers
Posts: 48
Joined: Tue May 24, 2011 7:34 am

NagiosXI Socket Timeout Issue

Post by sievers »

Hello,

We are having a very annoying problem: on random servers, random services are returning Socket Timeout errors. Here is what we have already tried:

- Updated NSClient on the server: no help
- Increased the timeout value to 40 seconds: no help (we even increased it to 100 seconds as a test, and it still timed out)
- Restarted the Nagios server: no help

There is no lag or latency between the Nagios server and the monitored servers, since they are all in the same ESX farm, in the same location and subnet. I triple-checked the firewall settings, and everything is OK.

See Screenshot
nagioseeror.PNG
The problem is that the service recovers for a few minutes and then it starts over again, as you can see from my mailbox below.
nagioseeror2.PNG
Can you please help me? Thanks.

Our version is Nagios XI 2012R1.6 on CentOS 6.2 64-bit.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: NagiosXI Socket Timeout Issue

Post by abrist »

Are you sure this server is working correctly? Is the NSClient service hanging or restarting? (Check the event logs.)
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
sievers
Posts: 48
Joined: Tue May 24, 2011 7:34 am

Re: NagiosXI Socket Timeout Issue

Post by sievers »

I checked both the server and NSClient multiple times. There are no errors, and I am using the very latest NSClient++ version. In addition, the whole setup was working fine for many months. It started with a single server and service. Now the error is widespread and affects most servers at random.

Do you need me to run any checks?

Also, it seems to be stuck on the same services per server, although each server has different services.

Also, can you tell me where to find the NSClient logs on Windows? With 4.1 on 64-bit they are not in the NSClient++ directory.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: NagiosXI Socket Timeout Issue

Post by abrist »

Let's check for a segfault with check_nt:

Code: Select all

grep seg /var/log/messages
sievers
Posts: 48
Joined: Tue May 24, 2011 7:34 am

Re: NagiosXI Socket Timeout Issue

Post by sievers »

Code: Select all

[root@pxdc001apnag1 ~]# grep seg /var/log/messages
Jan 13 18:07:31 pxdc001apnag1 kernel: PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
Jan 13 18:27:28 pxdc001apnag1 kernel: PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
[root@pxdc001apnag1 ~]#
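The two matches above are kernel PCI "segment" lines rather than crashes, because the pattern `seg` also matches "segment". A narrower search may be more useful. The sketch below assumes the kernel logs crashes with the word "segfault" and the name of the crashing process, as Linux kernels on x86 normally do; `count_segfaults` and `show_check_nt_segfaults` are hypothetical helper names, not part of Nagios.

```shell
#!/bin/sh
# Hypothetical helper: count kernel segfault lines in a log file.
# Matching the full word "segfault" skips unrelated lines such as
# "PCI: MCFG configuration 0: base e0000000 segment 0 ...".
count_segfaults() {
    log_file="${1:-/var/log/messages}"
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -ci 'segfault' "$log_file" || true
}

# Hypothetical helper: list only the segfault lines that mention check_nt.
show_check_nt_segfaults() {
    log_file="${1:-/var/log/messages}"
    grep -i 'segfault' "$log_file" | grep 'check_nt'
}
```

Calling `count_segfaults /var/log/messages` prints the number of matching lines; a result of 0 suggests the plugin is not crashing on the Nagios server.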
Here is my Nmap output for one of the target hosts:

Code: Select all

Starting Nmap 5.51 ( http://nmap.org ) at 2014-01-13 18:53 CET                  
Nmap scan report for 10.180.2.78                                                
Host is up (0.00025s latency).                                                  
Not shown: 990 closed ports                                                     
PORT      STATE SERVICE                                                         
21/tcp    open  ftp                                                             
80/tcp    open  http                                                            
135/tcp   open  msrpc                                                           
139/tcp   open  netbios-ssn                                                     
445/tcp   open  microsoft-ds                                                    
3389/tcp  open  ms-term-serv                                                    
5666/tcp  open  nrpe                                                            
49152/tcp open  unknown                                                         
49153/tcp open  unknown                                                         
49154/tcp open  unknown                                                         
MAC Address: 00:50:56:8C:00:19 (VMware)   
Currently this error is present on approx. 15 hosts. The service seems to flap between socket timeout and status OK every 60 - 300 seconds, back and forth.

Below you can see a live snapshot of three (out of 9) servers with the error. It seems to affect services at random. Sometimes it stays on the same service for hours before the error wanders to another service. Also, it never affects all services, usually only 1-3 services per host.
error1.PNG
error2.PNG
error3.PNG
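Because the failures come and go every 60 to 300 seconds, a single manual test can easily land in a good window. One rough way to catch the bad windows is to sample the agent port repeatedly and timestamp each result, so the failures can be lined up with the Nagios alerts. The sketch below assumes bash and coreutils `timeout` are installed on the Nagios server; `sample_port` is a hypothetical helper name, and the port is a parameter (12489 is the NSClient++ default for check_nt, 5666 for NRPE).

```shell
#!/bin/sh
# Hypothetical helper: try a TCP connect to host:port `count` times,
# `interval_s` seconds apart, printing a timestamped OK/FAIL line each time.
# Assumes bash (for /dev/tcp) and coreutils `timeout` are installed.
sample_port() {
    host="$1"
    port="$2"
    count="${3:-10}"
    interval_s="${4:-30}"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Give each connect attempt at most 5 seconds.
        if timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
            result="OK"
        else
            result="FAIL"
        fi
        echo "$(date '+%Y-%m-%d %H:%M:%S') $host:$port $result"
        i=$((i + 1))
        # Sleep between samples, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval_s"
        fi
    done
}
```

For example, `sample_port 10.180.2.78 12489 20 30` would take 20 samples, 30 seconds apart. A successful connect only shows that the listener is reachable; the check itself can still time out if the agent is slow to answer.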
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: NagiosXI Socket Timeout Issue

Post by abrist »

Are these checks using NRPE or check_nt?
sievers
Posts: 48
Joined: Tue May 24, 2011 7:34 am

Re: NagiosXI Socket Timeout Issue

Post by sievers »

It's 99% $USER1$/check_nt.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: NagiosXI Socket Timeout Issue

Post by abrist »

check_nt uses port 12489. I notice that the nmap scan you ran does not include this port...
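A default nmap scan covers only the 1,000 most common ports, so 12489 would not appear in the output above whether or not it is open. Naming the port explicitly, for example `nmap -p 12489 10.180.2.78`, settles the question. Where nmap is not available, a plain TCP connect test does the same job. The sketch below assumes bash and coreutils `timeout` are installed; `probe_port` and `report_port` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within timeout_s seconds. Relies on bash's /dev/tcp redirection
# and on coreutils `timeout`; both are assumed to be installed.
probe_port() {
    host="$1"
    port="$2"
    timeout_s="${3:-5}"
    timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Hypothetical helper: print a one-line verdict for a host and port.
report_port() {
    if probe_port "$1" "$2" "${3:-5}"; then
        echo "$1:$2 open"
    else
        echo "$1:$2 closed or filtered"
    fi
}
```

Running `report_port 10.180.2.78 12489` from the Nagios server against one failing host and one healthy host gives a quick comparison.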
sievers
Posts: 48
Joined: Tue May 24, 2011 7:34 am

Re: NagiosXI Socket Timeout Issue

Post by sievers »

True, but all my other 99+ servers work, and I just did a cross-check with Nmap on a few of them, and that port does not show up on any of those servers either. Also, if the port were closed, it would affect all check_nt services, right? Not just 1-3 of them.

Also, all Windows firewalls are completely disabled across the board.
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: NagiosXI Socket Timeout Issue

Post by slansing »

Hmm, it almost looks like you have a mix of NRPE and check_nt checks running. Can you show us the configurations for some of the checks that are timing out? Specifically the drive checks, or any checks that may be related to networked drives or hardware. To view the actual flat service configurations, navigate to the following:

Configure > CCM > Services > click the diskette icon next to the 'service name' of one of the services, and copy the output of that specific service configuration.