NagiosXI Socket Timeout Issue
Posted: Mon Jan 13, 2014 12:30 pm
by sievers
Hello,
we are having a very annoying problem. On random servers with random services we are receiving Socket Timeout errors. Here is what we have already tried as a remedy:
- Update the NSClient on the Server .......no help
- Increase the timeout value to 40 seconds.......no help (we even increased it to 100 seconds as a test, and it still timed out)
- Restart the Nagios Server.....No help
There is no lag or latency between the Nagios server and the monitored servers, since they are all in the same ESX farm in the same location and subnet. I triple-checked the firewall settings, and everything is OK.
See Screenshot
nagioseeror.PNG
The problem is that the service recovers for a few minutes and then the error starts over again, as you can see from my mailbox below
nagioseeror2.PNG
Can you please help me, thanks
Our Version is Nagios XI 2012R1.6 on Centos 6.2 64bit
Re: NagiosXI Socket Timeout Issue
Posted: Mon Jan 13, 2014 12:35 pm
by abrist
Are you sure this server is working correctly? Is the nsclient service hanging/restarting? (check the event logs)?
Re: NagiosXI Socket Timeout Issue
Posted: Mon Jan 13, 2014 12:48 pm
by sievers
I checked both the server and NSClient multiple times. There are no errors, and I am using the very latest NSClient++ version. In addition, the whole setup was working fine for many months. It started with a single server and service. Now the error is widespread and affects most servers at random.
Do you need me to run any checks?
Also, it seems to be stuck at the same services per server, although each server has different services.
Also, can you tell me where to find the NSClient logs on Windows? With 4.1 and 64 bit they are not in the NSClient++ directory.
Re: NagiosXI Socket Timeout Issue
Posted: Mon Jan 13, 2014 12:59 pm
by abrist
Let's check for a seg fault with check_nt:
Re: NagiosXI Socket Timeout Issue
Posted: Mon Jan 13, 2014 1:04 pm
by sievers
Code:
[root@pxdc001apnag1 ~]# grep seg /var/log/messages
Jan 13 18:07:31 pxdc001apnag1 kernel: PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
Jan 13 18:27:28 pxdc001apnag1 kernel: PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
[root@pxdc001apnag1 ~]#
Here is my NMAP output of one of the target hosts
Code:
Starting Nmap 5.51 ( http://nmap.org ) at 2014-01-13 18:53 CET
Nmap scan report for 10.180.2.78
Host is up (0.00025s latency).
Not shown: 990 closed ports
PORT STATE SERVICE
21/tcp open ftp
80/tcp open http
135/tcp open msrpc
139/tcp open netbios-ssn
445/tcp open microsoft-ds
3389/tcp open ms-term-serv
5666/tcp open nrpe
49152/tcp open unknown
49153/tcp open unknown
49154/tcp open unknown
MAC Address: 00:50:56:8C:00:19 (VMware)
Currently this error is present on approx. 15 hosts. The Service seems to flap from socket timeout to Status ok every 60 - 300 seconds, back and forth
Below you can see a live snapshot of three (out of 9) servers with the error. It seems to affect services at random. Sometimes it stays the same service for hours before the error then wanders to another service. Also, it never affects all services, usually only 1-3 services per host
error1.PNG
error2.PNG
error3.PNG
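Since the failures described above come and go every few minutes, a single manual test can easily miss them. The following is a minimal sketch, not an official Nagios or NSClient++ tool, that probes a TCP port repeatedly and records how long each attempt takes. The function names `probe_once`, `probe_repeatedly`, and `summarize` are made up for this illustration, and the default count, interval, and timeout are arbitrary assumptions.

```python
import socket
import time


def probe_once(host, port, timeout=5.0):
    """Attempt one TCP connect and return (ok, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Refused connections, unreachable hosts, and timeouts all land here.
        ok = False
    return ok, time.monotonic() - start


def probe_repeatedly(host, port, count=10, interval=1.0, timeout=5.0):
    """Probe host:port `count` times and return a list of (ok, elapsed)."""
    results = []
    for i in range(count):
        results.append(probe_once(host, port, timeout))
        if i < count - 1:
            time.sleep(interval)  # wait between attempts, not after the last
    return results


def summarize(results):
    """Return (number_of_failures, slowest_elapsed_seconds)."""
    failures = sum(1 for ok, _ in results if not ok)
    slowest = max((elapsed for _, elapsed in results), default=0.0)
    return failures, slowest
```

For example, `summarize(probe_repeatedly("10.180.2.78", 12489, count=60, interval=5.0))` would watch one of the affected hosts for about five minutes. Note that this only tests whether the TCP connection is accepted; an agent that accepts the connection but then answers slowly would still pass, so a clean result here does not rule out a problem inside the agent.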
Re: NagiosXI Socket Timeout Issue
Posted: Mon Jan 13, 2014 1:21 pm
by abrist
Are these checks using nrpe or check_nt?
Re: NagiosXI Socket Timeout Issue
Posted: Mon Jan 13, 2014 1:22 pm
by sievers
It's 99% $USER1$/check_nt
Re: NagiosXI Socket Timeout Issue
Posted: Mon Jan 13, 2014 1:28 pm
by abrist
check_nt uses port 12489 - I notice that the nmap you ran does not include this port . . . .
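One caveat on reading the scan: by default nmap only scans a list of its most common ports (the output above accounts for 1,000 of them), so a port missing from the list may simply not have been scanned. Scanning it explicitly with `nmap -p 12489 <host>` is one way to check. As an alternative, here is a minimal sketch of a direct TCP connect test run from the Nagios server. The helper name `tcp_port_open` and the 5-second default timeout are assumptions for this illustration, not part of any Nagios tooling.

```python
import socket

NSCLIENT_PORT = 12489  # the check_nt port mentioned in this thread


def tcp_port_open(host, port=NSCLIENT_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Calling `tcp_port_open("10.180.2.78")` from the Nagios server would show whether the port accepts connections at that moment. It says nothing about whether the agent then answers the check itself in time.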
Re: NagiosXI Socket Timeout Issue
Posted: Mon Jan 13, 2014 1:33 pm
by sievers
True, but all my other 99+ servers work, and I just did a cross-check with nmap on a few of them, and that port does not show up on any of those servers either. Also, if the port were closed, it would affect all check_nt services, right? Not just 1-3 of them.
Also, all Windows Firewalls are completely disabled across the board
Re: NagiosXI Socket Timeout Issue
Posted: Mon Jan 13, 2014 1:36 pm
by slansing
Hmm, it almost looks like you have a mix of NRPE and check_nt checks running. Can you show us the configurations for some of the checks that are timing out? Specifically the drive checks, or any checks that may be related to networked drives/hardware. To view the actual flat service configurations, navigate to the following:
Configure > CCM > Services > click the diskette icon next to the 'service name' of one of the services, and copy the output of that specific service configuration