NagiosXI Socket Timeout Issue
Hello,
We are having a very annoying problem. On random servers, with random services, we are receiving Socket Timeout errors. Here is what we have already tried to remedy it:
- Updated NSClient on the server: no help
- Increased the timeout value to 40 seconds: no help (we even increased it to 100 seconds as a test, and it still timed out)
- Restarted the Nagios server: no help
There is no lag or latency between the Nagios server and the monitored servers, since they are all in the same ESX farm, in the same location and subnet. I triple-checked the firewall settings, and everything is OK.
See the screenshot. The problem is that the service recovers for a few minutes and then the error starts over again, as you can see from my mailbox below. Can you please help me? Thanks.
Our version is Nagios XI 2012R1.6 on CentOS 6.2 64-bit.
Re: NagiosXI Socket Timeout Issue
Are you sure this server is working correctly? Is the NSClient service hanging or restarting? (Check the event logs.)
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: NagiosXI Socket Timeout Issue
I checked both the server and NSClient multiple times. There are no errors, and I am using the very latest NSClient++ version. In addition, the whole setup was working fine for many months. It started with a single server and service. Now the error is widespread and affects most servers at random.
Do you need me to run any checks?
Also, it seems to be stuck on the same services per server, although each server has different services.
Also, can you tell me where to find the NSClient logs in Windows? With 4.1 and 64-bit they are not in the NSClient++ directory.
Re: NagiosXI Socket Timeout Issue
Let's check for a segfault with check_nt:
Code:
grep seg /var/log/messages
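One caveat with that pattern: `seg` also matches unrelated words such as "segment", so the output can contain false hits. Below is a minimal Python sketch of a stricter filter. It assumes the kernel reports crashes with the word "segfault", as Linux kernels usually do, and the helper name `find_segfaults` is made up for illustration.

```python
import re

# Assumption: crash lines contain the whole word "segfault",
# e.g. "kernel: check_nt[4242]: segfault at 0 ip ...".
SEGFAULT_RE = re.compile(r"\bsegfault\b", re.IGNORECASE)


def find_segfaults(lines):
    """Return only the log lines that mention a segfault."""
    return [line.rstrip("\n") for line in lines if SEGFAULT_RE.search(line)]
```

To run it against the log from the command above, pass an open file, for example `find_segfaults(open("/var/log/messages", errors="replace"))`. On the shell, `grep -i segfault /var/log/messages` should give much the same result.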
Re: NagiosXI Socket Timeout Issue
Code:
[root@pxdc001apnag1 ~]# grep seg /var/log/messages
Jan 13 18:07:31 pxdc001apnag1 kernel: PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
Jan 13 18:27:28 pxdc001apnag1 kernel: PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
[root@pxdc001apnag1 ~]#
Code:
Starting Nmap 5.51 ( http://nmap.org ) at 2014-01-13 18:53 CET
Nmap scan report for 10.180.2.78
Host is up (0.00025s latency).
Not shown: 990 closed ports
PORT STATE SERVICE
21/tcp open ftp
80/tcp open http
135/tcp open msrpc
139/tcp open netbios-ssn
445/tcp open microsoft-ds
3389/tcp open ms-term-serv
5666/tcp open nrpe
49152/tcp open unknown
49153/tcp open unknown
49154/tcp open unknown
MAC Address: 00:50:56:8C:00:19 (VMware)
Below you can see a live snapshot of three (out of nine) servers with the error. It seems to affect services at random. Sometimes it stays on the same service for hours before the error wanders to another service. Also, it never affects all services, usually only 1-3 services per host.
Re: NagiosXI Socket Timeout Issue
Are these checks using nrpe or check_nt?
Re: NagiosXI Socket Timeout Issue
It's 99% $USER1$/check_nt.
Re: NagiosXI Socket Timeout Issue
check_nt uses port 12489 - I notice that the nmap you ran does not include this port . . . .
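By default nmap scans only its list of the 1000 most common ports, so 12489 may simply not have been probed (`nmap -p 12489 <host>` asks for it explicitly). A plain TCP connect from the Nagios server tests that one port directly. What follows is a rough Python sketch, not a Nagios tool: `probe_port` and `probe_repeatedly` are hypothetical helper names, and the timeout, attempt count, and pause are arbitrary values. It probes several times because the timeouts reported here come and go.

```python
import socket
import time


def probe_port(host, port, timeout=5.0):
    """Try one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # refused, unreachable, or timed-out connects
        return False, time.monotonic() - start, str(exc)


def probe_repeatedly(host, port, attempts=10, pause=1.0, timeout=5.0):
    """Probe several times to catch failures that only show up now and then."""
    results = []
    for attempt in range(attempts):
        results.append(probe_port(host, port, timeout))
        if attempt < attempts - 1:
            time.sleep(pause)  # wait between attempts, not after the last one
    return results
```

For example, `probe_repeatedly("10.180.2.78", 12489)` returns one `(ok, seconds, error)` tuple per attempt for the host from the nmap output. A mix of fast successes and slow failures would point at the agent or the host rather than at a closed port.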
Re: NagiosXI Socket Timeout Issue
True, but all my other 99+ servers work. I just cross-checked a few of them with nmap, and that port does not show up on any of them either. Also, if the port were closed, it would affect all check_nt services, right? Not just 1-3.
Also, all Windows Firewalls are completely disabled across the board.
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: NagiosXI Socket Timeout Issue
Hmm, it almost looks like you have a mix of NRPE and check_nt checks running. Can you show us the configurations for some of the checks that are timing out? Specifically the drive checks, or any checks that may be related to networked drives/hardware. To view the actual flat service configurations, navigate to the following:
Configure > CCM > Services > click the diskette icon next to the 'service name' of one of the services, and copy the output of that specific service configuration (the block between { and }).