We are facing a problem with a Nagios implementation that is almost a year old. To try to fix it, we upgraded the system to Nagios XI 2012R2.9, but the problem persists.
Problem description: We have a total of around 256 hosts configured for monitoring. Some of them are agent-based (Windows and Linux) and some are monitored using SNMP. We intermittently see the following error on Windows and Linux hosts, which in turn produces a second error in the Nagios monitoring console. Both errors are listed below for your reference.
The first error is "Critical - [IP Address] RTA NAN Lost 100%". It sometimes disappears after a short while and sometimes takes much longer. While Nagios shows this state, the machine is fully reachable from the network as well as from the Nagios host console.
The second error, which we receive regularly, is "CRITICAL Socket Timeout after 10 Seconds".
Note: I would like to stress again that the errors are intermittent and that the machines are reachable from the network and have no problems. The screenshots above are just some examples; we are facing the same problem with almost all the servers.
RTA NAN Host 100% & Socket Timeout Error
Re: RTA NAN Host 100% & Socket Timeout Error
Some Additional Details
1) It is a virtual machine.
2) It is a dedicated machine for Nagios.
3) It happens randomly, on different machines at different times, but affects most machines at one time or another.
4) The problem started in March 2014.
5) The problem lasts for a few minutes, sometimes longer. The duration is not fixed.
6) We have taken many steps to fix the problem: upgrading Nagios to the newer version, and changing the commands used for the ping check and Check_host_alive, along with their arguments, as suggested earlier. The problem remains.
We also tried moving the VM to a different physical host and rebooting the switches. Just to highlight, there is no firewall between the Nagios server and the monitored hosts.
7) The systems are pingable from Nagios as well as from other devices on the network.
Re: RTA NAN Host 100% & Socket Timeout Error
Are you currently using Nagios XI 2012R2.9? What is the load on the server (especially during the times when you get the "Socket Timeout" errors)?
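To correlate load with the timeouts, it can help to record the load average at regular intervals and compare the timestamps with the alerts. The following is only a rough sketch, assuming a Linux host where /proc/loadavg is readable; the function name log_load and the log path in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Sketch: print one line with a timestamp and the current load averages.
# Assumes Linux (/proc/loadavg); falls back to plain `uptime` otherwise.
log_load() {
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    if [ -r /proc/loadavg ]; then
        # First three fields are the 1, 5 and 15 minute load averages.
        printf '%s %s\n' "$ts" "$(cut -d' ' -f1-3 /proc/loadavg)"
    else
        printf '%s %s\n' "$ts" "$(uptime)"
    fi
}

# Example usage (hypothetical log path), sampling once a minute:
#   while true; do log_load >> /tmp/nagios_load.log; sleep 60; done
```

The resulting log can then be lined up against the times at which the "Socket Timeout" alerts appear.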
Can you show us a sample command that you are running from the CLI, along with the output of it?
Also, run the following commands, and show us the output:
/usr/local/nagios/libexec/check_ping -V
/usr/local/nagios/libexec/check_icmp -V
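For the sample CLI run, something along these lines may be a useful starting point while a host is showing "RTA NAN Lost 100%". This is only a sketch: the plugin directory is the one used in the commands above, while the helper name manual_check, the target address 192.0.2.10, and the threshold values are placeholders rather than settings taken from this installation.

```shell
#!/bin/sh
# Sketch: run a Nagios plugin by hand and report its exit code.
# PLUGIN_DIR is the path from the thread; override it if yours differs.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/local/nagios/libexec}"

manual_check() {
    plugin="$PLUGIN_DIR/$1"
    if [ ! -x "$plugin" ]; then
        echo "plugin not found: $plugin"
        return 3   # 3 is the Nagios UNKNOWN state
    fi
    shift
    # Run the plugin with the remaining arguments and keep its exit code
    # (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
    "$plugin" "$@"
    rc=$?
    echo "exit code: $rc"
    return $rc
}

# Example usage (placeholder host and thresholds):
#   manual_check check_icmp -H 192.0.2.10 -n 5 -t 10
#   manual_check check_ping -H 192.0.2.10 -w 100.0,20% -c 500.0,60% -p 5 -t 10
```

Running this as the same user that Nagios runs as, at the moment the console shows the error, should indicate whether the plugin itself fails or only the scheduled check does.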