Recently, our corporate Nagios server and monitoring tool have been showing the following problems:
1) The Nagios service on the Nagios server stops several times in a row. This in turn brings the Nagios monitoring tool down, and the page (live status) goes blank.
I have to restart the service manually to clear the problem. But what would cause the service to stop several times within the same week or two?
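To narrow this down, it may help to verify the configuration and look at what Nagios logged just before each stop. The sketch below assumes a source install under /usr/local/nagios (the prefix used elsewhere in this post) with nagios.cfg and var/nagios.log in their default places. `show_shutdowns` is a hypothetical helper name, and the grep pattern is only a guess at the relevant messages.

```shell
#!/usr/bin/env bash
# Sketch only: the paths below are assumptions based on a default source install.
NAGIOS_BIN=/usr/local/nagios/bin/nagios
NAGIOS_CFG=/usr/local/nagios/etc/nagios.cfg
NAGIOS_LOG=/usr/local/nagios/var/nagios.log

# Print log lines that mention a signal or a shutdown, so the times of the
# stops can be compared with cron jobs, config reloads, or resource problems.
show_shutdowns() {
    local logfile="$1"
    grep -iE 'caught sig|shutting down|shutdown' "$logfile"
}

# Usage (run manually on the Nagios server):
#   "$NAGIOS_BIN" -v "$NAGIOS_CFG"     # verify the configuration
#   show_shutdowns "$NAGIOS_LOG"       # list signal/shutdown messages
```

If the log shows nothing around the time of a stop, the system log on the Nagios server would be the next place to look.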
2) The monitoring tool displays a lot of "Service is stale" errors, where all services for a particular host are greyed out (stale).
The full error is "This service is stale, no data has been received within the last 1.5 check periods".
For some servers, I have tried logging on and restarting the xinetd/inetd service, and this resolves the problem. But not for all servers.
Some servers still display this error even after I restart the inetd service. Is it common for services to become stale when xinetd/inetd
on the server has not been restarted for a long time?
3) Many hosts show a "WARNING: unchecked services" error.
An example of the full error is "WARN - 21 unchecked services (hpux_cpu:1, hpux_tunables.shmseg:1, hpux_tunables.semmns:1, df:12, hpux_tunables.maxfiles_lim:1, ntp.time:1, hpux_tunables.nkthread:1, hpux_tunables.nproc:1, tcp_conn_stats:1, hpux_tunables.semmni:1)".
There are two areas where I troubleshoot to resolve this error:
A) Server terminal:
From the terminal, these are the commands I run to resolve this problem:
# check_mk --check-inventory HOST (This displays the "WARNING: unchecked services" error for the host.)
# cmk -I HOST (This resolves the error from the terminal's point of view.)
# cmk --no-cache -D HOST (This displays all the services Nagios monitors for this host, with the value/threshold for each service.)
B) Monitoring tool:
1) Click on the "Edit services" icon for the 'Check_MK Discovery Service'.
2) Click on "Activate missing services", "Save manual check configuration", and "Automatic Refresh".
3) Go to the Main Directory page and choose "Activate Changes"
Then, for these changes to take effect, I have to do the following on the Nagios server:
4) Copy /usr/local/nagios/etc/objects/check_mk_objects.cfg into /usr/local/nagios/etc/check_mk.d/:
# cp -p /usr/local/nagios/etc/objects/check_mk_objects.cfg /usr/local/nagios/etc/check_mk.d/
5) Restart Nagios with the updated configuration:
# cmk -R
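Since the same terminal commands are repeated for every affected host, they could be wrapped in a small function. This is only a sketch built from the commands already listed in this post; `rediscover_hosts` is a hypothetical name, and the choice to skip the restart when any inventory fails is my own assumption about what is safest.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the commands from this post, no new cmk options.

# Re-inventory each host given as an argument, then restart Nagios once.
rediscover_hosts() {
    local host failed=0
    for host in "$@"; do
        echo "== $host =="
        cmk --check-inventory "$host"      # show the unchecked services first
        if ! cmk -I "$host"; then          # add the missing services
            echo "inventory failed for $host" >&2
            failed=1
        fi
    done
    # Restart only if every inventory succeeded, so that a half-updated
    # configuration is not activated.
    if [ "$failed" -eq 0 ]; then
        cmk -R
    fi
    return "$failed"
}

# Usage: rediscover_hosts HOST1 HOST2
```

Restarting once at the end, instead of after each host, keeps the number of Nagios restarts down.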
4) Sometimes when I click on the "Edit Services" icon for the problem server, I get the error "Service discovery failed for this host: Cannot
get data from TCP port 10.61.X.X:6556: timed out".
I am not sure whether this is because the xinetd service needs a restart or because port 6556 is not open on the server. But in most cases,
port 6556 is already open. Therefore there is probably a problem with the check_mk agent on the server or with the xinetd service itself. (Just my assumption.)
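To tell a closed port apart from an agent that accepts the connection but sends nothing, the port can be probed from the Nagios server. The sketch below assumes bash (for /dev/tcp) and the coreutils `timeout` command are available there, and that a working agent starts its output with a `<<<check_mk>>>` section header. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: needs bash (for /dev/tcp) and coreutils `timeout`.

# Read the first line the agent sends on the given host and port.
agent_first_line() {
    local host="$1" port="${2:-6556}"
    timeout 10 bash -c 'exec 3<>"/dev/tcp/$1/$2" && head -n 1 <&3' _ "$host" "$port"
}

# Assumption: a healthy agent starts its output with this section header.
is_agent_banner() {
    [ "$1" = "<<<check_mk>>>" ]
}

# Report whether the agent answered, stayed silent, or sent something else.
probe_agent() {
    local line
    line=$(agent_first_line "$@" 2>/dev/null)
    if [ -z "$line" ]; then
        echo "no data received (connection refused, filtered, or agent silent)"
        return 1
    elif is_agent_banner "$line"; then
        echo "agent answered"
    else
        echo "unexpected first line: $line"
        return 2
    fi
}

# Usage: probe_agent 10.61.X.X        (port defaults to 6556)
```

If the probe reports no data while the port is known to be open, that would point towards xinetd/inetd or the agent script on the monitored server rather than the network.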
5) The error "(No output on stdout) stderr: Traceback (most recent call last)" for the "PING" service is also very common in our monitoring tool.
What is the cause of this error, and how do I resolve it?
Please suggest the best steps to stop these errors once and for all and to prevent them from recurring.