Nagios service stopping very often

anaigini45
Posts: 4
Joined: Wed Oct 24, 2012 2:17 am

Post by anaigini45 »

Our Nagios server is integrated with the check_mk monitoring tool, which we use to monitor all our corporate servers.
Recently, the Nagios server and the monitoring tool have been showing the following problems:

1) The Nagios service on the Nagios server stops several times in a row. This in turn brings the monitoring tool down, and the page goes blank (live status).
I have to restart the service manually to clear the problem. But what would cause the service to stop several times within the same week or two?

2) The monitoring tool displays many "Service is stale" errors, where all services for a particular host are greyed out (stale).
The full error is "This service is stale, no data has been received within the last 1.5 check periods".
For some servers, logging on and restarting the xinetd/inetd service resolves the problem, but not for all of them.
Some servers still display this error even after I restart the inetd service. Is it common for services to become stale when xinetd/inetd on the server has not been restarted for a long time?
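
For reference, the "1.5 check periods" in that message can be read as a simple age test: a service counts as stale once the time since its last check result exceeds 1.5 times its check interval. A minimal sketch of that arithmetic, assuming times in epoch seconds and a hypothetical helper name `is_stale` (an illustration only, not check_mk's own code):

```shell
# is_stale LAST_CHECK NOW INTERVAL
# Succeeds (exit 0) when NOW - LAST_CHECK is more than 1.5 * INTERVAL.
# All values are whole seconds; both sides are scaled by 2 to stay in integers.
is_stale() {
    age=$(( $2 - $1 ))
    [ $(( age * 2 )) -gt $(( $3 * 3 )) ]
}

# Example: a 60-second check whose last result is 100 seconds old
# is_stale 1000 1100 60 && echo "stale"
```

With a 60-second interval the threshold is 90 seconds, so a result that is 100 seconds old counts as stale and one that is 80 seconds old does not.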

3) Many hosts show a "WARNING: unchecked services" error.
An example of the full error is "WARN - 21 unchecked services (hpux_cpu:1, hpux_tunables.shmseg:1, hpux_tunables.semmns:1, df:12, hpux_tunables.maxfiles_lim:1, ntp.time:1, hpux_tunables.nkthread:1, hpux_tunables.nproc:1, tcp_conn_stats:1, hpux_tunables.semmni:1)".
There are two areas I troubleshoot to resolve this error:
A) Server terminal:
These are the commands I run from the terminal to resolve the problem:

Code:

# check_mk --check-inventory HOST    # displays the "WARNING: unchecked services" error
# cmk -I HOST                        # resolves the error from the terminal's point of view
# cmk --no-cache -D HOST             # displays all services Nagios monitors for this host, with the value/threshold for each

However, the steps above do not clear the error shown in the monitoring tool, so the other area to troubleshoot is the monitoring tool itself:
B) Monitoring tool :
1) Click on the "Edit services" icon for the 'Check_MK Discovery Service'.
2) Click on "Activate missing services", "Save manual check configuration" and "Automatic Refresh"
3) Go to the Main Directory page and choose "Activate Changes"
Then, for these changes to take effect, I have to do the following on the Nagios server:
4) Copy /usr/local/nagios/etc/objects/check_mk_objects.cfg into /usr/local/nagios/etc/check_mk.d/

Code:

# cp -p /usr/local/nagios/etc/objects/check_mk_objects.cfg /usr/local/nagios/etc/check_mk.d/
5) Run cmk -R to regenerate the configuration and restart Nagios

Code:

# cmk -R
This resolves the problem only some of the time. What is the correct way to resolve this error?

4) Sometimes when I click the "Edit Services" icon for the problem server, I get the error "Service discovery failed for this host: Cannot get data from TCP port 10.61.X.X:6556: timed out".
I am not sure whether this happens because the xinetd service needs a restart or because port 6556 is not open on the server. In most cases, however, port 6556 is already open, so the problem is probably with the check_mk agent on the server or with the xinetd service itself (just my assumption).
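
One way to narrow this down is to fetch the agent output by hand from the Nagios server and see whether anything comes back within a few seconds. The sketch below assumes bash and the coreutils `timeout` command are available on the Nagios server, and that the agent's output begins with a `<<<check_mk>>>` header line. The function names are made up for illustration.

```shell
# fetch_agent_output HOST [PORT] [SECONDS]
# Connects to the agent port and prints whatever it sends, giving up after
# SECONDS (default 10). Uses bash's /dev/tcp, so bash is invoked explicitly.
fetch_agent_output() {
    timeout "${3:-10}" bash -c 'cat < "/dev/tcp/$1/$2"' _ "$1" "${2:-6556}"
}

# looks_like_agent_output
# Reads stdin and succeeds only if the first line is the expected header.
looks_like_agent_output() {
    first_line=$(head -n 1)
    [ "$first_line" = '<<<check_mk>>>' ]
}

# Example (HOST is a placeholder):
# fetch_agent_output HOST | looks_like_agent_output && echo "agent answers"
```

An immediate "connection refused" suggests nothing is listening on the port, while a connection that opens but produces no output before the timeout suggests the agent itself is hanging.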

5) The error "(No output on stdout) stderr: Traceback (most recent call last)" for the "PING" service is also very common in our monitoring tool.
What causes this error, and how do I resolve it?

Please suggest the best steps to stop these errors once and for all and to prevent future occurrences.
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Nagios service stopping very often

Post by ssax »

While I can't help you with support for check_mk, we can take a look at why the nagios service is stopping multiple times.

Are you seeing anything related in your /usr/local/nagios/var/nagios.log, /var/log/messages, /var/log/httpd/error_log, or /var/log/httpd/ssl_error_log?
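
To make that search quicker, something like the following could pull out lines that hint at a shutdown or crash. The pattern list is only a guess at what might be relevant (signal handling, shutdown, segfault, and out-of-memory messages), and `find_stop_events` is a hypothetical helper name.

```shell
# find_stop_events LOGFILE
# Prints lines that mention signals, shutdowns, segfaults, or the OOM killer.
# The patterns are assumptions; widen or narrow them to match your logs.
find_stop_events() {
    grep -iE 'caught sig|shutting down|segfault|out of memory' "$1"
}

# Example, using log paths mentioned above:
# find_stop_events /usr/local/nagios/var/nagios.log
# find_stop_events /var/log/messages
```

Comparing the timestamps of any matches against the times the service stopped should show whether the two are related.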

Please enable debug logging in your /usr/local/nagios/etc/nagios.cfg, let it run until the problem occurs, and then look in the /usr/local/nagios/var/nagios.debug log:

Code:

debug_file=/usr/local/nagios/var/nagios.debug
debug_level=-1
debug_verbosity=2
NOTE: The above changes will increase the load on the system and write a lot of data to the nagios.debug file, so keep an eye on it to make sure it doesn't impact your monitoring.
Then restart the nagios service:

Code:

service nagios restart
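
Since the debug file can grow quickly, a small size check run by hand or from cron may help. This is a rough sketch: `debug_log_too_big` is a hypothetical helper, and the 500 MB limit in the example is an arbitrary assumption.

```shell
# debug_log_too_big FILE MAX_BYTES
# Succeeds (exit 0) when FILE exists and is larger than MAX_BYTES.
debug_log_too_big() {
    [ -f "$1" ] || return 1
    # Arithmetic expansion strips any padding that wc adds on some systems.
    size=$(( $(wc -c < "$1") ))
    [ "$size" -gt "$2" ]
}

# Example:
# debug_log_too_big /usr/local/nagios/var/nagios.debug 524288000 && echo "nagios.debug is over 500 MB"
```

Nagios Core also has a `max_debug_file_size` option in nagios.cfg that caps the file; check your version's sample config for its default.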
Also, please send me the output of this command:

Code:

/usr/local/nagios/bin/nagios -V
and if you're running NDO2DB, include the output of these commands as well:

Code:

/usr/local/nagios/bin/ndo2db -V
ipcs -q
ps aux | grep ndo
Thank you