avandemore wrote:Is your system altered in any way?
It has been altered, but most (if not all) alterations have been based on the Nagios XI tutorials. It was originally built using Nagios XI 2011, I believe, and has been upgraded several times since then. I have an upgrade to 5.3.2 scheduled for tonight.
avandemore wrote:What are the contents of /usr/local/nagiosxi/scripts/nom_create_nagioscore_checkpoint_cond.sh?
Code: Select all
#!/bin/bash
# Create a conditional NOM checkpoint
# Copyright (c) 2008-2015 Nagios Enterprises, LLC. All rights reserved.
# $Id$
scriptsdir=/usr/local/nagiosxi/scripts
/etc/init.d/nagios checkconfig
ret=$?
if [ $ret -eq 0 ]; then
    pushd $scriptsdir
    ./nom_create_nagioscore_checkpoint.sh
    popd
    echo "Config test passed. Checkpoint created."
    exit 0
else
    echo "Config test failed. Checkpoint aborted."
    exit 1
fi
avandemore wrote:Can you specify exactly how you know this is causing the nagios service to restart? Do you have logs or anything?
Any time we apply a new configuration from the Nagios XI CCM, the Nagios XI web GUI takes several minutes (roughly 3-5) to recover before it becomes usable again and returns accurate data. During this period, users see the following symptoms at various stages of the recovery:
- Nagios XI shows no hosts and no services
- Nagios XI fails to recognize acknowledgements and scheduled downtimes, which causes hosts/services that have been ACKed or DTed to show up on the Operations Center component (used by our NOC)
While nom_checkpoint_interval was temporarily set to 360, we observed this same behavior at 6-hour intervals, but I did not check the logs to trace exactly what happened. Since you both believe that KB article is unrelated, I will leave nom_checkpoint_interval at the default of 1440. If you like, though, I could set it back to 360 to investigate what was causing the web GUI behavior that appeared to indicate service restarts.
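To confirm from the logs whether the nagios service really restarts at those times, something like the following could be used. This is only a sketch: the function name list_restart_events is made up for illustration, the default path assumes the standard Nagios XI log location, and it relies on GNU date to convert the epoch timestamps that Nagios Core writes at the start of each log line.

```shell
#!/bin/bash
# Sketch: show when the Nagios daemon restarted, shut down, or started,
# based on the messages Nagios Core writes to nagios.log.
# Assumption: the log is at the default XI location unless a path is passed in.
list_restart_events() {
    local logfile="${1:-/usr/local/nagios/var/nagios.log}"

    # Log lines look like: [1479241012] Caught SIGHUP, restarting...
    grep -E 'Caught SIG(HUP|TERM)|Nagios [0-9.]+ starting' "$logfile" |
    while IFS= read -r line; do
        epoch="${line#\[}"      # drop the leading "["
        epoch="${epoch%%]*}"    # keep only the epoch seconds
        msg="${line#*] }"       # everything after "] "
        # GNU date: convert epoch seconds to local time
        printf '%s %s\n' "$(date -d "@$epoch" '+%Y-%m-%d %H:%M:%S')" "$msg"
    done
}
```

Running `list_restart_events` after an Apply Configuration, or after a checkpoint interval has elapsed, and comparing the timestamps with when the GUI went blank should show whether the two line up.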
rkennedy wrote:Are you running a GUI on top of XI or are these checks running headless? I recommend having Selenium run on its own machine, and outsourcing the checks to it, rather than running everything on one.
The Nagios XI server uses runlevel 3 (terminal only, no GUI), and Selenium runs headless. Migrating Selenium to its own machine has been on my to-do list for a while, but I just haven't gotten around to it. I will prioritize that.
rkennedy wrote:Do you have a local check running for CPU on the localhost machine? If so, please apply an event handler with this as the contents and this will produce a log file which shows us the highest spiking CPU processes
Yes, I have configured the event handler as you suggested.
The load was already in a critical state when the event handler was applied, so I ran the script manually, and got the following output:
Code: Select all
25.1 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
19.6 /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479241012.perfdata.service
19.2 /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479240967.perfdata.service
16.0 /usr/bin/perl -w /usr/local/nagios/libexec/check_ifoperstatus -H 10.REDACTED -C REDACTED -k 215 -n Po2 -v 2
14.0 /usr/bin/perl -w /usr/local/nagios/libexec/check_ifoperstatus -H 10.REDACTED -C REDACTED -k 122 -n Gi6/16 -v 2
6.4 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
3.7 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
3.3 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php
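For anyone wanting a similar snapshot, output of this shape (CPU percentage followed by the full command line) can be produced with a short ps pipeline. The sketch below is not necessarily the script rkennedy supplied; the function name top_cpu_processes and the default of 8 lines are my own choices for illustration.

```shell
#!/bin/bash
# Sketch: print the N processes currently using the most CPU,
# one per line, as "<%CPU> <full command line>".
# Assumption: N defaults to 8; this is not the original event handler script.
top_cpu_processes() {
    local count="${1:-8}"
    # "-o pcpu= -o args=" prints both columns without a header line;
    # sort on the first field numerically, highest first.
    ps -e -o pcpu= -o args= | sort -k1,1nr | head -n "$count"
}
```

Calling `top_cpu_processes 8 >> /tmp/cpu_spikes.log` from an event handler (the log path here is just an example) would append a snapshot each time the CPU check changes state.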