Load Spikes on 7 Hour Intervals

drcentner · Post by **drcentner** » Mon Nov 14, 2016 7:48 pm

Our production Nagios server is experiencing the behavior described at https://support.nagios.com/kb/article.php?id=150 (Nagios XI - Core 4 Load Spikes on 1.75 and 7 Hour Intervals) at 7 hour intervals. So far, we have been riding these spikes out, as they typically self-resolve in about 1.5 hours. We have examined OS metrics, but have not been able to pinpoint what is causing the high load. Can you provide some additional details regarding what specifically causes this and/or additional troubleshooting steps?

The issue started while on Nagios XI 5.2.9 on Oct 6th (3.5 months after we upgraded to Nagios XI 5.2.9). We upgraded to 5.3.0 about a week later, but it did not help.

System Profile:

System:
Nagios XI Version : 5.3.0
Manual Install of XI
monitoringserver.domainname.edu 2.6.32-642.6.2.el6.x86_64 x86_64
Red Hat Enterprise Linux Server release 6.8 (Santiago)
Gnome is not installed

Apache Information
PHP Version: 5.3.3
Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36
Server Name: monitoringserver.domainname.edu
Server Address: 10.X.X.X
Server Port: 443

Proxy: no
Configurations:
LDAP/AD Authentication
Large Installation Tweaks
RAM Disk

Monitored Hosts: 2009
Monitored Services: 14537

avandemore · Post by **avandemore** » Tue Nov 15, 2016 10:34 am

It is not clear from your post if you followed the workaround in the KB article. Did you change the setting in /usr/local/nagiosxi/html/config.inc.php?

drcentner · Post by **drcentner** » Tue Nov 15, 2016 11:01 am

I did temporarily apply the workaround, using 360 minutes, which did prevent the load spikes. However, after realizing that the workaround was restarting services every 6 hours, I removed the workaround, as restarting services on our production monitoring server every 6 hours is not allowed by my organization, especially not when this occurs during business hours.

rkennedy · Post by **rkennedy** » Tue Nov 15, 2016 11:18 am

Could you PM over a profile for us to look at? (Admin -> System Profile -> Download Profile)

I'd like to look at bit further into what's going on with your system.

avandemore · Post by **avandemore** » Tue Nov 15, 2016 11:43 am

drcentner wrote:However, after realizing that the workaround was restarting services every 6 hours

I am a little confused by this statement. Changing that setting shouldn't restart services. Can you elaborate on what you see?

drcentner · Post by **drcentner** » Tue Nov 15, 2016 2:05 pm

rkennedy wrote:Could you PM over a profile for us to look at? (Admin -> System Profile -> Download Profile)

I PMed the profile zip to @rkennedy and @avandemore.

avandemore wrote:
drcentner wrote:However, after realizing that the workaround was restarting services every 6 hours
I am a little confused by this statement. Changing that setting shouldn't restart services. Can you elaborate on what you see?

When the nom_checkpoint_interval was temporarily adjusted to 360 (6 hours), this resulted in the Nagios XI CCM configuration being applied at those 6 hour intervals, which includes restarting the nagios service, which results in several minutes of downtime at each interval. The KB even states "[The nom_checkpoint_interval] should be set to occur as close to the high load anomaly as possible as to minimize system downtime and stress".

avandemore · Post by **avandemore** » Tue Nov 15, 2016 2:24 pm

drcentner wrote:which includes restarting the nagios service

This is not the default behavior. Is your system altered in any way? What is the contents of /usr/local/nagiosxi/scripts/nom_create_nagioscore_checkpoint_cond.sh? Can you specify exactly how you know this is causing the nagios service to restart? Do you have logs or anything? I have run through the standard code path and there is nothing that would restart services.

It doesn't appear your issue is related to the KB article. @rkennedy is reviewing your profile in detail.

rkennedy · Post by **rkennedy** » Tue Nov 15, 2016 2:33 pm

In addition to what @avandemore mentioned - I don't think this is going to be related to the article you posted.

Code: Select all

[11-13-2016 00:42:44] NPCD: WARN: MAX load reached: load 58.630000/36.000000 at i=0
[11-13-2016 00:43:29] NPCD: WARN: MAX load reached: load 53.020000/36.000000 at i=1
[11-13-2016 00:44:14] NPCD: WARN: MAX load reached: load 46.270000/36.000000 at i=1
[11-13-2016 00:46:49] NPCD: WARN: MAX load reached: load 36.340000/36.000000 at i=0
[11-13-2016 00:47:34] NPCD: WARN: MAX load reached: load 48.530000/36.000000 at i=1
[11-13-2016 00:48:19] NPCD: WARN: MAX load reached: load 50.510000/36.000000 at i=1
[11-13-2016 00:49:04] NPCD: WARN: MAX load reached: load 52.640000/36.000000 at i=1
[11-13-2016 00:52:34] NPCD: WARN: MAX load reached: load 49.610000/36.000000 at i=0
[11-13-2016 00:53:19] NPCD: WARN: MAX load reached: load 46.890000/36.000000 at i=1
[11-13-2016 00:54:04] NPCD: WARN: MAX load reached: load 45.250000/36.000000 at i=1
[11-13-2016 19:09:57] NPCD: ERROR: Executed command exits with return code '7'
[11-13-2016 19:09:57] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479082161.perfdata.service'
[11-13-2016 19:09:58] NPCD: ERROR: Executed command exits with return code '7'
[11-13-2016 19:09:58] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479082115.perfdata.service'
[11-13-2016 22:12:03] NPCD: WARN: MAX load reached: load 39.000000/36.000000 at i=0
[11-13-2016 22:12:48] NPCD: WARN: MAX load reached: load 40.610000/36.000000 at i=1
[11-13-2016 22:13:33] NPCD: WARN: MAX load reached: load 52.140000/36.000000 at i=1
[11-13-2016 22:14:18] NPCD: WARN: MAX load reached: load 36.740000/36.000000 at i=1
[11-13-2016 22:17:22] NPCD: ERROR: Executed command exits with return code '7'
[11-13-2016 22:17:22] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479093365.perfdata.service'
[11-13-2016 22:17:22] NPCD: ERROR: Executed command exits with return code '7'
[11-13-2016 22:17:22] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479093411.perfdata.service'
[11-13-2016 22:18:07] NPCD: WARN: MAX load reached: load 44.240000/36.000000 at i=0
[11-13-2016 22:18:52] NPCD: WARN: MAX load reached: load 53.070000/36.000000 at i=1
[11-13-2016 22:23:12] NPCD: WARN: MAX load reached: load 55.360000/36.000000 at i=0
[11-13-2016 22:23:57] NPCD: WARN: MAX load reached: load 52.210000/36.000000 at i=1
[11-13-2016 22:27:12] NPCD: WARN: MAX load reached: load 36.680000/36.000000 at i=0
[11-13-2016 22:27:57] NPCD: WARN: MAX load reached: load 47.420000/36.000000 at i=1
[11-13-2016 22:28:42] NPCD: WARN: MAX load reached: load 58.420000/36.000000 at i=1
[11-13-2016 22:32:09] NPCD: WARN: MAX load reached: load 39.250000/36.000000 at i=0
[11-13-2016 22:32:54] NPCD: WARN: MAX load reached: load 51.200000/36.000000 at i=1
[11-13-2016 22:33:39] NPCD: WARN: MAX load reached: load 50.830000/36.000000 at i=1
[11-13-2016 22:37:11] NPCD: WARN: MAX load reached: load 38.830000/36.000000 at i=0
[11-13-2016 22:37:56] NPCD: WARN: MAX load reached: load 47.200000/36.000000 at i=1
[11-13-2016 22:38:41] NPCD: WARN: MAX load reached: load 55.500000/36.000000 at i=1
[11-13-2016 22:42:31] NPCD: WARN: MAX load reached: load 41.700000/36.000000 at i=0
[11-13-2016 22:43:16] NPCD: WARN: MAX load reached: load 54.590000/36.000000 at i=1
[11-13-2016 22:44:01] NPCD: WARN: MAX load reached: load 55.340000/36.000000 at i=1
[11-13-2016 22:46:36] NPCD: WARN: MAX load reached: load 41.840000/36.000000 at i=0
[11-13-2016 22:48:13] NPCD: WARN: MAX load reached: load 50.250000/36.000000 at i=0
[11-13-2016 22:48:58] NPCD: WARN: MAX load reached: load 47.860000/36.000000 at i=1
[11-13-2016 22:51:33] NPCD: WARN: MAX load reached: load 43.730000/36.000000 at i=0
[11-13-2016 22:53:12] NPCD: WARN: MAX load reached: load 48.500000/36.000000 at i=0
[11-13-2016 22:53:57] NPCD: WARN: MAX load reached: load 46.670000/36.000000 at i=1
[11-13-2016 22:57:19] NPCD: WARN: MAX load reached: load 36.790000/36.000000 at i=0
[11-13-2016 22:58:04] NPCD: WARN: MAX load reached: load 42.770000/36.000000 at i=1
[11-13-2016 22:58:49] NPCD: WARN: MAX load reached: load 44.150000/36.000000 at i=1
[11-13-2016 23:02:24] NPCD: WARN: MAX load reached: load 40.260000/36.000000 at i=0
[11-13-2016 23:03:09] NPCD: WARN: MAX load reached: load 37.500000/36.000000 at i=1
[11-13-2016 23:03:54] NPCD: WARN: MAX load reached: load 44.390000/36.000000 at i=1
[11-13-2016 23:07:23] NPCD: WARN: MAX load reached: load 42.540000/36.000000 at i=0
[11-13-2016 23:08:08] NPCD: WARN: MAX load reached: load 37.210000/36.000000 at i=1
[11-13-2016 23:08:53] NPCD: WARN: MAX load reached: load 39.990000/36.000000 at i=1
[11-13-2016 23:10:08] NPCD: ERROR: Executed command exits with return code '7'
[11-13-2016 23:10:08] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479096381.perfdata.service'
[11-13-2016 23:10:09] NPCD: ERROR: Executed command exits with return code '7'
[11-13-2016 23:10:09] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479096425.perfdata.service'
[11-13-2016 23:11:51] NPCD: WARN: MAX load reached: load 39.190000/36.000000 at i=0
[11-13-2016 23:12:36] NPCD: WARN: MAX load reached: load 38.210000/36.000000 at i=1
[11-13-2016 23:16:51] NPCD: WARN: MAX load reached: load 46.340000/36.000000 at i=0
[11-13-2016 23:17:36] NPCD: WARN: MAX load reached: load 40.340000/36.000000 at i=1
[11-13-2016 23:22:20] NPCD: WARN: MAX load reached: load 37.620000/36.000000 at i=0
[11-13-2016 23:23:56] NPCD: WARN: MAX load reached: load 40.850000/36.000000 at i=0
[11-14-2016 01:09:30] NPCD: ERROR: Executed command exits with return code '7'
[11-14-2016 01:09:30] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479103716.perfdata.service'
[11-14-2016 13:08:15] NPCD: ERROR: Executed command exits with return code '7'
[11-14-2016 13:08:15] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479146851.perfdata.service'
[11-14-2016 19:08:27] NPCD: ERROR: Executed command exits with return code '7'
[11-14-2016 19:08:27] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479168461.perfdata.service'
[11-14-2016 20:23:28] NPCD: WARN: MAX load reached: load 40.560000/36.000000 at i=0
[11-14-2016 20:28:32] NPCD: WARN: MAX load reached: load 41.840000/36.000000 at i=0
[11-14-2016 20:33:30] NPCD: WARN: MAX load reached: load 41.930000/36.000000 at i=0
[11-14-2016 20:41:49] NPCD: WARN: MAX load reached: load 36.370000/36.000000 at i=0
[11-14-2016 20:43:27] NPCD: WARN: MAX load reached: load 37.220000/36.000000 at i=0
[11-14-2016 20:45:34] NPCD: ERROR: Executed command exits with return code '7'
[11-14-2016 20:45:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479174221.perfdata.service'
[11-14-2016 20:45:34] NPCD: ERROR: Executed command exits with return code '7'
[11-14-2016 20:45:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479174266.perfdata.service'
[11-14-2016 20:48:01] NPCD: WARN: MAX load reached: load 38.060000/36.000000 at i=0
[11-14-2016 20:48:46] NPCD: WARN: MAX load reached: load 40.720000/36.000000 at i=1
[11-14-2016 20:53:54] NPCD: WARN: MAX load reached: load 38.210000/36.000000 at i=0
[11-14-2016 20:56:20] NPCD: WARN: MAX load reached: load 36.730000/36.000000 at i=0
[11-14-2016 20:57:05] NPCD: WARN: MAX load reached: load 43.530000/36.000000 at i=1
[11-14-2016 20:57:50] NPCD: WARN: MAX load reached: load 39.370000/36.000000 at i=1
[11-14-2016 20:58:35] NPCD: WARN: MAX load reached: load 40.320000/36.000000 at i=1
[11-14-2016 21:01:37] NPCD: ERROR: Executed command exits with return code '7'
[11-14-2016 21:01:37] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479175257.perfdata.service'
[11-14-2016 21:03:13] NPCD: WARN: MAX load reached: load 37.560000/36.000000 at i=0
[11-14-2016 21:08:22] NPCD: WARN: MAX load reached: load 36.920000/36.000000 at i=0
[11-14-2016 21:13:21] NPCD: WARN: MAX load reached: load 36.120000/36.000000 at i=0

The load appears to be spiking for quite a few hours, not just a 7 hour interval.

Judging by the top processes running, I believe this is affecting it -

Code: Select all

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
selenium 16915  0.3  2.8 12263560 468976 ?     Sl   Nov13  12:59 java -jar /usr/local/selenium/selenium-server-standalone-2.53.1.jar
selenium  7245  0.4  1.4 1311772 233800 ?      Sl   Nov13  14:55 /usr/lib64/firefox/firefox -profile /tmp/customProfileDir62e7ad5a1aed43c1bc8ee19dcf808763
selenium 10240  0.4  1.4 1289340 229764 ?      Sl   Nov13   9:30 /usr/lib64/firefox/firefox -profile /tmp/customProfileDir71d3db2e96854d2c8c23050d96ed5d5c
selenium 21381  0.3  1.3 1272944 222332 ?      Sl   Nov13   9:10 /usr/lib64/firefox/firefox -profile /tmp/customProfileDir412d3a12d065497893f4bf8fa64c74df
selenium 13459  0.3  1.2 1283128 207940 ?      Sl   Nov13   9:36 /usr/lib64/firefox/firefox -profile /tmp/customProfileDirf9a6409a07b94cf299b55726c2c1164d
selenium  9936  0.6  1.2 1278392 198516 ?      Sl   11:38   0:52 /usr/lib64/firefox/firefox -profile /tmp/customProfileDir021fa11e01c04ad0bef58c2b4fa8b116
selenium 32529  0.3  1.1 1287592 193656 ?      Sl   Nov13  10:31 /usr/lib64/firefox/firefox -profile /tmp/customProfileDire15673ddd2ef4bf3961b1cadba957dc1
selenium 10328  0.4  1.1 1274068 190348 ?      Sl   04:38   2:33 /usr/lib64/firefox/firefox -profile /tmp/customProfileDir03802060bd5f40cf8a392f618f213040
root     16911  0.3  0.8 417636 136316 ?       Sl   Nov13  11:35 Xvfb :34 -screen 0 1024x768x24

Are you running a GUI on top of XI or are these checks running headless? I recommend having Selenium run on it's own machine, and outsourcing the checks to it, rather then running everything on one. With this amount of checks, I believe you're simply hitting growing pains -

Code: Select all

Total Hosts: 2010 
Total Services: 14543

Do you have a local check running for CPU on the localhost machine? If so, please apply an event handler with this as the contents and this will produce a log file which shows us the highest spiking CPU processes -

Code: Select all

#!/bin/bash
date=$(date)
echo -e "$date" >> /tmp/checktopcpu.txt
ps -eo pcpu,args --sort=-%cpu|head >> /tmp/checktopcpu.txt
echo -e "\n" >> /tmp/checktopcpu.txt

drcentner · Post by **drcentner** » Tue Nov 15, 2016 3:40 pm

avandemore wrote:Is your system altered in any way?

It has been altered, but most (if not all) alterations have been based off of the Nagios XI tutorials. It was originally built using Nagios XI 2011, I believe, and has been upgraded several times since then. I have an upgrade to 5.3.2 scheduled for tonight.

avandemore wrote:What is the contents of /usr/local/nagiosxi/scripts/nom_create_nagioscore_checkpoint_cond.sh?

Code: Select all

#!/bin/bash
# Create a conditional NOM checkpoint
# Copyright (c) 2008-2015 Nagios Enterprises, LLC.  All rights reserved.
# $Id$

scriptsdir=/usr/local/nagiosxi/scripts

/etc/init.d/nagios checkconfig
ret=$?


if [ $ret -eq 0 ]; then
    pushd $scriptsdir
    ./nom_create_nagioscore_checkpoint.sh
    popd
    echo "Config test passed.  Checkpoint created."
    exit 0
else
    echo "Config test failed.  Checkpoint aborted."
    exit 1
fi

avandemore wrote:Can you specify exactly how you know this is causing the nagios service to restart? Do you have logs or anything?

Any time that we apply a new configuration from Nagios XI CCM, the Nagios XI web GUI takes several minutes (3-5?) to recover before it becomes usable again/returns accurate data. During this period, users will experience all of the following symptoms during the various stages of recovering:

Nagios XI shows no hosts and no services
Nagios XI fails to recognize acknowledgements and scheduled downtimes, which results in hosts/services that have been ACKed or DTed to show up on the Operations Center component (used by our NOC)

While the nom_checkpoint_interval was temporarily set to 360, we were observing this same behavior at 6 hour intervals, but I did not check logs to attempt to trace exactly what happened. As you both believe that KB to not be related, I will leave nom_checkpoint_interval set to the default 1440, but if you suggest, I could at least set it back to 360 to verify/investigate what was causing the web GUI behavior that appeared to indicate service restarts.

rkennedy wrote:Are you running a GUI on top of XI or are these checks running headless? I recommend having Selenium run on it's own machine, and outsourcing the checks to it, rather then running everything on one.

The Nagios XI server uses runlevel 3 (terminal only, no GUI), and selenium runs headless. Migrating Selenium to its own machine has been on my to-do list for a while, but just haven't gotten around to it. I will prioritize that.

rkennedy wrote:Do you have a local check running for CPU on the localhost machine? If so, please apply an event handler with this as the contents and this will produce a log file which shows us the highest spiking CPU processes

Yes, I have configured the event handler as you suggested.

The load was already in a critical state when the event handler was applied, so I ran the script manually, and got the following output:

Code: Select all

25.1 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
19.6 /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479241012.perfdata.service
19.2 /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1479240967.perfdata.service
16.0 /usr/bin/perl -w /usr/local/nagios/libexec/check_ifoperstatus -H 10.REDACTED -C REDACTED -k 215 -n Po2 -v 2
14.0 /usr/bin/perl -w /usr/local/nagios/libexec/check_ifoperstatus -H 10.REDACTED -C REDACTED -k 122 -n Gi6/16 -v 2
 6.4 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
 3.7 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
 3.3 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php

avandemore · Post by **avandemore** » Tue Nov 15, 2016 4:01 pm

drcentner wrote: Any time that we apply a new configuration from Nagios XI CCM, the Nagios XI web GUI takes several minutes (3-5?) to recover before it becomes usable again/returns accurate data. During this period, users will experience all of the following symptoms during the various stages of recovering:

Nagios XI shows no hosts and no services
Nagios XI fails to recognize acknowledgements and scheduled downtimes, which results in hosts/services that have been ACKed or DTed to show up on the Operations Center component (used by our NOC)

Yeah that's going to be a different issue. You should be able to resolve this part by setting retain_status_information=1:
Host Definition
Service Definition

nom_checkpoint_interval doesn't restart the nagios daemon so if it is being restarted it must come from elsewhere. Basically controls how often /etc/init.d/nagios checkconfig is run.

An Apply Configuration will restart the Nagios daemon.

You can make the change to the interval if you wish, it should have little impact either way.

Nagios Support Forum

Load Spikes on 7 Hour Intervals

Load Spikes on 7 Hour Intervals

Re: Load Spikes on 7 Hour Intervals

Re: Load Spikes on 7 Hour Intervals

Re: Load Spikes on 7 Hour Intervals

Re: Load Spikes on 7 Hour Intervals

Re: Load Spikes on 7 Hour Intervals

Re: Load Spikes on 7 Hour Intervals

Re: Load Spikes on 7 Hour Intervals

Re: Load Spikes on 7 Hour Intervals

Re: Load Spikes on 7 Hour Intervals