High Load Average

Post by **cbeattie-unitrends** » Thu Jun 15, 2017 11:45 am

The other day, my Nagios XI host started exhibiting a high load average. The host had stopped responding to anything a couple times recently, but I ran the database repair script afterwards and everything went back to normal. I've even run the database repair script today, to no avail. If I stop Nagios XI, the load average returns to normal, but shoots back up within a couple minutes of restarting Nagios. This computer has 12 CPUs and the load average is typically 3-8.

CentOS 7 x86_64
Nagios XI 5.4.5, manual install

Load average, high system%, and lots of zombies. The zombies go away if I stop Nagios.

Code: Select all

top - 09:29:29 up  1:12,  2 users,  load average: 25.48, 24.34, 21.97
Tasks: 371 total,  26 running, 227 sleeping,   0 stopped, 118 zombie
%Cpu(s): 18.3 us, 81.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
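
For scale, a load average is easier to judge relative to the CPU count. A minimal sketch, assuming a Linux host; the function name `load_per_cpu` is made up for illustration and is not part of Nagios:

```shell
# Print a load average divided by the CPU count, to two decimals.
# Usage: load_per_cpu LOAD NCPU
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# On a live host (not run here), the inputs could come from:
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

With the figures from this post, `load_per_cpu 25.48 12` prints 2.12, about twice what 12 CPUs can run at once.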

strace of a Nagios process. I don't know quite what this is showing me, but I can't imagine that a 1:1 ratio of calls to errors is good.

Code: Select all

[root@den-nagios var]# strace -c -p 25569
Process 25569 attached
^CProcess 25569 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.207732           0    837073    837073 write
------ ----------- ----------- --------- --------- ----------------
100.00    0.207732                837073    837073 total

[Attachment: den-nagios-current_load.png]
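
One way to read a summary like the one above is to list only the syscalls where every call failed. This is a rough sketch that assumes the six-column `strace -c` layout shown in this post; `always_failing_syscalls` is a hypothetical helper name:

```shell
# Read `strace -c` summary text on stdin and print "SYSCALL CALLS" for each
# syscall whose error count equals its call count (every call failed).
# Header, separator, and "total" rows are skipped by the field checks.
always_failing_syscalls() {
    awk 'NF == 6 && $1 ~ /^[0-9.]+$/ && $6 != "total" && $4 == $5 { print $6, $4 }'
}

# To see which error the calls return, strace can trace just that syscall
# (not run here; PID is a placeholder):
#   strace -p PID -e trace=write
```

Fed the summary from this post, it prints `write 837073`.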

Post by **tgriep** » Thu Jun 15, 2017 1:44 pm

It could also be a Kernel Message Queue issue causing higher load on the server.
Take a look at this KB article and see if it helps drop the load of your server.
https://support.nagios.com/kb/article/n ... eeded.html

Post by **cbeattie-unitrends** » Thu Jun 15, 2017 4:28 pm

The four symptoms described in that article all fit (though I only had one queue, not multiple). I applied the values suggested in the article, but that didn't fix the high load average.

Code: Select all

top - 15:27:49 up 20 min,  2 users,  load average: 21.15, 20.41, 16.23
Tasks: 497 total,  20 running, 267 sleeping,   0 stopped, 210 zombie
%Cpu(s): 16.7 us, 81.0 sy,  0.0 ni,  2.2 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32931056 total, 30751316 free,  1362784 used,   816956 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 31002032 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  2137 nagios    20   0   10772   1088    748 R 100.0  0.0  11:55.25 nagios
  2138 nagios    20   0   10772   1088    748 R 100.0  0.0  11:47.25 nagios
  2148 nagios    20   0   10772   1092    748 R 100.0  0.0  12:07.52 nagios

kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 512000
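
To confirm that values like these are actually live in the running kernel, they can be compared against `sysctl -n`. A minimal sketch; `read_sysctl` and `sysctl_mismatches` are hypothetical helper names:

```shell
# Hypothetical helper: reads the live value of one kernel parameter.
# It is kept separate so the comparison can be tried without a real system.
read_sysctl() { sysctl -n "$1"; }

# Read "key = value" lines on stdin and report every key whose live value
# differs from the wanted one. Prints nothing when everything matches.
sysctl_mismatches() {
    while IFS='= ' read -r key value; do
        [ -n "$key" ] || continue
        live=$(read_sysctl "$key")
        [ "$live" = "$value" ] || echo "$key: wanted $value, live $live"
    done
}
```

The five `kernel.*` lines above could be pasted into a file and piped in, for example `sysctl_mismatches < wanted.conf`, where `wanted.conf` is a made-up file name.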

Post by **tgriep** » Fri Jun 16, 2017 9:25 am

Can you post or PM me your System Profile so we can view the system's settings and logs?
To send us your system profile, log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" Menu
Click the "Download Profile" button
Save the profile.zip file and post or PM it to me.

Post by **tgriep** » Mon Jun 19, 2017 8:46 am

In the Apache log file, I am seeing a lot of connection errors to the MySQL database.
You may need to increase max_connections for the database; the following article has instructions for doing that.
https://support.nagios.com/kb/article/n ... tions.html

There are also a lot of crashed PHP processes. Kill them all off and that should stop them from using any system resources.

Code: Select all

nagios    11477   2147  0 Jun15 ?        00:00:00 [php] <defunct>
nagios    11514   2143  0 Jun15 ?        00:00:00 [php] <defunct>
nagios    11568   2151  0 Jun15 ?        00:00:00 [php] <defunct>
nagios    11580   2137  0 Jun15 ?        00:00:00 [php] <defunct>
nagios    11602   2142  0 Jun15 ?        00:00:00 [php] <defunct>

Try that out and let us know if it helps.
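
A `<defunct>` process has already exited and stays in the process table only until its parent collects its exit status, so it can help to see which parents hold them. PID 2137, for instance, appears both as a parent in the listing above and as a busy nagios process in the earlier top output. A rough sketch, with `zombies_by_parent` as a hypothetical helper name:

```shell
# Read lines of "PID PPID STAT COMMAND" on stdin and print "PPID COUNT"
# for every parent that has zombie children (STAT starting with Z).
zombies_by_parent() {
    awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' | sort -n
}

# On a live host (not run here):
#   ps -eo pid=,ppid=,stat=,comm= | zombies_by_parent
```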

Post by **cbeattie-unitrends** » Mon Jun 19, 2017 3:39 pm

tgriep wrote: In the Apache log file, I am seeing a lot of connection errors to the MySQL database.
You may need to increase max_connections for the database; the following article has instructions for doing that.
...
There are also a lot of crashed PHP processes. Kill them all off and that should stop them from using any system resources.

The zombie PHP processes disappear when I stop Nagios.

I implemented the instructions, but whatever is causing the high load is also causing Nagios to hang. The web site still works, but it's like Nagios isn't doing any work. The checks I set up from the article's instructions are still pending, even after waiting for a few check intervals to pass. Manually executing the command did reveal that max_connections needed to be changed anyway, though.

Code: Select all

[root@den-nagios etc]# mysql -uroot -pnagiosxi -e "show global status like 'Max_used_connections';"
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 524   |
+----------------------+-------+
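
To put a figure like this next to the configured limit, the value can be pulled out of the table output and expressed as a percentage of `max_connections`. A minimal sketch, assuming the boxed table layout shown above; both function names are hypothetical:

```shell
# Read a mysql table-style result on stdin and print the Value column
# for the row whose Variable_name equals $1.
mysql_value() {
    awk -F'|' -v name="$1" '{ gsub(/ /, "", $2); gsub(/ /, "", $3) } $2 == name { print $3 }'
}

# Print USED as a whole-number percentage of LIMIT (integer division).
# Usage: percent_used USED LIMIT
percent_used() {
    echo $(( $1 * 100 / $2 ))
}
```

The limit itself can be read with `show global variables like 'max_connections';` and passed through `mysql_value max_connections` in the same way.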

Post by **dwhitfield** » Mon Jun 19, 2017 3:46 pm

Can you attach your my.cnf for review?

Also, please run through the following instructions. Regarding the instructions below, if you do not have killall, you can install it via the following command:
# yum install psmisc

If psmisc is not in your repos, then instead you can check to make sure nagios is not running with
# ps -aef | grep nagios

If that document does not resolve your issue, please run the following commands in order and report any errors. You ***must*** use mariadb instead of mysqld in the commands below, ***if*** you have mariadb.
# service nagios stop
# service ndo2db stop
# service mysqld stop
# service crond stop
# service httpd stop
# killall -9 nagios
# killall -9 ndo2db
# rm -rf /usr/local/nagios/var/rw/nagios.cmd
# rm -rf /usr/local/nagios/var/nagios.lock
# rm -f /usr/local/nagios/var/ndo.sock
# rm -f /usr/local/nagios/var/ndo2db.lock
# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
# for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
# service ndo2db start
# service nagios start
# service mysqld start
# service crond start
# service httpd start
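
The `ipcs`/`ipcrm` loop above matches any line containing "nagios". A slightly stricter variant, sketched here on the assumption that `ipcs -q` prints the usual util-linux columns (key, msqid, owner, ...), matches the owner column exactly. `queue_ids_for_owner` is a hypothetical helper name:

```shell
# Read `ipcs -q` output on stdin and print the msqid of each queue whose
# owner column is exactly $1 (default: nagios). Data rows are recognised
# by a key that starts with 0x; header lines are ignored.
queue_ids_for_owner() {
    awk -v owner="${1:-nagios}" '$1 ~ /^0x/ && $3 == owner { print $2 }'
}

# On a live host, as root (not run here):
#   ipcs -q | queue_ids_for_owner nagios | while read -r id; do ipcrm -q "$id"; done
```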

Post by **cbeattie-unitrends** » Tue Jun 20, 2017 11:03 am

There is something crashing or locking up almost as soon as I start Nagios. When I stop Nagios, it takes a couple minutes before the prompt returns. But the zombies disappear and the load average goes back to normal.

Code: Select all

[root@den-nagios ~]# more /etc/my.cnf
[mysqld]
query_cache_size=16M
query_cache_limit=4M
tmp_table_size=64M
max_heap_table_size=64M
key_buffer_size=32M
table_open_cache=32
max_connections=818

datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under a different user or group,
# customize your systemd unit file for mariadb according to the
# instructions in http://fedoraproject.org/wiki/Systemd

[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid

#
# include all files from the config directory
#
!includedir /etc/my.cnf.d

[root@den-nagios ~]# service nagios stop
Stopping nagios (via systemctl):                           [  OK  ]
[root@den-nagios ~]# service ndo2db stop
Stopping ndo2db (via systemctl):                           [  OK  ]
[root@den-nagios ~]# service mariadb stop
Redirecting to /bin/systemctl stop  mariadb.service
[root@den-nagios ~]# service crond stop
Redirecting to /bin/systemctl stop  crond.service
[root@den-nagios ~]# service httpd stop
Redirecting to /bin/systemctl stop  httpd.service
[root@den-nagios ~]# killall -9 nagios
nagios: no process found
[root@den-nagios ~]# killall -9 ndo2db
ndo2db: no process found
[root@den-nagios ~]# rm -rf /usr/local/nagios/var/rw/nagios.cmd
[root@den-nagios ~]# rm -rf /usr/local/nagios/var/nagios.lock
[root@den-nagios ~]# rm -f /usr/local/nagios/var/ndo.sock
[root@den-nagios ~]# rm -f /usr/local/nagios/var/ndo2db.lock
[root@den-nagios ~]# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
[root@den-nagios ~]# for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
[root@den-nagios ~]# service ndo2db start
Starting ndo2db (via systemctl):                           [  OK  ]
[root@den-nagios ~]# service nagios start
Starting nagios (via systemctl):                           [  OK  ]
[root@den-nagios ~]# service mariadb start
Redirecting to /bin/systemctl start  mariadb.service
[root@den-nagios ~]# service crond start
Redirecting to /bin/systemctl start  crond.service
[root@den-nagios ~]# service httpd start
Redirecting to /bin/systemctl start  httpd.service
[root@den-nagios ~]#

Post by **dwhitfield** » Tue Jun 20, 2017 2:17 pm

You need to take a look at https://assets.nagios.com/downloads/nag ... ios-XI.pdf

The two main things to think about are:
A) If you use mod_gearman, you must use remote workers. Remote in this case means different servers on the same network. (Theoretically, you could use a different network, but then there are firewall issues, and that's not really solving the issue at hand.)
B) For any checks whose check period you can increase, increase it.

Ultimately, we generally tell people to split things to 2 servers at 20k checks. You are over 30k. That might just need to be what you do.
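
As a back-of-the-envelope illustration of point B, the average check rate falls in direct proportion as the interval grows. This sketch assumes, unrealistically, that every check runs at the same interval; `checks_per_second` is a made-up name:

```shell
# Average checks per second if CHECKS checks each run once every
# INTERVAL_MIN minutes. Usage: checks_per_second CHECKS INTERVAL_MIN
checks_per_second() {
    awk -v n="$1" -v m="$2" 'BEGIN { printf "%.1f\n", n / (m * 60) }'
}
```

For 30,000 checks, `checks_per_second 30000 5` prints 100.0 and `checks_per_second 30000 10` prints 50.0.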

Post by **SteveBeauchemin** » Tue Jun 20, 2017 2:51 pm

I sympathize... I have many hosts and services too.

Try running the system with httpd stopped for a while and see whether the CPU load and everything else looks better. If it does, then you know that some user activity is putting a burden on your system. I have some users who insist on running things like Birdseye, which used to kill me before I made code changes. Anything that uses the legacy avail.cgi program is also bad for me: it uses 100% of CPU and never actually provides useful data.

I also divide between tests that alert and tests that just gather data. If a test matters and needs to alert, I run it every 5 minutes. If a test is just gathering metric information, it runs every 10 or more minutes. Some tests have 15-minute intervals. The more spacing between tests you can live with, the better.

BPI is a killer too... use check_cluster instead where it makes sense.

I run 6 mod_gearman systems on separate hosts. That is the only way I can run with 5000 hosts and 55,000 services.

There are a ton more things I do, but follow the Nagios performance recommendation guide. rrdcache, ramdisk, other stuff... Do all of it.

I also use tail -f /var/log/httpd/*_log to see what is happening. Many times that is where you can find an answer.

Any IO wait states are also bad. Some backup systems steal all your IO cycles.

Also, don't just start up all your processes at once. Stop everything, kill off stuff that didn't stop properly. Then start stuff up one service at a time, make sure that each piece actually starts. Wait and see if the CPU goes bad again then start the next. Try to find exactly what makes things go south.
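
The one-service-at-a-time approach can be sketched as a small loop that stops at the first service after which the load climbs. This is only an illustration: `START_CMD` and `SETTLE_SECS` are hypothetical knobs, the threshold is whatever counts as normal on the host, and `/proc/loadavg` assumes Linux.

```shell
# Print the 1-minute load average (first field of /proc/loadavg).
current_load() { cut -d' ' -f1 /proc/loadavg; }

# Succeed if load $1 is at or below threshold $2 (awk handles decimals).
load_at_most() { awk -v l="$1" -v t="$2" 'BEGIN { exit !(l <= t) }'; }

# Start each named service in turn, wait, then check the load.
# Usage: start_stepwise LIMIT SERVICE...
start_stepwise() {
    limit=$1; shift
    for svc in "$@"; do
        ${START_CMD:-service} "$svc" start || { echo "failed to start $svc"; return 1; }
        sleep "${SETTLE_SECS:-120}"
        if ! load_at_most "$(current_load)" "$limit"; then
            echo "load exceeded $limit after starting $svc"
            return 2
        fi
    done
}
```

A call such as `start_stepwise 8 mariadb ndo2db nagios crond httpd` would then report the first service whose start pushes the load past 8; the service names and order here are only an example.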

Been there, didn't like it much... Good luck

Steve B

Nagios Support Forum