High Load Average
Posted: Thu Jun 15, 2017 11:45 am
by cbeattie-unitrends
The other day, my Nagios XI host started exhibiting a high load average. The host had stopped responding entirely a couple of times recently, but I ran the database repair script afterwards and everything went back to normal. I've even run the database repair script again today, to no avail. If I stop Nagios XI, the load average returns to normal, but it shoots back up within a couple of minutes of restarting Nagios. This machine has 12 CPUs, and the load average is typically 3-8.
CentOS 7 x86_64
Nagios XI 5.4.5, manual install
Symptoms: high load average, high system %, and lots of zombies. The zombies go away if I stop Nagios.
Code: Select all
top - 09:29:29 up 1:12, 2 users, load average: 25.48, 24.34, 21.97
Tasks: 371 total, 26 running, 227 sleeping, 0 stopped, 118 zombie
%Cpu(s): 18.3 us, 81.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
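For scale: a 1-minute load of 25 on a 12-CPU box means the run queue is roughly twice oversubscribed, and the ~82% system time points at the kernel rather than the checks themselves. A quick sketch of the ratio (values are passed explicitly here so the example is reproducible; on a live box you would substitute `nproc` and `/proc/loadavg`):

```shell
# load_per_cpu: ratio of 1-minute load average to CPU count.
# On a live host:
#   load_per_cpu "$(awk '{print $1}' /proc/loadavg)" "$(nproc)"
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

load_per_cpu 25.48 12   # → 2.12
```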
Here is an strace of a Nagios process. I don't quite know what this is showing me, but I can't imagine that a 1:1 ratio of calls to errors is good.
Code: Select all
[root@den-nagios var]# strace -c -p 25569
Process 25569 attached
^CProcess 25569 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.207732 0 837073 837073 write
------ ----------- ----------- --------- --------- ----------------
100.00 0.207732 837073 837073 total
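That summary does confirm the suspicion: every one of the 837,073 `write` calls returned an error, which in a tight loop like this often indicates writes to a broken or full pipe/FIFO (re-running without `-c` would show the actual errno). To pull the fully failing syscalls out of an `strace -c` summary mechanically, a small filter (column layout assumed to match the output above):

```shell
# failing_syscalls: read an `strace -c` summary on stdin and print the
# names of syscalls whose error count equals their call count, i.e.
# calls that failed 100% of the time.
# Columns assumed: %time  seconds  usecs/call  calls  errors  syscall
failing_syscalls() {
    awk '$4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && $4 == $5 { print $6 }'
}

# Example with the summary line from the trace above:
printf '100.00 0.207732 0 837073 837073 write\n' | failing_syscalls   # → write
```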
Attachment: den-nagios-current_load.png
Re: High Load Average
Posted: Thu Jun 15, 2017 1:44 pm
by tgriep
It could be a kernel message queue issue causing the higher load on the server.
Take a look at this KB article and see if it helps bring the load on your server down.
https://support.nagios.com/kb/article/n ... eeded.html
Re: High Load Average
Posted: Thu Jun 15, 2017 4:28 pm
by cbeattie-unitrends
The four symptoms described in that article all fit (though I had only one queue, not multiple). I applied the values suggested in the article, but that didn't fix the high load average.
Code: Select all
top - 15:27:49 up 20 min, 2 users, load average: 21.15, 20.41, 16.23
Tasks: 497 total, 20 running, 267 sleeping, 0 stopped, 210 zombie
%Cpu(s): 16.7 us, 81.0 sy, 0.0 ni, 2.2 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32931056 total, 30751316 free, 1362784 used, 816956 buff/cache
KiB Swap: 16515068 total, 16515068 free, 0 used. 31002032 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2137 nagios 20 0 10772 1088 748 R 100.0 0.0 11:55.25 nagios
2138 nagios 20 0 10772 1088 748 R 100.0 0.0 11:47.25 nagios
2148 nagios 20 0 10772 1092 748 R 100.0 0.0 12:07.52 nagios
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 512000
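For anyone following along, tunables like these are usually persisted in a sysctl drop-in so they survive a reboot. A sketch (the drop-in file name below is just an example, not a Nagios convention):

```shell
# Hypothetical drop-in persisting the message-queue tunables from the
# KB article; the file name is an example, not a Nagios convention.
cat > /etc/sysctl.d/99-nagios-msgqueue.conf <<'EOF'
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 512000
EOF

# Apply without rebooting, then verify the live limits:
sysctl --system
ipcs -q -l    # show current message-queue limits
```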
Re: High Load Average
Posted: Fri Jun 16, 2017 9:25 am
by tgriep
Can you post or PM me your System Profile so we can view the system's settings and logs?
To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file and post it here or PM it to me.
Re: High Load Average
Posted: Mon Jun 19, 2017 8:46 am
by tgriep
In the Apache log file, I am seeing a lot of connection errors to the MySQL database.
You may need to increase max_connections for the database; the following article has instructions for doing that.
https://support.nagios.com/kb/article/n ... tions.html
There are also a lot of crashed PHP processes; kill them all off, and that should stop them from using any system resources.
Code: Select all
nagios 11477 2147 0 Jun15 ? 00:00:00 [php] <defunct>
nagios 11514 2143 0 Jun15 ? 00:00:00 [php] <defunct>
nagios 11568 2151 0 Jun15 ? 00:00:00 [php] <defunct>
nagios 11580 2137 0 Jun15 ? 00:00:00 [php] <defunct>
nagios 11602 2142 0 Jun15 ? 00:00:00 [php] <defunct>
Try that out and let us know if it helps.
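One caveat on the defunct entries above: a zombie has already exited, so `kill -9` on the zombie itself does nothing; it disappears only when its parent reaps it (or the parent dies and init adopts it). A sketch for finding the parents worth signalling or restarting, given `ps -eo stat=,ppid=`-style input:

```shell
# zombie_parents: read `ps -eo stat=,ppid=` output on stdin and print
# the unique PIDs of processes that are parenting zombies (STAT
# beginning with Z).
zombie_parents() {
    awk '$1 ~ /^Z/ { print $2 }' | sort -un
}

# On a live host:  ps -eo stat=,ppid= | zombie_parents
# Example using some of the PPIDs from the defunct PHP workers above:
printf 'Z 2147\nZ 2143\nS 1\nZ 2137\n' | zombie_parents
```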
Re: High Load Average
Posted: Mon Jun 19, 2017 3:39 pm
by cbeattie-unitrends
tgriep wrote:In the Apache log file, I am seeing a lot of connection errors to the MySQL database.
You may need to increase max_connections for the database; the following article has instructions for doing that.
...
There are also a lot of crashed PHP processes, kill them all off and that should stop them from using any system resources.
The zombie PHP processes disappear when I stop Nagios.
I implemented the instructions, but whatever is causing the high load is also causing Nagios to hang. The web site still works, but it's as if Nagios isn't doing any work. The checks I set up from the article's instructions are still pending, even after waiting for a few check intervals to pass. Manually executing the command did reveal that it needed to be changed anyway, though.
Code: Select all
[root@den-nagios etc]# mysql -uroot -pnagiosxi -e "show global status like 'Max_used_connections';"
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Max_used_connections | 524 |
+----------------------+-------+
Re: High Load Average
Posted: Mon Jun 19, 2017 3:46 pm
by dwhitfield
Can you attach your my.cnf for review?
Also, please run through the following instructions. If you do not have killall, you can install it via the following command:
# yum install psmisc
If psmisc is not in your repos, you can instead check that nagios is not running with
# ps -aef | grep nagios
If that document does not resolve your issue, please run the following commands in order and report any errors. You ***must*** use mariadb instead of mysqld in the commands below ***if*** you have mariadb.
# service nagios stop
# service ndo2db stop
# service mysqld stop
# service crond stop
# service httpd stop
# killall -9 nagios
# killall -9 ndo2db
# rm -rf /usr/local/nagios/var/rw/nagios.cmd
# rm -rf /usr/local/nagios/var/nagios.lock
# rm -f /usr/local/nagios/var/ndo.sock
# rm -f /usr/local/nagios/var/ndo2db.lock
# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
# for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
# service ndo2db start
# service nagios start
# service mysqld start
# service crond start
# service httpd start
Re: High Load Average
Posted: Tue Jun 20, 2017 11:03 am
by cbeattie-unitrends
Something is crashing or locking up almost as soon as I start Nagios. When I stop Nagios, it takes a couple of minutes before the prompt returns, but the zombies disappear and the load average returns to normal.
Code: Select all
[root@den-nagios ~]# more /etc/my.cnf
[mysqld]
query_cache_size=16M
query_cache_limit=4M
tmp_table_size=64M
max_heap_table_size=64M
key_buffer_size=32M
table_open_cache=32
max_connections=818
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under a different user or group,
# customize your systemd unit file for mariadb according to the
# instructions in http://fedoraproject.org/wiki/Systemd
[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid
#
# include all files from the config directory
#
!includedir /etc/my.cnf.d
[root@den-nagios ~]# service nagios stop
Stopping nagios (via systemctl): [ OK ]
[root@den-nagios ~]# service ndo2db stop
Stopping ndo2db (via systemctl): [ OK ]
[root@den-nagios ~]# service mariadb stop
Redirecting to /bin/systemctl stop mariadb.service
[root@den-nagios ~]# service crond stop
Redirecting to /bin/systemctl stop crond.service
[root@den-nagios ~]# service httpd stop
Redirecting to /bin/systemctl stop httpd.service
[root@den-nagios ~]# killall -9 nagios
nagios: no process found
[root@den-nagios ~]# killall -9 ndo2db
ndo2db: no process found
[root@den-nagios ~]# rm -rf /usr/local/nagios/var/rw/nagios.cmd
[root@den-nagios ~]# rm -rf /usr/local/nagios/var/nagios.lock
[root@den-nagios ~]# rm -f /usr/local/nagios/var/ndo.sock
[root@den-nagios ~]# rm -f /usr/local/nagios/var/ndo2db.lock
[root@den-nagios ~]# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
[root@den-nagios ~]# for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
[root@den-nagios ~]# service ndo2db start
Starting ndo2db (via systemctl): [ OK ]
[root@den-nagios ~]# service nagios start
Starting nagios (via systemctl): [ OK ]
[root@den-nagios ~]# service mariadb start
Redirecting to /bin/systemctl start mariadb.service
[root@den-nagios ~]# service crond start
Redirecting to /bin/systemctl start crond.service
[root@den-nagios ~]# service httpd start
Redirecting to /bin/systemctl start httpd.service
[root@den-nagios ~]#
Re: High Load Average
Posted: Tue Jun 20, 2017 2:17 pm
by dwhitfield
You need to take a look at
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
The two main things to think about are:
A) If you use mod_gearman, you must use remote workers. Remote in this case means different servers on the same network (theoretically you could use a different network, but then there are firewall issues, and that doesn't really solve the issue at hand).
B) For any checks whose check period you can increase, increase it.
Ultimately, we generally tell people to split things across 2 servers at 20k checks. You are over 30k, so that may simply be what you need to do.
Re: High Load Average
Posted: Tue Jun 20, 2017 2:51 pm
by SteveBeauchemin
I sympathize... I have many hosts and services too.
Try running the system with httpd stopped for a while and see if the CPU load and other stats look better. If they do, then you know it is some user activity that is putting a burden on your system. I have some users who insist on running things like Birdseye, which used to kill me before I made code changes. Anything that uses the legacy avail.cgi program is also bad for me: it uses 100% of a CPU and never actually provides useful data.
I also divide between tests that alert and tests that just gather data. If a test matters and needs to alert, I run it every 5 minutes. If a test is just gathering metric information, it runs every 10 or more minutes; some tests have 15-minute intervals. The more spacing between tests you can live with, the better.
BPI is a killer too... use check_cluster instead where it makes sense.
I run 6 mod_gearman systems on separate hosts. That is the only way I can run with 5000 hosts and 55,000 services.
There are a ton more things I do, but follow the Nagios performance recommendation guide: rrdcached, ramdisk, other stuff... Do all of it.
I also use tail -f /var/log/httpd/*_log to see what is happening. Many times that is where you can find an answer.
Any IO wait states are also bad. Some backup systems steal all your IO cycles.
Also, don't just start all your processes at once. Stop everything and kill off anything that didn't stop properly. Then start things up one service at a time and make sure each piece actually starts. Wait and see if the CPU goes bad again, then start the next. Try to find exactly what makes things go south.
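The "one service at a time" approach above can be scripted loosely. A sketch, where the service list, sleep interval, and helper names are assumptions to adjust for your box; the load-parsing helper is the reproducible part:

```shell
# load1: print the 1-minute load average parsed from `uptime`/`top`-style
# input containing "load average: X, Y, Z".
load1() {
    awk -F'load average: ' '{ split($2, a, ", "); print a[1] }'
}

# Hypothetical staged startup: bring services up one at a time and
# pause after each so a runaway load is easy to attribute.
staged_start() {
    for svc in mariadb ndo2db nagios crond httpd; do
        systemctl start "$svc"
        sleep 120
        echo "$svc: 1-min load is $(uptime | load1)"
    done
}

# Example of the parser against the top header shown earlier in the thread:
echo 'top - 09:29:29 up 1:12, 2 users, load average: 25.48, 24.34, 21.97' | load1   # → 25.48
```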
Been there, didn't like it much... Good luck
Steve B