
XI Load warning 6+ hours

Posted: Mon Jan 13, 2020 3:26 pm
by AUIInternetOps
Please forgive anything we might have left out/not done. This is our first "ticket"/support post since switching to Nagios XI in April 2019.

Red Hat Enterprise Linux Server release 7.6 (Maipo)
Nagios XI 5.5.8
4 CPU
64 bit
Manual Install
free -m
              total        used        free      shared  buff/cache   available
Mem:          16046        1611        4454         817        9980       13287
Swap:          4095          98        3997

load average: 14.93, 14.76, 14.23

Profile.zip attached

Last week we noticed that the critical load warning for the Nagios XI localhost showed a load of ~15 every few days for 2-3 hours and then dropped back down. We followed KB https://support.nagios.com/kb/article/n ... s-150.html, and for the rest of the week it seemed to reduce those spikes from ~15 for a few hours to ~6 for a much shorter time.

But today, the ~15 load warning has been on the board for 6+ hours now. When we run top, there appear to be 5+ httpd processes using anywhere from 30% to 70% CPU each. We have not stopped or restarted httpd, so that we don't cause further issues without knowing what it's doing.

Re: XI Load warning 6+ hours

Posted: Mon Jan 13, 2020 3:34 pm
by scottwilkerson
Welcome to the forum!

Looking over your profile it appears you have some crashed tables in your database.

Please run the following procedure and let's see if it settles down:
https://assets.nagios.com/downloads/nag ... tabase.pdf

Re: XI Load warning 6+ hours

Posted: Mon Jan 13, 2020 3:58 pm
by AUIInternetOps
Thank you for the direction. Luckily, about 20 minutes after I posted, the alarm finally cleared. I will look over the supplied PDF and work through the defined steps tomorrow, as it's almost COB for us today.

Thanks again.

Re: XI Load warning 6+ hours

Posted: Mon Jan 13, 2020 4:22 pm
by scottwilkerson
AUIInternetOps wrote: Thank you for the direction. Luckily, about 20 minutes after I posted, the alarm finally cleared. I will look over the supplied PDF and work through the defined steps tomorrow, as it's almost COB for us today.

Thanks again.

Sounds good!

Re: XI Load warning 6+ hours

Posted: Tue Jan 14, 2020 9:03 am
by AUIInternetOps
Logged in as root, I am attempting a backup (/usr/local/nagiosxi/scripts/backup_xi.sh) prior to running the table repair, but I am getting the following error:

Code:

"Error backing up MySQL database 'nagios' - check the password in this script!"
I can see "mysqlpass" listed in xi-sys.cfg, and I am able to log in to MariaDB/MySQL via the command line and access the nagios database.

Code:

[root@nagiosxi-1 ~]# mysql -u root -p                               
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 10923456
Server version: 5.5.60-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> use nagios;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
In my KB hunting, I am seeing references to a line with "themysqlpass" in the backup_xi.sh script (though that doc is from 2017), but I do not see that line in our script. Is there another reference to the password that I am missing?

Re: XI Load warning 6+ hours

Posted: Tue Jan 14, 2020 9:19 am
by scottwilkerson
You are not going to be able to take a backup while the DB has crashed tables like this. You will need to do the repair on the database as it is.

Re: XI Load warning 6+ hours

Posted: Tue Jan 14, 2020 9:25 am
by AUIInternetOps
I was wondering about that too. Thanks for the confirmation.

Re: XI Load warning 6+ hours

Posted: Tue Jan 14, 2020 9:44 am
by scottwilkerson
AUIInternetOps wrote: I was also wondering that aspect. Thanks for the confirmation.

No problem.

Let us know if the load continues to be high after the database repair.

Re: XI Load warning 6+ hours

Posted: Wed Jan 15, 2020 8:56 am
by AUIInternetOps
The table repair yesterday finished with no extraneous errors, and there haven't been any new log lines in /var/log/mariadb/mariadb.log since then. But we are once again showing a high load alert for the server (15.83, 14.53, 12.61). Could it be possible we need to just change the default alert values of the check? Other alerts still seem to come in during these high load times, so it does not currently seem to be having an overall negative effect on the server. I will throw in that I've noticed when these high load times occur that the perfdata graphs do not display for the duration, but they show info for the high load time frame after the load subsides/returns to low.

New profile zip from today attached.
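For reference, the standard check_load plugin takes three warning and three critical values (`-w WLOAD1,WLOAD5,WLOAD15 -c CLOAD1,CLOAD5,CLOAD15`) and compares them with the 1-, 5- and 15-minute load averages. The Python sketch below mimics that comparison to show how raising the thresholds would change the result for the load reported above. The threshold triples are illustrative assumptions, not the values configured on this server, and the strict "greater than" comparison is an assumption about the plugin's behavior; `load_status` is a hypothetical helper.

```python
from typing import Sequence

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def load_status(loads: Sequence[float],
                warn: Sequence[float],
                crit: Sequence[float]) -> int:
    """Compare 1/5/15-minute loads pairwise with warning/critical triples,
    in the style of check_load's -w and -c options (assumed: strict >)."""
    if any(load > limit for load, limit in zip(loads, crit)):
        return CRITICAL
    if any(load > limit for load, limit in zip(loads, warn)):
        return WARNING
    return OK


loads = (15.83, 14.53, 12.61)  # load reported in the post above

# Illustrative thresholds only; the server's real settings are not shown here.
print(load_status(loads, warn=(5.0, 4.0, 3.0), crit=(10.0, 6.0, 4.0)))      # 2 (CRITICAL)
print(load_status(loads, warn=(15.0, 14.0, 13.0), crit=(30.0, 25.0, 20.0)))  # 1 (WARNING)
print(load_status(loads, warn=(20.0, 18.0, 16.0), crit=(30.0, 25.0, 20.0)))  # 0 (OK)
```

Raising the thresholds only silences the alert; it does not reduce the load itself.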

Re: XI Load warning 6+ hours

Posted: Wed Jan 15, 2020 9:26 am
by scottwilkerson
AUIInternetOps wrote: Could it be possible we need to just change the default alert values of the check?
Possibly.

I would actually recommend adding more CPUs if this is a virtual environment, as the 4 CPUs are being gobbled up by httpd processes (not sure what else you may be running on this server).
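As a rough rule of thumb, a load average is read relative to the number of CPUs, since it counts tasks that are running or waiting to run (and, on Linux, tasks in uninterruptible sleep). This Python sketch divides the load reported on Jan 15 by the current 4 CPUs and by a hypothetical 8 CPUs; `load_per_cpu` is an illustrative helper, not part of Nagios XI.

```python
import os
from typing import Iterable, Tuple


def load_per_cpu(load_averages: Iterable[float], cpu_count: int) -> Tuple[float, ...]:
    """Divide each load average (1, 5, 15 minutes) by the number of CPUs."""
    return tuple(load / cpu_count for load in load_averages)


# 1/5/15-minute load averages reported on Jan 15 in this thread.
reported = (15.83, 14.53, 12.61)

for cpus in (4, 8):  # 4 = current VM, 8 = hypothetical larger VM
    per_cpu = load_per_cpu(reported, cpus)
    print(cpus, "CPUs:", ", ".join(f"{x:.2f}" for x in per_cpu))

# On a live Unix host the same inputs are available from the standard library.
if hasattr(os, "getloadavg"):
    print("this host:", load_per_cpu(os.getloadavg(), os.cpu_count() or 1))
```

Doubling the CPU count halves the per-CPU figure for the same amount of work, which is the reasoning behind adding CPUs rather than only raising thresholds.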

From the top output:

Code:

13974 apache    20   0  581164  32684   6684 R  83.3  0.2   0:51.65 httpd
18378 apache    20   0  576404  27704   6668 R  61.1  0.2   0:16.45 httpd
20199 apache    20   0  575936  27388   6904 R  61.1  0.2   0:18.90 httpd
17980 apache    20   0  577700  29032   6672 R  55.6  0.2   0:47.17 httpd
 3208 apache    20   0  581532  32972   6716 R  50.0  0.2   0:59.55 httpd
 7607 apache    20   0  584252  35548   6708 R  27.8  0.2   0:42.14 httpd
 3238 apache    20   0       0      0      0 Z  11.1  0.0   0:07.72 httpd
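Assuming top's default Irix mode, where 100% equals one fully busy CPU, the httpd rows above can be added up to see how much of the 4 CPUs they consume. The following Python sketch parses the pasted excerpt; `total_cpu_percent` is a hypothetical helper, and it assumes top's default column order.

```python
# Rows copied from the top excerpt above. Assumed column layout:
# PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
TOP_EXCERPT = """\
13974 apache    20   0  581164  32684   6684 R  83.3  0.2   0:51.65 httpd
18378 apache    20   0  576404  27704   6668 R  61.1  0.2   0:16.45 httpd
20199 apache    20   0  575936  27388   6904 R  61.1  0.2   0:18.90 httpd
17980 apache    20   0  577700  29032   6672 R  55.6  0.2   0:47.17 httpd
 3208 apache    20   0  581532  32972   6716 R  50.0  0.2   0:59.55 httpd
 7607 apache    20   0  584252  35548   6708 R  27.8  0.2   0:42.14 httpd
 3238 apache    20   0       0      0      0 Z  11.1  0.0   0:07.72 httpd
"""


def total_cpu_percent(top_text: str, command: str) -> float:
    """Sum the %CPU field (9th column) of rows whose COMMAND matches."""
    total = 0.0
    for line in top_text.splitlines():
        fields = line.split()
        if len(fields) >= 12 and fields[11] == command:
            total += float(fields[8])
    return total


cpu_count = 4  # from the original post
httpd_pct = total_cpu_percent(TOP_EXCERPT, "httpd")
print(f"httpd total: {httpd_pct:.1f}% = {httpd_pct / 100:.2f} of {cpu_count} CPUs")
# prints "httpd total: 350.0% = 3.50 of 4 CPUs"
```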
AUIInternetOps wrote: I will throw in that I've noticed when these high load times occur that the perfdata graphs do not display for the duration, but they show info for the high load time frame after the load subsides/returns to low.
This can be adjusted. With the default setting, performance data processing stops when the load goes over 10.

Edit /usr/local/nagios/etc/pnp/npcd.cfg and change this:

Code:

load_threshold = 10.0
to this:

Code:

load_threshold = 40.0
Then restart npcd:

Code:

service npcd restart
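The effect of the load_threshold change can be sketched in a few lines of Python. This assumes npcd compares the 1-minute load average against `load_threshold` and keeps processing while the load is at or below it; the exact comparison inside npcd is an assumption here, and `npcd_should_process` is a hypothetical name.

```python
def npcd_should_process(load_1min: float, load_threshold: float) -> bool:
    """Sketch of the rule described above: perfdata is processed only while
    the load average is at or below load_threshold (assumed: 1-minute load)."""
    return load_1min <= load_threshold


current_load = 15.83  # 1-minute load reported on Jan 15

for threshold in (10.0, 40.0):  # original value vs. the suggested value
    state = "processing" if npcd_should_process(current_load, threshold) else "paused"
    print(f"load_threshold = {threshold}: perfdata {state}")
# prints:
# load_threshold = 10.0: perfdata paused
# load_threshold = 40.0: perfdata processing
```

With the load seen in this thread, 10.0 explains the gaps in the graphs during high-load periods, while 40.0 leaves plenty of headroom.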