
System hangs and locks up from scheduled downtimes ends

Posted: Thu Jun 25, 2020 9:04 am
by brucej543
Running Nagios XI 5.6.14. We patch our production Windows servers monthly, and we have a recurring scheduled downtime set up for that host group. The host group, Windows Production servers, has 1088 servers with 6546 services. The Nagios scheduled downtime starts at 2300 and ends at 0430 the next morning. For the second month in a row, when the downtime ends, the Nagios application hangs, something drives the CPU to its maximum, and the only way to clear it is to force a reboot of the server. The system is on a Red Hat server with 30 CPUs and 46GB of memory, which is overkill for normal daily usage. Because this system is a VM, the CPU usage actually causes performance issues on the other systems on the same blade. Attached is our profile.
I already know that I will need to run the repair database script, but I need to provide our management with a reason and a corrective action.

Moderator's Note: The profile has been shared with the support team but has been removed from the public forum.

Re: System hangs and locks up from scheduled downtimes ends

Posted: Thu Jun 25, 2020 11:14 am
by brucej543
I ran a verification on the config files and found 1133 config files that have the error "Warning: Duplicate definition found for service 'CPU Usage' on host ..." Could this be the issue? I reviewed several files but do not see a problem. I will send a PM with the validation report and several config files once someone from support responds.
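
For anyone hitting the same warnings, one way to see which services they cluster around is to tally them from the verification output. This is only a sketch: tally_duplicates is a hypothetical helper name, it relies solely on the "Duplicate definition found for service '...'" wording quoted above, and the binary and config paths in the usage comment are assumed defaults that may differ on your install.

```shell
#!/bin/sh
# tally_duplicates: read verification output on stdin and print
# "<count><TAB><service name>" for every service named in a
# "Duplicate definition found for service '...'" warning, most frequent first.
tally_duplicates() {
    sed -n "s/.*Duplicate definition found for service '\([^']*\)'.*/\1/p" |
        awk '{ count[$0]++ } END { for (s in count) print count[s] "\t" s }' |
        sort -rn
}

# Usage (paths are assumptions, adjust to your install):
#   /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg | tally_duplicates
```

Lines that do not match the warning text are ignored, so the full verification output can be piped in unfiltered.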

Re: System hangs and locks up from scheduled downtimes ends

Posted: Thu Jun 25, 2020 5:00 pm
by tgriep
The max_connections setting for the MySQL database is not large enough, and the connection limit had been reached on the server at the time it hung.
When that happens, it can cause database corruption, high load, and possible loss of data.

To increase it, edit the /etc/my.cnf file and under the following section
[mysqld]
put the following

Code:

max_connections = 1000
open_files_limit = 4096
Save the change and run the following as root to restart the processes, truncate temporary data and repair the database.

Code:

systemctl stop npcd
systemctl stop nagios
systemctl stop ndo2db
systemctl stop crond
pkill -9 -u nagios
echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | mysql -u root -pnagiosxi nagiosxi
mysqlcheck -f -r -u root -pnagiosxi --all-databases --use-frm
systemctl restart mysqld
rm -f /usr/local/nagios/var/rw/nagios.cmd
rm -f /usr/local/nagios/var/nagios.lock
rm -f /var/run/nagios.lock
rm -f /usr/local/nagios/var/ndo.sock
rm -f /usr/local/nagios/var/ndo2db.lock
rm -f /var/lib/mrtg/mrtg_l
rm -f /usr/local/nagiosxi/var/*.lock
for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
pkill python
systemctl restart apache2
systemctl start ndo2db
systemctl start nagios
systemctl start npcd
systemctl start crond
Let us know if this solves the issue.
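
Once the services are back up, it may help to confirm that the new limit took effect and to watch how close the server gets to it around the end of the next downtime. The sketch below is an illustration, not an official Nagios XI tool: pct_of_limit and report_connections are hypothetical helper names, and it assumes the mysql client can authenticate on its own (for example through ~/.my.cnf); otherwise add -u and -p options to the mysql calls.

```shell
#!/bin/sh
# pct_of_limit USED MAX: print how much of the connection limit has been
# consumed, as a whole-number percentage (integer division).
pct_of_limit() {
    echo $(( $1 * 100 / $2 ))
}

# report_connections: read the configured limit and the high-water mark of
# simultaneous connections since the database was last started.
report_connections() {
    max=$(mysql -N -B -e "SHOW VARIABLES LIKE 'max_connections'" | awk '{print $2}')
    used=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections'" | awk '{print $2}')
    if [ -z "$max" ] || [ -z "$used" ]; then
        echo "could not query the database server" >&2
        return 1
    fi
    echo "max_connections=$max max_used=$used ($(pct_of_limit "$used" "$max")% of limit)"
}

# Usage: call report_connections after the restart, and again after the
# next scheduled downtime ends.
```

Max_used_connections resets whenever the database restarts, so a reading taken right after the repair steps will be low; the useful number is the one taken after the next downtime window.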

Re: System hangs and locks up from scheduled downtimes ends

Posted: Thu Jun 25, 2020 5:01 pm
by cdienger
Please PM me the files.

Re: System hangs and locks up from scheduled downtimes ends

Posted: Fri Jun 26, 2020 7:23 am
by brucej543
Thanks for your reply, I will be applying the changes this morning.
Regarding the "Duplicate definition" warnings, I noticed that they all pointed to the same server configuration file. I deleted that server's service and host entries and re-defined them in Nagios, and that cleared all the warning messages. The validation report now looks great, with only a few issues that we need to correct in several monitoring settings.
I will not be uploading any files as that issue has been resolved.

Re: System hangs and locks up from scheduled downtimes ends

Posted: Fri Jun 26, 2020 9:28 am
by brucej543
Performed the steps/commands as instructed, with 2 changes:
1) systemctl restart mysqld gave the error message "Failed to restart mysqld.service: Unit not found." Used systemctl restart mariadb.service instead.
2) systemctl restart apache2 gave the error message "Failed to restart apache2.service: Unit not found." Used systemctl restart httpd.service instead.
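
For anyone scripting these steps across different distributions, the two substitutions above could be handled by picking whichever unit name exists on the host. This is a rough sketch under assumptions: unit_exists and pick_unit are hypothetical helper names, and it assumes that systemctl cat exits non-zero for a unit it cannot find, which is worth checking on your systemd version.

```shell
#!/bin/sh
# unit_exists NAME: succeed when systemd can find a unit file by this name.
unit_exists() {
    systemctl cat "$1" >/dev/null 2>&1
}

# pick_unit NAME...: print the first candidate unit that exists on this
# host; return non-zero when none of them do.
pick_unit() {
    for unit in "$@"; do
        if unit_exists "$unit"; then
            echo "$unit"
            return 0
        fi
    done
    return 1
}

# Usage:
#   systemctl restart "$(pick_unit mysqld mariadb)"
#   systemctl restart "$(pick_unit apache2 httpd)"
```

The candidates are tried in the order given, so list the preferred name first.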

The system is performing with no noted issues. Since the trigger for this issue will not occur again until July 23, you can lock this post, and I will open a new one if it recurs.

Thanks for your help.

Re: System hangs and locks up from scheduled downtimes ends

Posted: Fri Jun 26, 2020 4:12 pm
by cdienger
Thanks for the update. Locking.