nagios_downtimehistory is marked as crashed

Posted: Thu Jul 23, 2020 10:53 am
by brucej543
Hi. We are running Nagios XI 5.6.14. The system is crashing with MySQL failures while trying to update the table './nagios/nagios_downtimehistory'. We have a large number of servers (800+) in a recurring downtime. Each time the downtime starts or stops, we get more than 18,900 errors in /var/log/mariadb/mariadb.log and /var/log/messages. When this happens, the system uses all of its memory and crashes.
The temporary fix is to rerun the database repair script. Based on past occurrences, it seems that every time a downtime starts or ends for a server, an update to /nagios/nagios_downtimehistory fails.

As a side note, this issue also occurred last month, and I was asked to add max_connections = 1000 and open_files_limit = 4096 to the /etc/my.cnf file. I had to remove these entries because, when the system was restarted after it crashed, it would run out of memory within 10 minutes, causing a hard reboot.

Below are samples of the errors:

From the mariadb.log file:
200723 11:30:37 [ERROR] mysqld: Table './nagios/nagios_downtimehistory' is marked as crashed and last (automatic?) repair failed.
From the messages log file:
Jul 22 23:02:34 bcnagios01 ndo2db: mysql_error: 'Table './nagios/nagios_downtimehistory' is marked as crashed and last (automatic?) repair failed'
Jul 22 23:02:35 bcnagios01 ndo2db: Error: mysql_query() failed for 'UPDATE nagios_downtimehistory SET actual_start_time=FROM_UNIXTIME(1595473200), actual_start_time_usec='952661', was_started='1' WHERE instance_id='1' AND downtime_type='1' AND object_id='6263' AND entry_time=FROM_UNIXTIME(1594954867) AND scheduled_start_time=FROM_UNIXTIME(1595473200) AND scheduled_end_time=FROM_UNIXTIME(1595493000)'
Jul 22 23:02:35 bcnagios01 ndo2db: mysql_error: 'Table './nagios/nagios_downtimehistory' is marked as crashed and last (automatic?) repair failed'
Jul 22 23:02:35 bcnagios01 ndo2db: Error: mysql_query() failed for 'UPDATE nagios_downtimehistory SET actual_start_time=FROM_UNIXTIME(1595473200), actual_start_time_usec='961094', was_started='1' WHERE instance_id='1' AND downtime_type='2' AND object_id='10606' AND entry_time=FROM_UNIXTIME(1594954872) AND scheduled_start_time=FROM_UNIXTIME(1595473200) AND

Re: nagios_downtimehistory is marked as crashed

Posted: Thu Jul 23, 2020 1:00 pm
by tgriep
I suspect that when the memory is fully used, the out-of-memory (OOM) killer in the Linux kernel kills the MySQL database process, since it is using the most memory, and that prevents the database repair from finishing.
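
One way to check that suspicion is to look for OOM-killer entries in the kernel log. The following is only a minimal sketch: it assumes kernel messages are written to /var/log/messages (as in the excerpts quoted above) and that the kernel uses its usual "Killed process <pid> (<name>)" wording. The function name oom_victims is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every process the kernel OOM killer
# has killed, one per line, from a syslog-style file.
# Assumes the usual kernel wording "Killed process <pid> (<name>) ...";
# adjust the pattern if your kernel logs it differently.
oom_victims() {
    logfile="$1"
    # -n suppresses normal output; the trailing p prints only lines where
    # the substitution matched, reduced to the process name in parentheses.
    sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' "$logfile"
}

# Usage (as root): oom_victims /var/log/messages | sort | uniq -c
```

If mysqld shows up in that list around the times the table is marked as crashed, that would support the theory. If nothing is listed, the cause is probably elsewhere.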

Run these commands to stop the processes, clean and repair the SQL database, and restart the processes. Run them all as root and post all of the output.

Code:

service npcd stop
service nagios stop
service ndo2db stop
service crond stop
pkill -9 -u nagios
echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | mysql -u root -pnagiosxi nagiosxi
mysqlcheck -f -r -u root -pnagiosxi --all-databases --use-frm
if grep --quiet pgsql /usr/local/nagiosxi/html/config.inc.php; then service postgresql stop; fi;
service mysqld restart
rm -f /usr/local/nagios/var/rw/nagios.cmd
rm -f /usr/local/nagios/var/nagios.lock
rm -f /var/run/nagios.lock
rm -f /usr/local/nagios/var/ndo.sock
rm -f /usr/local/nagios/var/ndo2db.lock
rm -f /var/lib/mrtg/mrtg_l
rm -f /usr/local/nagiosxi/var/*.lock
for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
pkill python
if grep --quiet pgsql /usr/local/nagiosxi/html/config.inc.php; then service postgresql start; fi;
service httpd restart
service ndo2db start
service nagios start
service npcd start
service crond start
Hopefully that will free up enough memory to allow the repair to finish.
Let us know if this does repair the database.

When it is done, run the following as root and attach the /tmp/info.txt file to your reply so we can get some stats from the server.

Code:

mysql -u root -pnagiosxi -e "show global status like '%used_connections%'; show variables like 'max_connections';" >/tmp/info.txt
echo "SELECT table_schema as 'Database', table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES ORDER BY (data_length + index_length) DESC;" |mysql -t -u root -pnagiosxi >>/tmp/info.txt
top -b -n 1 >>/tmp/info.txt
df -h >>/tmp/info.txt
df -i >>/tmp/info.txt
ps aux >>/tmp/info.txt
Thanks

Re: nagios_downtimehistory is marked as crashed

Posted: Fri Jul 24, 2020 6:59 am
by brucej543
Attached is the requested information.

Re: nagios_downtimehistory is marked as crashed

Posted: Fri Jul 24, 2020 11:26 am
by tgriep
Thanks for the files. All of the commands ran to completion and no errors were generated.
Is the system behaving better now and not generating the errors in the messages file?

Re: nagios_downtimehistory is marked as crashed

Posted: Fri Jul 24, 2020 1:45 pm
by brucej543
We are still getting the errors below in mariadb.log:
200724 9:55:04 [ERROR] mysqld: Table './nagios/nagios_downtimehistory' is marked as crashed and last (automatic?) repair failed
200724 11:00:04 [ERROR] mysqld: Table './nagios/nagios_downtimehistory' is marked as crashed and last (automatic?) repair failed
200724 12:05:17 [ERROR] mysqld: Table './nagios/nagios_downtimehistory' is marked as crashed and last (automatic?) repair failed
200724 13:10:18 [ERROR] mysqld: Table './nagios/nagios_downtimehistory' is marked as crashed and last (automatic?) repair failed
200724 14:15:04 [ERROR] mysqld: Table './nagios/nagios_downtimehistory' is marked as crashed and last (automatic?) repair failed

Re: nagios_downtimehistory is marked as crashed

Posted: Fri Jul 24, 2020 2:15 pm
by brucej543
I just ran another database repair. We will see if the errors show up again. I have not seen any errors in the messages file since I ran the steps you provided. I will keep monitoring the messages file as well.

Re: nagios_downtimehistory is marked as crashed

Posted: Fri Jul 24, 2020 3:11 pm
by tgriep
Let us know what you find out.

Re: nagios_downtimehistory is marked as crashed

Posted: Mon Jul 27, 2020 12:13 pm
by brucej543
Ran another database repair at 0700 this morning after finding '/nagios/nagios_logentries' is marked as crashed in the DB log file.
Two hours later, we are now getting "200727 9:20:04 [ERROR] mysqld: Table './nagios/nagios_downtimehistory' is marked as crashed and last (automatic?) repair failed" errors, which are occurring every 65 minutes. See the attached file.
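
The spacing between these errors can be measured from the log itself rather than read off by eye. This is a rough sketch, assuming the "YYMMDD H:MM:SS [ERROR] ..." line format shown in the excerpts in this thread. The function name crash_gaps is hypothetical, and for simplicity it only compares entries that share the same date.

```shell
#!/bin/sh
# Hypothetical helper: print the gap, in whole minutes, between consecutive
# "marked as crashed" errors for one table in a MariaDB error log.
# Assumes lines look like "200724  9:55:04 [ERROR] mysqld: Table ...".
# Gaps that span midnight are skipped instead of being computed.
crash_gaps() {
    table="$1"
    logfile="$2"
    grep -F "$table' is marked as crashed" "$logfile" |
    awk '{
        split($2, t, ":")                      # $2 is H:MM:SS
        now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
        if ($1 == day) print int((now - prev) / 60)
        day = $1; prev = now
    }'
}

# Usage: crash_gaps nagios_downtimehistory /var/log/mariadb/mariadb.log
```

A burst of errors logged within the same minute shows up as a run of 0 values. A steady gap between bursts would suggest something scheduled is touching the table at that interval.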

Re: nagios_downtimehistory is marked as crashed

Posted: Mon Jul 27, 2020 12:50 pm
by brucej543
Just had 64 errors logged at 13:44 in the DB log, all for nagios_downtimehistory. Running the DB repair again.
Please help us resolve this issue. We can't keep running this repair job multiple times a day.

Re: nagios_downtimehistory is marked as crashed

Posted: Mon Jul 27, 2020 1:26 pm
by tgriep
Let's increase the MySQL max_connections setting to see if that resolves the issue.
See this article for instructions.
https://support.nagios.com/kb/article/n ... s-513.html

If the max_connections limit is hit, it will cause database corruption, and this may be what is causing the issue on your server.
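
Before changing the setting, it may help to confirm whether the limit has actually been reached. The sketch below assumes the tab-separated "name, value" rows that the mysql client prints when its output is piped. The function name conn_headroom is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads "name<TAB>value" rows on stdin (the batch
# output of SHOW GLOBAL STATUS / SHOW VARIABLES) and reports whether the
# connection high-water mark has reached the configured limit.
conn_headroom() {
    awk -F'\t' '
        tolower($1) == "max_used_connections" { used = $2 }
        tolower($1) == "max_connections"      { max  = $2 }
        END {
            if (used == "" || max == "") { print "missing values"; exit 2 }
            if (used + 0 >= max + 0) print "limit reached: " used "/" max
            else                     print "headroom: " used "/" max
        }'
}

# Usage (as root, prompts for the password):
#   mysql -u root -p -e "show global status like 'Max_used_connections'; show variables like 'max_connections';" | conn_headroom
```

Max_used_connections is the highest number of simultaneous connections since the server last started, so run this after the errors have recurred, not right after a restart.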