Restarting Nagios crashes the database

Posted: Wed May 19, 2021 10:38 am
by paulconca
Hi Team

Every time the Nagios service is restarted, the database crashes. It's been happening for the last few days.
The reason I restarted Nagios is that I noticed the last check time was not updating in the UI, even though there were no errors in the logs.

This is the crash error
210519 15:29:48 [ERROR] mysqld: Table './nagios/nagios_logentries' is marked as crashed and last (automatic?) repair failed

I ran a full repair and it still happens afterwards:
/usr/local/nagiosxi/scripts/repair_databases.sh


I also noticed Nagios does not stop correctly

● nagios.service - Nagios Core 4.4.6
Loaded: loaded (/usr/lib/systemd/system/nagios.service; enabled; vendor preset: disabled)
Active: failed (Result: signal) since Wed 2021-05-19 13:59:17 GMT; 53s ago
Docs: https://www.nagios.org/documentation
Process: 22398 ExecStopPost=/usr/bin/rm -f /usr/local/nagios/var/rw/nagios.cmd (code=exited, status=0/SUCCESS)
Process: 21918 ExecStop=/usr/bin/kill -s TERM ${MAINPID} (code=exited, status=0/SUCCESS)
Process: 10792 ExecStart=/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg (code=exited, status=0/SUCCESS)
Process: 10505 ExecStartPre=/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg (code=exited, status=0/SUCCESS)
Main PID: 10796 (code=killed, signal=KILL)

May 19 13:50:10 **********net nagios[10810]: job 45 (pid=16314): read() returned error 11
May 19 13:57:47 **********.net systemd[1]: Stopping Nagios Core 4.4.6...
May 19 13:57:47 **********.net nagios[10796]: Caught SIGTERM, shutting down...
May 19 13:57:47 **********.net nagios[10796]: Caught SIGTERM, shutting down...
May 19 13:57:47 **********.net nagios[10922]: Caught SIGTERM, shutting down...
May 19 13:59:17 **********.net systemd[1]: nagios.service stop-sigterm timed out. Killing.
May 19 13:59:17 **********.net systemd[1]: nagios.service: main process exited, code=killed, status=9/KILL
May 19 13:59:17 **********.net systemd[1]: Stopped Nagios Core 4.4.6.
May 19 13:59:17 **********.net systemd[1]: Unit nagios.service entered failed state.
May 19 13:59:17 **********.net systemd[1]: nagios.service failed.


Operating system
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Nagios version
Nagios XI 5.8.3


Thank you

Re: Restarting Nagios crashes the database

Posted: Wed May 19, 2021 4:08 pm
by dchurch
Can you clarify how you're restarting Nagios? Are you ...
- Rebooting the server?
- Or typing "service nagios restart"?
- Or are you restarting it through the Nagios XI web interface?
- Or are you using the Apply Config page?

Re: Restarting Nagios crashes the database

Posted: Thu May 20, 2021 8:53 am
by paulconca
So I did a repair this morning.

/usr/local/nagiosxi/scripts/repair_databases.sh

I made some changes and applied them (that worked fine). It was fine for an hour, but then everything stopped processing:
- nothing being written to the nagios.log
- no errors in the mariadb.log


I restarted Nagios via the UI. Nothing changed: no errors, and still nothing processing.
Then I ran "service nagios restart" in the terminal, and everything started processing again. No errors this time.

It had not been processing for 45 minutes, and mysqld was showing 100% CPU in top.

I am just waiting to see if it happens again now!

Re: Restarting Nagios crashes the database

Posted: Thu May 20, 2021 9:45 am
by paulconca
It's happening again (attached).


Is there any way to check what mysqld is doing? Could it be the database maintenance?
I have let it sit for 40 minutes.

Re: Restarting Nagios crashes the database

Posted: Thu May 20, 2021 12:05 pm
by dchurch
What's the output from the following commands?

Code:

mysql -uroot -pnagiosxi --table <<< "select * from (select table_name, round(((data_length + index_length) / 1024 / 1024), 2) as sz from information_schema.tables where table_schema like 'nagios%') as x order by x.sz;"
mysql -uroot -pnagiosxi -e "show full processlist;"

Re: Restarting Nagios crashes the database

Posted: Thu May 20, 2021 12:17 pm
by paulconca
I did a reboot ("shutdown -h now") and it fixed it.

But here is the output

mysql -uroot -p --table <<< "select * from (select table_name, round(((data_length + index_length) / 1024 / 1024), 2) as sz from information_schema.tables where table_schema like 'nagios%') as x order by x.sz;"
+--------------------------------------------+---------+
| table_name | sz |
+--------------------------------------------+---------+
| nagios_serviceescalation_contacts | 0.00 |
| nagios_hostescalations | 0.00 |
| nagios_servicedependencies | 0.00 |
| nagios_hostescalation_contactgroups | 0.00 |
| nagios_hostescalation_contacts | 0.00 |
| nagios_contactgroups | 0.00 |
| nagios_service_parentservices | 0.00 |
| nagios_hostdependencies | 0.00 |
| nagios_host_parenthosts | 0.00 |
| nagios_runtimevariables | 0.00 |
| nagios_timedevents | 0.00 |
| nagios_programstatus | 0.00 |
| nagios_timedeventqueue | 0.00 |
| nagios_instances | 0.00 |
| nagios_dbversion | 0.00 |
| nagios_serviceescalations | 0.01 |
| nagios_contactstatus | 0.01 |
| nagios_hostgroups | 0.01 |
| nagios_serviceescalation_contactgroups | 0.01 |
| nagios_contactgroup_members | 0.01 |
| nagios_contact_addresses | 0.01 |
| nagios_timeperiods | 0.01 |
| nagios_configfilevariables | 0.01 |
| nagios_configfiles | 0.01 |
| nagios_eventhandlers | 0.01 |
| nagios_servicegroups | 0.01 |
| tbl_lnkServicedependencyToServicegroup_DS | 0.02 |
| tbl_lnkHostToContactgroup | 0.02 |
| tbl_lnkTimeperiodToTimeperiod | 0.02 |
| tbl_lnkHosttemplateToHost | 0.02 |
| tbl_lnkContacttemplateToVariabledefinition | 0.02 |
| tbl_lnkServicedependencyToService_S | 0.02 |
| tbl_lnkServicetemplateToVariabledefinition | 0.02 |
| tbl_lnkHosttemplateToContactgroup | 0.02 |
| tbl_lnkServicedependencyToService_DS | 0.02 |
| nagios_contacts | 0.02 |
| tbl_lnkServicetemplateToServicetemplate | 0.02 |
| tbl_lnkHosttemplateToContact | 0.02 |
| tbl_lnkContacttemplateToContacttemplate | 0.02 |
| tbl_lnkServicetemplateToServicegroup | 0.02 |
| tbl_lnkHostgroupToHostgroup | 0.02 |
| tbl_submenu | 0.02 |
| tbl_lnkContacttemplateToContactgroup | 0.02 |
| tbl_lnkServicedependencyToHostgroup_H | 0.02 |
| tbl_lnkHostescalationToHostgroup | 0.02 |
| tbl_lnkServicetemplateToHostgroup | 0.02 |
| tbl_lnkContacttemplateToCommandService | 0.02 |
| tbl_lnkServicedependencyToHostgroup_DH | 0.02 |
| tbl_lnkHostescalationToHost | 0.02 |
| tbl_lnkServicetemplateToHost | 0.02 |
| tbl_session_locks | 0.02 |
| tbl_lnkContacttemplateToCommandHost | 0.02 |
| tbl_lnkServicedependencyToHost_H | 0.02 |
| xi_deploy_jobs | 0.02 |
| tbl_session | 0.02 |
| tbl_lnkContactgroupToContactgroup | 0.02 |
| tbl_lnkServicedependencyToHost_DH | 0.02 |
| xi_deploy_agents | 0.02 |
| tbl_lnkHostescalationToContactgroup | 0.02 |
| tbl_lnkServicetemplateToContactgroup | 0.02 |
| tbl_lnkContactgroupToContact | 0.02 |
| nagios_contact_notificationcommands | 0.02 |
| xi_commands | 0.02 |
| tbl_lnkHostescalationToContact | 0.02 |
| tbl_lnkServicetemplateToContact | 0.02 |
| tbl_lnkServiceToServicegroup | 0.02 |
| tbl_lnkContactToVariabledefinition | 0.02 |
| tbl_lnkHostdependencyToHostgroup_H | 0.02 |
| tbl_lnkServicegroupToServicegroup | 0.02 |
| tbl_lnkHostdependencyToHostgroup_DH | 0.02 |
| tbl_lnkServiceToHostgroup | 0.02 |
| tbl_lnkContactToContacttemplate | 0.02 |
| tbl_lnkServiceescalationToService | 0.02 |
| tbl_lnkHostdependencyToHost_H | 0.02 |
| tbl_lnkServiceescalationToServicegroup | 0.02 |
| nagios_host_contactgroups | 0.02 |
| tbl_lnkContactToContactgroup | 0.02 |
| tbl_lnkServiceescalationToHostgroup | 0.02 |
| tbl_lnkHostdependencyToHost_DH | 0.02 |
| tbl_lnkContactToCommandService | 0.02 |
| tbl_lnkContactToCommandHost | 0.02 |
| tbl_lnkServiceescalationToHost | 0.02 |
| tbl_lnkHosttemplateToVariabledefinition | 0.02 |
| tbl_permission_inactive | 0.02 |
| tbl_lnkServiceescalationToContactgroup | 0.02 |
| xi_cmp_ccm_backups | 0.02 |
| tbl_mainmenu | 0.02 |
| tbl_lnkHosttemplateToHosttemplate | 0.02 |
| nagios_commands | 0.02 |
| tbl_lnkServiceescalationToContact | 0.02 |
| tbl_lnkHostToHostgroup | 0.02 |
| tbl_lnkHosttemplateToHostgroup | 0.02 |
| tbl_lnkServicedependencyToServicegroup_S | 0.02 |
| tbl_lnkHostToHost | 0.02 |
| tbl_logbook | 0.02 |
| tbl_user | 0.03 |
| xi_sessions | 0.03 |
| tbl_hostescalation | 0.03 |
| tbl_timeperiod | 0.03 |
| tbl_hostdependency | 0.03 |
| tbl_domain | 0.03 |
| nagios_contactnotifications | 0.03 |
| xi_eventqueue | 0.03 |
| tbl_contacttemplate | 0.03 |
| tbl_settings | 0.03 |
| nagios_contactnotificationmethods | 0.03 |
| tbl_contactgroup | 0.03 |
| tbl_servicetemplate | 0.03 |
| tbl_servicegroup | 0.03 |
| xi_cmp_trapdata_log | 0.03 |
| xi_cmp_trapdata | 0.03 |
| tbl_serviceextinfo | 0.03 |
| tbl_serviceescalation | 0.03 |
| tbl_servicedependency | 0.03 |
| xi_cmp_favorites | 0.03 |
| tbl_hostgroup | 0.03 |
| tbl_hosttemplate | 0.03 |
| xi_auth_tokens | 0.03 |
| tbl_hostextinfo | 0.03 |
| xi_sysstat | 0.03 |
| nagios_timeperiod_timeranges | 0.04 |
| nagios_systemcommands | 0.04 |
| xi_mibs | 0.05 |
| tbl_lnkHostgroupToHost | 0.05 |
| nagios_host_contacts | 0.05 |
| tbl_lnkHostToVariabledefinition | 0.05 |
| tbl_lnkHostToContact | 0.06 |
| tbl_timedefinition | 0.06 |
| tbl_contact | 0.06 |
| xi_cmp_scheduledreports_log | 0.06 |
| tbl_lnkHostToHosttemplate | 0.06 |
| nagios_hostgroup_members | 0.07 |
| tbl_command | 0.08 |
| xi_users | 0.08 |
| xi_options | 0.09 |
| nagios_scheduleddowntime | 0.12 |
| tbl_lnkServiceToVariabledefinition | 0.14 |
| tbl_lnkServiceToHost | 0.14 |
| nagios_acknowledgements | 0.14 |
| tbl_lnkServiceToServicetemplate | 0.16 |
| tbl_info | 0.17 |
| nagios_flappinghistory | 0.19 |
| tbl_lnkServicegroupToService | 0.20 |
| tbl_lnkServiceToContactgroup | 0.20 |
| nagios_externalcommands | 0.20 |
| nagios_hosts | 0.24 |
| nagios_hostchecks | 0.24 |
| nagios_service_contactgroups | 0.25 |
| nagios_comments | 0.26 |
| tbl_variabledefinition | 0.28 |
| tbl_host | 0.31 |
| nagios_hoststatus | 0.32 |
| nagios_customvariables | 0.36 |
| nagios_customvariablestatus | 0.38 |
| nagios_conninfo | 0.40 |
| nagios_servicegroup_members | 0.44 |
| tbl_lnkServiceToContact | 0.47 |
| xi_events | 0.48 |
| nagios_service_contacts | 0.68 |
| nagios_services | 0.80 |
| nagios_objects | 1.08 |
| nagios_processevents | 1.38 |
| xi_cmp_nagiosbpi_backups | 1.52 |
| tbl_service | 1.52 |
| tbl_permission | 2.02 |
| nagios_downtimehistory | 3.31 |
| xi_usermeta | 3.91 |
| nagios_servicestatus | 8.04 |
| xi_meta | 12.36 |
| nagios_servicechecks | 19.06 |
| xi_auditlog | 19.53 |
| nagios_notifications | 22.48 |
| nagios_commenthistory | 47.63 |
| nagios_statehistory | 60.71 |
| nagios_logentries | 6698.90 |
+--------------------------------------------+---------+
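With a listing this long, the large tables are easy to overlook. A minimal sketch of a filter that reads the `--table` output above on stdin and prints only tables at or above a size threshold in MB. The function name `big_tables` and the default threshold of 100 are made up for illustration and are not part of Nagios XI:

```shell
# Hypothetical helper: print tables at or above a size threshold (in MB)
# from `mysql --table` output read on stdin.
big_tables() {
    threshold="${1:-100}"
    awk -F'|' -v min="$threshold" '
        # Data rows split into 4 fields: empty, name, size, empty
        NF >= 4 {
            name = $2; size = $3
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", size)
            # Skip the header row and anything that is not a number
            if (size ~ /^[0-9]+(\.[0-9]+)?$/ && size + 0 >= min + 0)
                print name, size
        }'
}
```

Piping the size query (run with --table, as above) into `big_tables 100` should leave only nagios_logentries for this system.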


mysql -uroot -p -e "show full processlist;"
+-------+----------+-----------+----------+---------+------+-------+-----------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+-------+----------+-----------+----------+---------+------+-------+-----------------------+----------+
| 19426 | ndoutils | localhost | nagios | Sleep | 0 | | NULL | 0.000 |
| 19427 | ndoutils | localhost | nagios | Sleep | 1 | | NULL | 0.000 |
| 48910 | nagiosxi | localhost | nagiosxi | Sleep | 5 | | NULL | 0.000 |
| 48911 | ndoutils | localhost | nagios | Sleep | 5 | | NULL | 0.000 |
| 48912 | nagiosql | localhost | nagiosql | Sleep | 5 | | NULL | 0.000 |
| 49011 | nagiosxi | localhost | nagiosxi | Sleep | 9 | | NULL | 0.000 |
| 49012 | ndoutils | localhost | nagios | Sleep | 9 | | NULL | 0.000 |
| 49013 | nagiosql | localhost | nagiosql | Sleep | 9 | | NULL | 0.000 |
| 49101 | nagiosxi | localhost | nagiosxi | Sleep | 5 | | NULL | 0.000 |
| 49102 | ndoutils | localhost | nagios | Sleep | 5 | | NULL | 0.000 |
| 49103 | nagiosql | localhost | nagiosql | Sleep | 5 | | NULL | 0.000 |
| 49106 | nagiosxi | localhost | nagiosxi | Sleep | 8 | | NULL | 0.000 |
| 49108 | ndoutils | localhost | nagios | Sleep | 8 | | NULL | 0.000 |
| 49109 | nagiosql | localhost | nagiosql | Sleep | 8 | | NULL | 0.000 |
| 49110 | nagiosxi | localhost | nagiosxi | Sleep | 0 | | NULL | 0.000 |
| 49111 | ndoutils | localhost | nagios | Sleep | 0 | | NULL | 0.000 |
| 49112 | nagiosql | localhost | nagiosql | Sleep | 0 | | NULL | 0.000 |
| 49113 | nagiosxi | localhost | nagiosxi | Sleep | 38 | | NULL | 0.000 |
| 49114 | ndoutils | localhost | nagios | Sleep | 38 | | NULL | 0.000 |
| 49115 | nagiosql | localhost | nagiosql | Sleep | 38 | | NULL | 0.000 |
| 49116 | nagiosxi | localhost | nagiosxi | Sleep | 7 | | NULL | 0.000 |
| 49117 | ndoutils | localhost | nagios | Sleep | 49 | | NULL | 0.000 |
| 49118 | nagiosql | localhost | nagiosql | Sleep | 49 | | NULL | 0.000 |
| 49119 | nagiosxi | localhost | nagiosxi | Sleep | 8 | | NULL | 0.000 |
| 49120 | ndoutils | localhost | nagios | Sleep | 49 | | NULL | 0.000 |
| 49121 | nagiosql | localhost | nagiosql | Sleep | 49 | | NULL | 0.000 |
| 49123 | nagiosxi | localhost | nagiosxi | Sleep | 1 | | NULL | 0.000 |
| 49124 | ndoutils | localhost | nagios | Sleep | 49 | | NULL | 0.000 |
| 49125 | nagiosql | localhost | nagiosql | Sleep | 49 | | NULL | 0.000 |
| 49131 | nagiosxi | localhost | nagiosxi | Sleep | 3 | | NULL | 0.000 |
| 49132 | ndoutils | localhost | nagios | Sleep | 8 | | NULL | 0.000 |
| 49133 | nagiosql | localhost | nagiosql | Sleep | 49 | | NULL | 0.000 |
| 49134 | nagiosxi | localhost | nagiosxi | Sleep | 0 | | NULL | 0.000 |
| 49135 | ndoutils | localhost | nagios | Sleep | 48 | | NULL | 0.000 |
| 49136 | nagiosql | localhost | nagiosql | Sleep | 48 | | NULL | 0.000 |
| 49140 | nagiosxi | localhost | nagiosxi | Sleep | 0 | | NULL | 0.000 |
| 49141 | ndoutils | localhost | nagios | Sleep | 48 | | NULL | 0.000 |
| 49142 | nagiosql | localhost | nagiosql | Sleep | 48 | | NULL | 0.000 |
| 49155 | nagiosxi | localhost | nagiosxi | Sleep | 9 | | NULL | 0.000 |
| 49156 | ndoutils | localhost | nagios | Sleep | 9 | | NULL | 0.000 |
| 49157 | nagiosql | localhost | nagiosql | Sleep | 9 | | NULL | 0.000 |
| 49158 | nagiosxi | localhost | nagiosxi | Sleep | 13 | | NULL | 0.000 |
| 49159 | ndoutils | localhost | nagios | Sleep | 13 | | NULL | 0.000 |
| 49160 | nagiosql | localhost | nagiosql | Sleep | 13 | | NULL | 0.000 |
| 49161 | nagiosxi | localhost | nagiosxi | Sleep | 4 | | NULL | 0.000 |
| 49162 | ndoutils | localhost | nagios | Sleep | 4 | | NULL | 0.000 |
| 49163 | nagiosql | localhost | nagiosql | Sleep | 5 | | NULL | 0.000 |
| 49168 | nagiosxi | localhost | nagiosxi | Sleep | 2 | | NULL | 0.000 |
| 49169 | ndoutils | localhost | nagios | Sleep | 2 | | NULL | 0.000 |
| 49170 | nagiosql | localhost | nagiosql | Sleep | 2 | | NULL | 0.000 |
| 49171 | nagiosxi | localhost | nagiosxi | Sleep | 17 | | NULL | 0.000 |
| 49172 | ndoutils | localhost | nagios | Sleep | 17 | | NULL | 0.000 |
| 49173 | nagiosql | localhost | nagiosql | Sleep | 17 | | NULL | 0.000 |
| 49174 | nagiosxi | localhost | nagiosxi | Sleep | 3 | | NULL | 0.000 |
| 49175 | ndoutils | localhost | nagios | Sleep | 3 | | NULL | 0.000 |
| 49176 | nagiosql | localhost | nagiosql | Sleep | 3 | | NULL | 0.000 |
| 49177 | nagiosxi | localhost | nagiosxi | Sleep | 3 | | NULL | 0.000 |
| 49178 | ndoutils | localhost | nagios | Sleep | 3 | | NULL | 0.000 |
| 49179 | nagiosql | localhost | nagiosql | Sleep | 3 | | NULL | 0.000 |
| 49184 | root | localhost | NULL | Query | 0 | NULL | show full processlist | 0.000 |
+-------+----------+-----------+----------+---------+------+-------+-----------------------+----------+
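Since almost every connection here is idle, it can help to hide the Sleep rows when looking for whatever is keeping mysqld busy. A rough sketch, assuming the table layout shown above, where Command is the fifth column. The function name `active_connections` is hypothetical:

```shell
# Hypothetical helper: keep the header and any connection that is not idle.
# Reads `show full processlist` table output on stdin.
active_connections() {
    awk -F'|' '
        # Field 6 is the Command column (field 1 is the empty text before the first pipe)
        NF >= 10 {
            cmd = $6
            gsub(/^[ \t]+|[ \t]+$/, "", cmd)
            if (cmd != "Sleep") print
        }'
}
```

Note that mysql switches to tab-separated output when piped, so the processlist command would need the --table option added for this filter to see the pipe-delimited layout.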

Re: Restarting Nagios crashes the database

Posted: Thu May 20, 2021 5:19 pm
by ssax
Please go to Admin > Performance Settings and set ALL THREE Optimize Intervals to 300 and click the Update Settings button. See if that alleviates it.

What is the output of this command?

Code:

sar

Given the size of your nagios_logentries table, that could be having an impact:

Code:

| nagios_logentries | 6698.90 |

You may want to clean it up to reduce its size. See below for a FAQ I wrote on this:

FAQ: Can I truncate the tables first before proceeding with database repair (if I have crashed tables)?

You can truncate before repairing the DB, it's up to you. If you want to back it up first, you'll need to repair it. If you don't care, or already have a backup, truncate it first as it will speed up the DB repair process.
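If you go the back-up-first route, something like the following could dump a single table once it has been repaired. This is only a sketch: `backup_table` and the output path are made-up names, and the credentials are assumed to be the same ones used in the commands in this FAQ.

```shell
# Hypothetical helper: dump one table from the nagios database to a file.
# Assumes the table has already been repaired, since a crashed table
# cannot be read for a backup.
backup_table() {
    table="$1"
    outfile="$2"
    mysqldump -uroot -pnagiosxi -h 127.0.0.1 nagios "$table" > "$outfile"
}
```

For example, `backup_table nagios_logentries /path/to/nagios_logentries.sql`, where the path is a placeholder for somewhere with enough free space to hold the dump.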

NOTE: You may need to adjust the -h 127.0.0.1, -uroot, and -pnagiosxi options in the commands if your database is hosted on a different server and/or you've changed the root MySQL password.

If you don't care about the data, or already have a backup, you can just truncate the tables, which essentially drops and recreates each table with zero data in it (removing all historical data for the respective reports):

nagios_logentries - Impacts Event Log report length

Code:

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_logentries;'

nagios_statehistory - Impacts the State History report length

Code:

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_statehistory;'

nagios_notifications - Impacts the Notifications report length

Code:

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_notifications;'

These should technically work to clean up the DB tables manually, provided the tables aren't crashed. If they ARE crashed, you will need to repair the database FIRST in order to run these queries:

nagios_logentries - Impacts Event Log report length

Code:

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_logentries WHERE logentry_time <= (NOW() - INTERVAL 6 MONTH);'

nagios_statehistory - Impacts the State History report length

Code:

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_statehistory WHERE state_time <= (NOW() - INTERVAL 6 MONTH);'

nagios_notifications - Impacts the Notifications report length

Code:

mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_notifications WHERE start_time <= (NOW() - INTERVAL 6 MONTH);'

Then go to Admin > Performance Settings > Databases tab and adjust ALL of the retention intervals to match your business data retention policy. These settings control retention on those DB tables and will keep them cleaned up.

I would lower them to the smallest possible level and utilize the XI backup/restore process and the Admin > Scheduled Backups process to offload the backups to another server. Since these XI backups contain database backups you can spin them up to grab the data and report on them if needed.
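Before running the DELETE on nagios_logentries above, it may be worth checking how many rows it would remove, and rebuilding the table afterwards. A minimal sketch, assuming the same credentials as the commands above and assuming the table is MyISAM (which the "marked as crashed" error suggests), where deleted rows do not shrink the data file until the table is optimized. The function names are hypothetical:

```shell
# Hypothetical helpers; credentials match the commands earlier in this post.
run_sql() {
    mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e "$1"
}

# Count the rows the 6-month DELETE would remove, without deleting anything.
count_old_logentries() {
    run_sql 'SELECT COUNT(*) FROM nagios_logentries WHERE logentry_time <= (NOW() - INTERVAL 6 MONTH);'
}

# After a large DELETE, rebuild the table so the data file can shrink.
# Assumption: MyISAM table; expect this to take a while on a multi-GB table.
optimize_logentries() {
    run_sql 'OPTIMIZE TABLE nagios_logentries;'
}
```

Running `count_old_logentries` first gives a feel for how much the DELETE will remove, and `optimize_logentries` would be run once the DELETE has finished.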