Database optimization causing application hangs.
Posted: Wed Jul 21, 2021 3:10 pm
Hello Nagios Support,
I've got an XI server here (5.8.3 on CentOS 7) with 1,200 Hosts and 11,000 Services. Over the last week I have been troubleshooting Nagios UI hangs or freezes. I checked the obvious things like CPU load, memory usage, etc. Nothing stood out besides some iowait spiking into the yellow/red range at times, which is relatively new. We have a high API load against the server, so I revoked all API access to see if it would reduce IO. No real improvement. I had been investigating the iowait when...
Yesterday I signed in to see the server in a bad state, where the DB crashed and needed repair. Not sure what happened. Once repaired (repair_databases.sh), I spent the rest of the day witnessing and troubleshooting debilitating UI hangs and spikes of Host and Service Check Latency upwards of 600 seconds. The system was in a bad state.
I was able to trace the issues back to DB Optimization runs, particularly against the Audit Log, Log Entries, and State History tables, and especially the Audit Log (the /var/lib/mysql/nagiosxi/xi_auditlog.ibd file was over 5 GB). During the Optimization runs, the UI would hang and command processing seemed to stop (the Monitoring Engine Event Queue would get stacked up, then Host/Service Check Latency would report the lag). Optimization was taking about 12 minutes to run on the Audit Log alone.
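For anyone wanting to check the same thing on their own server, here is a minimal Python sketch for listing the largest tablespace files. It assumes file-per-table InnoDB tablespaces (.ibd files) under /var/lib/mysql/nagiosxi, which is the path on my server; the function name largest_tablespaces is only illustrative, and the script needs read access to the MySQL data directory (so likely root).

```python
import os


def largest_tablespaces(data_dir, top=5):
    """Return (filename, size_in_bytes) for the largest .ibd files in data_dir."""
    sizes = []
    for entry in os.scandir(data_dir):
        # Only count per-table tablespace files; skip directories and other files.
        if entry.is_file() and entry.name.endswith(".ibd"):
            sizes.append((entry.name, entry.stat().st_size))
    # Largest first.
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    # Assumption: default data directory for the nagiosxi database.
    data_dir = "/var/lib/mysql/nagiosxi"
    if os.path.isdir(data_dir):
        for name, size in largest_tablespaces(data_dir):
            print(f"{size / 1024**3:6.2f} GiB  {name}")
```

This only reports on-disk file sizes; it does not say how much of each file is live data versus reclaimable space.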
I started to trim back the retention period of some of the tables, when I found some related threads here in the support forum:
https://support.nagios.com/forum/viewto ... 5&p=332540
https://support.nagios.com/forum/viewto ... 4&p=332084
I got more aggressive with reducing the retention period, and ended with the following on these tables:
Audit Log - 90 to 14 days
Log Entries - 90 to 14 days
State History - 720 to 180 days
This certainly helped things, but even with the optimization runs trimmed down to only take 1m 30s, I am still seeing occasional UI hangs and Host/Service Check Latency spiking in the 30-80 second range.
With all of that said:
- Are there any suggested DB tweaks, in Nagios or in MariaDB itself, that might help with the current occasional UI and command-processing hangs?
- Are there any upcoming improvements to the area of app/UI performance, and avoiding Host/Service Check Latency during optimization runs?
- Any consideration being made of a rollback to the old separate ndo2db process which seemed to provide better DB performance? (my previous post about latency/performance issues when ndo2db went away: https://support.nagios.com/forum/viewto ... 16&t=60235 -- I have been seeing that behavior since upgrading off of 5.6.x)
I appreciate any help or pointers.
Thanks,
-marc