In addition to my previous message (
please read it and send me the output), pay close attention to this:
Because of the size of some of your database tables I recommend you make these changes:
Please go to Admin > Performance Settings > Databases tab:
- Update all 3 Optimize Intervals to 300
- Click the Update Settings button
Making that change should prevent an issue where one DB optimize hasn't finished when the next one starts, that can cause crashing of DB tables, increasing this timeout should alleviate that.
FAQ: Can I truncate the tables first before proceeding with database repair (if I have crashed tables)?
You can truncate before repairing the DB, it's up to you. If you want to back it up first, you'll need to repair it. If you don't care, or already have a backup, truncate it first as it will speed up the DB repair process.
NOTE: You may need to adjust the -h 127.0.0.1, the -uroot, and -pnagiosxi in the commands if your DB is housed/stored/offloaded/contained on a different server and/or you've changed the root mysql password
If you don't care about the data, or already have a backup, you can just truncate the tables which will essentially drop and recreate the table with zero data in it (removing all historical data for the respective reports):
Code: Select all
nagios_logentries - Impacts Event Log report length
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_logentries;'
nagios_statehistory - Impacts the State History report length
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_statehistory;'
nagios_notifications - Impacts the Notifications report length
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'TRUNCATE TABLE nagios_notifications;'
These should technically work to clean the DB tables up manually (if the tables aren't crashed, if they ARE crashed, you will need to repair the database FIRST in order to run these queries):
Code: Select all
nagios_logentries - Impacts Event Log report length
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_logentries WHERE logentry_time <= (NOW() - INTERVAL 6 MONTH);'
nagios_statehistory - Impacts the State History report length
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_statehistory WHERE state_time <= (NOW() - INTERVAL 6 MONTH);'
nagios_notifications - Impacts the Notifications report length
mysql -uroot -pnagiosxi -h 127.0.0.1 -B nagios -e 'DELETE FROM nagios_notifications WHERE start_time <= (NOW() - INTERVAL 6 MONTH);'
Then you should go to Admin > Performance Settings > Databases tab and adjust ALL of the retention intervals to meet your business data policy standards to keep them cleaned up as these settings are for adjusting the retention on those DB tables.
I would lower them to the smallest possible level and utilize the XI backup/restore process and the Admin > Scheduled Backups process to offload the backups to another server. Since these XI backups contain database backups you can spin them up to grab the data and report on them if needed.
See here for more information:
https://assets.nagios.com/downloads/nag ... os-XI.pdf
And here:
https://assets.nagios.com/downloads/nag ... abase.pdf
Including this info as well:
Generally at 10K total combined host/service checks (
you're at 14K) we recommend that you setup a RAMDisk (you've already done that), and at around 20K we recommend you start looking at adding an additional XI server because they can only process so much. Now this may come sooner or later than 20K depending on what type of checks you are running, how much resources they use, your hardware speed, and what you're doing to mitigate the impact.
You can read more about setting up a RAMDisk here:
https://assets.nagios.com/downloads/nag ... giosXI.pdf
You should run this check profiler script and see what long running checks you have and determine what some of your long running checks are, they consume resources the whole time they are running so reducing those helps a lot:
https://exchange.nagios.org/directory/P ... me/details
The next step would be for you to look at offloading the checks using mod gearman to reduce the impact on the XI server (you've also already done this), this would be my recommendation at what you can do to add more services and alleviate the system issues. There's just so much going with around 20K checks that you will need to do what you can to mitigate the impact such as using mod gearman, please see here for more information:
https://assets.nagios.com/downloads/nag ... ios_XI.pdf
https://support.nagios.com/kb/article.php?id=484
NOTE: Make sure that you follow the "Remote Worker Considerations" and the "Host groups and Service groups" sections from the second link above and then follow the "Disable Worker" section from the first link once you've setup your exclude groups.
Please read through this doc as well, did you enable Jumbo Frames on your network for better offloaded DB throughput? Given the sizes of the tables, that's likely most of the issue.
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
You can only do so much on a single server, you'll need to do what you can to mitigate the impact but you should start looking at adding another XI server soon if you continue to experience load/performance issues after doing the mitigation.
Let me know if you have any questions or if I can clarify anything.