Hi @azreenariff,
Thank you for your last reply and for uploading the screenshots. I've discussed your system internally with the support team. The server is lagging because of the large number of hosts and services. Generally, at 10K total combined host/service checks we recommend that you set up a RAMDisk (you've already done this). At around 20K, we recommend that you start looking at adding an additional XI server, because a single server can only process so much. That point may come sooner or later than 20K depending on what type of checks you are running, how many resources they use, your hardware speed, and what you're doing to mitigate the impact.
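As a rough illustration of those two thresholds, here is a minimal Python sketch. The function name and the returned wording are hypothetical, and the cutoffs are only the general guidance above, not hard limits:

```python
# General guidance only; the real limits depend on check types and hardware.
RAMDISK_THRESHOLD = 10_000       # total combined host + service checks
EXTRA_SERVER_THRESHOLD = 20_000  # total combined host + service checks


def sizing_advice(hosts, services):
    """Map a combined check count onto the general sizing guidance."""
    total = hosts + services
    if total >= EXTRA_SERVER_THRESHOLD:
        return "consider an additional XI server"
    if total >= RAMDISK_THRESHOLD:
        return "set up a RAMDisk"
    return "no change suggested by check count alone"


if __name__ == "__main__":
    print(sizing_advice(2000, 18500))
```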
Recommendations:
1. Execution Time Plugin. Run this check profiler script to find your long-running checks. A check consumes resources the whole time it is running, so shortening or reducing those checks helps a lot:
https://exchange.nagios.org/directory/P ... me/details
2. Mod-Gearman. The next step is to offload checks using Mod-Gearman to reduce the impact on the XI server (you've already done this as well). This is my recommendation for what you can do to add more services and alleviate the system issues. With around 20K checks there is so much going on that you will need to do what you can to mitigate the impact, such as using Mod-Gearman. Please see here for more information:
https://assets.nagios.com/downloads/nag ... ios_XI.pdf
https://support.nagios.com/kb/article.php?id=484
NOTE: Make sure that you follow the "Remote Worker Considerations" and "Host groups and Service groups" sections from the second link above, and then follow the "Disable Worker" section from the first link once you've set up your exclude groups.
Please read through this doc as well. With the number of checks you are running, though, I would keep the DB local at this point, because that many checks require a lot of throughput to the DB (enabling jumbo_frames is recommended):
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
3. Adjust Database Settings. Go to Admin > Performance Settings > Databases and adjust your retention settings to the smallest values you can. You're trying to fit far more into a single system than we recommend, so you'll need to make some sacrifices somewhere to mitigate that.
4. Truncate Tables. Your nagios_logentries table is huge. Want a performance boost? Truncate your large tables:
- This will likely speed up kernel message queue (KMQ) processing, so try this first
| nagios_logentries | 10757.31 |
| nagios_notifications | 6351.91 |
Follow this guide here:
- Specifically, follow this section "In certain instances, it may be necessary to truncate (empty) one or more tables" on page 5 of the PDF
https://assets.nagios.com/downloads/nag ... tabase.pdf
5. Move your DB back to local. This SHOULD speed up kernel message queue processing enough to fix the lag you are seeing.
- That's the only solution I've ever been able to find for the "customer has too many host/service checks on a single system to process the kernel message queue fast enough across the network" issue.
6. Additional XI Server. Consider a new XI license and split the load.
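For the first recommendation, if you want a quick look at your slowest checks before installing the plugin, something like the Python sketch below might help. It is not the plugin from the link, only an illustration that assumes the standard Nagios Core status file layout (servicestatus { ... } blocks containing host_name, service_description and check_execution_time). The function name is hypothetical, and the status file path is whatever status_file points to in your nagios.cfg, which will differ on a RAMDisk setup:

```python
import re
import sys


def slowest_checks(status_text, top=10):
    """Return (seconds, host, service) tuples, slowest first."""
    results = []
    # Each service is stored as a block: "servicestatus {" ... "}"
    for block in re.findall(r"servicestatus\s*\{(.*?)\n\s*\}", status_text, re.S):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.strip().partition("=")
            if sep:
                fields[key] = value
        try:
            seconds = float(fields["check_execution_time"])
        except (KeyError, ValueError):
            continue  # skip blocks without a usable execution time
        results.append(
            (seconds, fields.get("host_name", "?"), fields.get("service_description", "?"))
        )
    results.sort(reverse=True)
    return results[:top]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path from the status_file directive in nagios.cfg.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for seconds, host, service in slowest_checks(handle.read()):
            print(f"{seconds:10.3f}s  {host}  {service}")
```

Run it with the status file path as the only argument. The checks at the top of the output are the ones to look at shortening, rescheduling, or offloading first.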
Let us know if you have any questions or if we can clarify anything.