API requests causing server load to increase
-
Dusan.Mandic
- Posts: 60
- Joined: Mon Apr 06, 2020 2:30 pm
API requests causing server load to increase
Hello all,
Is there any way to throttle the API endpoint so that it can only use a certain percentage of resources?
We are on 5.8.6 and are connecting via calls from vRa (Nagios adapter) and it is causing the load to soar into the high 30's (normally 4-6)
Best,
Dusan
Re: API requests causing server load to increase
Hello Dusan,
Thanks for reaching out and providing the System Profile. In review, we see that the performance data API endpoint, rrdexport, is quite busy. There are options to optimize it; please reference the following:
- https://support.nagios.com/forum/viewtopic.php?f=6&t=50016#p260641
- https://linux.die.net/man/1/rrdxport
Thanks,
Perry
Dusan.Mandic
Re: API requests causing server load to increase
I'm not certain I got the gist of what to optimize from the above links.
Is there a way to allot a certain percentage of processing to the API? Or maybe to stop sending performance data to the API (we still want it in XI)? Anything that helps offset the load would be welcome.
I toggled off the following in performance settings:
Enable Outbound Data Transfers
Enable Listener For Unconfigured Objects
Re: API requests causing server load to increase
Hello @dusan.mandic
There is no way to limit the resources available to the API process, and no 'exclusion' logic is available to keep performance data out of the API rrdexport.
In review, we see that 'process_perfdata.pl' is timing out; there is an option to increase the timeout in:
/usr/local/nagios/etc/pnp/process_perfdata.cfg
To:
TIMEOUT = 20
Other options include excluding perfdata on checks by setting 'process_perf_data=0' on objects. Additionally, there is the option to write perfdata to a file and import it later, manually or via a script, which helps reduce resource use.
Thanks,
Perry
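To confirm that the timeout change is actually in the file, a minimal Python sketch like the one below could read it back. It assumes the file consists of simple KEY = value lines with # comments; the helper name read_settings is hypothetical and is not part of Nagios XI or PNP.

```python
# Path taken from the reply above.
CFG_PATH = "/usr/local/nagios/etc/pnp/process_perfdata.cfg"


def read_settings(text):
    """Parse 'KEY = value' lines into a dict, ignoring '#' comments.

    Assumption: the file uses this simple format throughout.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    try:
        with open(CFG_PATH) as handle:
            timeout = read_settings(handle.read()).get("TIMEOUT")
        print(f"TIMEOUT is set to: {timeout}")
    except OSError as err:
        print(f"Could not read {CFG_PATH}: {err}")
```

Run it on the XI server after editing the file; if TIMEOUT comes back as None, the key was not found in the expected format.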
Dusan.Mandic
Re: API requests causing server load to increase
I changed the timeout to 20 but didn't see a huge performance impact.
The database just crashed and had to restart. Can you look into this? Uploaded file.
Re: API requests causing server load to increase
Hello @Dusan.Mandic
Thanks for following up with the details and 'System Profile'. In review, we see depleted resources. The following issues span the entire environment rather than pointing to a single element; the cause is the accumulated resource usage throughout.
- services are Caught SIGSEGV, shutting down
- mysqli::mysqli(): (08004/1040): Too many connections
- NPCD: WARN: MAX load reached: load
- TIMEOUT: /var/nagiosramdisk/spool/perfdata
Generally, at 10K total combined host/service checks, we recommend setting up a RAMDisk. At around 20K, we recommend you start adding an additional XI server because they can only process so much. Now, this may come sooner or later than 20K, depending on what type of checks you are running, how many resources they use, your hardware speed, and what you're doing to mitigate the impact.
We see that you already have implemented RAMDisk; you can read more about optimizing RAMDisk here to verify it is functioning:
https://assets.nagios.com/downloads/nagiosxi/docs/Utilizing_A_RAM_Disk_In_NagiosXI.pdf
It would help if you run this check profiler script to determine what some of your long-running checks are; they consume resources the whole time they are running, so investigating them and reducing them helps:
https://exchange.nagios.org/directory/Plugins/Network-and-Systems-Management/Nagios/Profiler-to-check-plugin-execution-time/details
Or this component here:
https://exchange.nagios.org/directory/Addons/Components/Check-Profiler/details
It would help if you also fixed failing or duplicate host/service checks. They consume resources while holding connections open until they time out or fail, and failing checks use a lower retry_interval, which will cause even more checks to run.
Please read through this doc as well. Given the total number of checks you are running, I would leave the DB local at this point in time, because that many checks require a lot of throughput to the DB:
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
You can only do so much on a single XI server. Do what you can to mitigate the impact, but start looking at adding another XI server soon if you continue to experience load, kernel message queue, or performance issues after the mitigation.
Let me know if you have any questions or if I can clarify anything.
Thanks,
Perry
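To see which of the four symptoms listed in the reply above dominates over time, a small Python sketch could count them in log lines. The search strings are copied from that list; the labels, the function name count_symptoms, and the idea of passing log files on the command line are assumptions for illustration, not a Nagios tool.

```python
import sys
from collections import Counter

# Substrings copied from the symptoms in the reply above; labels are made up.
SYMPTOMS = {
    "segfault": "Caught SIGSEGV",
    "db_connections": "Too many connections",
    "npcd_load": "MAX load reached",
    "perfdata_timeout": "TIMEOUT",
}


def count_symptoms(lines):
    """Count how many lines contain each symptom substring."""
    counts = Counter()
    for line in lines:
        for label, needle in SYMPTOMS.items():
            if needle in line:
                counts[label] += 1
    return counts


if __name__ == "__main__":
    # Usage (log paths are whatever applies on your system):
    #   python count_symptoms.py <logfile> [<logfile> ...]
    totals = Counter()
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            totals.update(count_symptoms(handle))
    for label, count in totals.most_common():
        print(f"{label}: {count}")
```

Comparing the counts before and after each mitigation gives a rough indication of whether a change helped, independent of the load average at any given moment.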
Dusan.Mandic
Re: API requests causing server load to increase
Thanks Perry.
I spent the majority of the weekend looking into some of the solutions that you proposed and the Nagios Core/XI docs, and had some follow-up questions.
Our DB connections are set to the default MariaDB max of 151. What do you suggest as a value? Would there be any value in tuning the memory buffer capacity as well? We have about 32 GB on the box (virtual).
MariaDB [(none)]> SHOW VARIABLES LIKE 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 151   |
+-----------------+-------+
1 row in set (0.00 sec)
--- How much value would there be in tuning the following performance values to mitigate load in our current environment---
Dashlet Refresh Multiplier - currently 1000. We have these displayed on wall monitors in our office, but if I were to change it to 3000 or even 5000, how much would that offset load by reducing AJAX calls?
Use Unified Tactical Overview - Use Unified Hostgroup Screens - Use Unified Servicegroup Screens
AFAIK, we don't use any custom widgets in our dashboards, but we do have custom dashes (not elements) for user logins. Would toggling these on change the dashes when users log in?
Disabling subsystem logging? What details would be lost by disabling this? How much performance gain?
Setting host check interval to 0 - would the performance saving by caching the host checks/relying on service checks be worth instituting in our environment? Is there any value/load reduction in setting up host/service dependencies for check/notification?
The RAMDISK page doesn't have any optimization content that I can see. It seems that all the directives are the same as the ones in our environment. I just ran the script provided by Nagios when setting it up. Here's a df -h
[htadmin@bnalmnag702 ~]$ df -h
Filesystem                                    Size  Used Avail Use% Mounted on
devtmpfs                                       16G     0   16G   0% /dev
tmpfs                                          16G  4.0K   16G   1% /dev/shm
tmpfs                                          16G  138M   16G   1% /run
tmpfs                                          16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/VolGroup00-vg00_lv_root           242G   58G  174G  25% /
/dev/sda1                                     477M  194M  254M  44% /boot
/dev/mapper/VolGroup00-vg00_lv_tmp            7.9G   18M  7.5G   1% /tmp
/dev/mapper/VolGroup00-vg00_lv_home           8.4G   22M  8.0G   1% /home
/dev/mapper/VolGroup00-vg00_lv_var_log         16G  5.1G  9.8G  35% /var/log
/dev/mapper/VolGroup00-vg00_lv_var_log_audit  3.9G   50M  3.6G   2% /var/log/audit
tmpfs                                        1000M   42M  959M   5% /var/nagiosramdisk
Seems like it's in use, but usage is pretty low.
The reaper settings were configured in a previous post:
check_result_reaper_frequency=3
max_check_result_reaper_time=10
Would lowering these further have any benefit?
I appreciate your time.
Best,
Dusan
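As a way to keep an eye on the RAM disk without reading df -h by hand, a short Python sketch along these lines could report how full the mount is. The mount point comes from the df -h output in the post; the 80% warning threshold and the function names are assumptions chosen for illustration.

```python
import shutil

# Mount point as shown in the df -h output in the post.
RAMDISK = "/var/nagiosramdisk"


def percent_used(used, total):
    """Return used/total as a percentage; 0.0 when total is zero."""
    return 100.0 * used / total if total else 0.0


def ramdisk_report(path=RAMDISK, warn_at=80.0):
    """Describe how full the filesystem at `path` is.

    warn_at is an arbitrary illustrative threshold, not a Nagios default.
    """
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.used, usage.total)
    status = "WARN" if pct >= warn_at else "OK"
    return f"{status}: {path} is {pct:.1f}% full"


if __name__ == "__main__":
    try:
        print(ramdisk_report())
    except OSError as err:
        print(f"Could not inspect {RAMDISK}: {err}")
```

With the figures from the post, 42M used of 1000M works out to about 4.2%, which df displays as 5%, so the RAM disk has plenty of headroom.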
Dusan.Mandic
Re: API requests causing server load to increase
These are the top "offenders" for service time executions
Service: CPU Statistics Average Execution Time: 5.122 (sec) NumChecks: 2
Service: CPU Stats Average Execution Time: 5.144 (sec) NumChecks: 846
Service: Yum Updates Average Execution Time: 5.873 (sec) NumChecks: 7
Service: check_LCKW Average Execution Time: 15.728 (sec) NumChecks: 1
Would changing the freshness thresholds/cache for these make a difference? The check_LCKW needs to be reported in real time. The checks are written in bash; would compiling them be suggested?
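One way to read the profiler figures above is to multiply the average execution time by the number of checks, which gives a rough total of execution seconds per service name. The Python sketch below does that for the four lines posted; the regular expression assumes the exact line layout shown, and the function name rank_by_total_time is hypothetical.

```python
import re

# Profiler lines copied from the post above.
PROFILER_OUTPUT = """\
Service: CPU Statistics Average Execution Time: 5.122 (sec) NumChecks: 2
Service: CPU Stats Average Execution Time: 5.144 (sec) NumChecks: 846
Service: Yum Updates Average Execution Time: 5.873 (sec) NumChecks: 7
Service: check_LCKW Average Execution Time: 15.728 (sec) NumChecks: 1
"""

# Assumes each entry follows the layout of the lines above.
LINE_RE = re.compile(
    r"Service:\s+(?P<name>.+?)\s+Average Execution Time:\s+(?P<avg>[\d.]+)"
    r"\s+\(sec\)\s+NumChecks:\s+(?P<count>\d+)"
)


def rank_by_total_time(text):
    """Return (name, avg_sec, num_checks, total_sec) rows, largest total first."""
    rows = []
    for match in LINE_RE.finditer(text):
        avg = float(match["avg"])
        count = int(match["count"])
        rows.append((match["name"], avg, count, avg * count))
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for name, avg, count, total in rank_by_total_time(PROFILER_OUTPUT):
        print(f"{name:<16} avg={avg:>7.3f}s checks={count:>4} total={total:>9.1f}s")
```

Ranked this way, check_LCKW has the longest single run, but the 846 CPU Stats checks account for far more total execution time, so that service is likely the more rewarding place to start tuning.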
Re: API requests causing server load to increase
Hello Dusan,
> I spent the majority of the weekend looking into some of the solutions that you have proposed and Nagios Core/XI docs, and had some follow up questions.
> Our DB connections are set to default MariaDB max of 151. What do you suggest as a value? Would there be any value in tuning the memory buffer capacity as well? we have about 32 GB on the box (virtual)
> MariaDB [(none)]> SHOW VARIABLES LIKE 'max_connections';
> +-----------------+-------+
> | Variable_name   | Value |
> +-----------------+-------+
> | max_connections | 151   |
> +-----------------+-------+
> 1 row in set (0.00 sec)
You have the option to temporarily increase the 'max_connections' parameter to find out whether performance improves.
> --- How much value would there be in tuning the following performance values to mitigate load in our current environment---
> Dashlet Refresh Multiplier - currently 1000. We have these displayed on wallmonitors in our office, but if I were to change to 3000 or even 5000, how much would that offset load by reducing AJAX calls?
Currently, 1000 is optimal; tweaking the numbers may have an advantage, but again, that depends on the individual environment.
> Use Unified Tactical Overview - Use Unified Hostgroup Screens - Use Unified Servicegroup Screens
> AFAIK, we don't use any custom widgets in our dashboards, but we do have custom dashes (not elements) for user logins. Would toggling the following on change the dashes when users log in?
Performance settings noted; disabling (default out of the box) 'Outbound Data Transfers', 'Listening for Unconfigured Objects', and 'Subsystem logging' will help a little.
> Disabling subsystem logging? What details would be lost by disabling this? How much performance gain?
Subsystem logging will not affect host and service check details; it only provides increased logging for subsystem events.
> Setting host check interval to 0 - would the performance saving by caching the host checks/relying on service checks be worth instituting in our environment? Is there any value/load reduction in setting up host/service dependencies for check/notification?
The diagram in this linked doc will explain.
> The RAMDISK page doesn't have any optimization content that I can see. It seems that all the directives are the same that are in our environment. I just ran the script provided by Nagios when setting it up. Here's a df -h
> [htadmin@bnalmnag702 ~]$ df -h
> Filesystem                                    Size  Used Avail Use% Mounted on
> devtmpfs                                       16G     0   16G   0% /dev
> tmpfs                                          16G  4.0K   16G   1% /dev/shm
> tmpfs                                          16G  138M   16G   1% /run
> tmpfs                                          16G     0   16G   0% /sys/fs/cgroup
> /dev/mapper/VolGroup00-vg00_lv_root           242G   58G  174G  25% /
> /dev/sda1                                     477M  194M  254M  44% /boot
> /dev/mapper/VolGroup00-vg00_lv_tmp            7.9G   18M  7.5G   1% /tmp
> /dev/mapper/VolGroup00-vg00_lv_home           8.4G   22M  8.0G   1% /home
> /dev/mapper/VolGroup00-vg00_lv_var_log         16G  5.1G  9.8G  35% /var/log
> /dev/mapper/VolGroup00-vg00_lv_var_log_audit  3.9G   50M  3.6G   2% /var/log/audit
> tmpfs                                        1000M   42M  959M   5% /var/nagiosramdisk
> Seems like its in use, but pretty low.
You are correct about the optimal configs when installing the 'nagiosramdisk'.
> The reaper settings were previously configured in a previous post
> check_result_reaper_frequency=3
> max_check_result_reaper_time=10
> Would changing these even lower have any benefit?
These settings are optimal.
> These are the top "offenders" for service time executions
> Service: CPU Statistics Average Execution Time: 5.122 (sec) NumChecks: 2
> Service: CPU Stats Average Execution Time: 5.144 (sec) NumChecks: 846
> Service: Yum Updates Average Execution Time: 5.873 (sec) NumChecks: 7
> Service: check_LCKW Average Execution Time: 15.728 (sec) NumChecks: 1
> Would changing the freshness thresholds/cache for these make a difference?
https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/freshness.html
> The check_LCKW needs to be reported in real-time. Written in bash, would compiling these be suggested?
Realtime runs checks once a minute. But if you are looking for a plugin that uses fewer resources and can run a bit quicker, then a compiled one would be faster.
A lot of variables come into play, so scenarios will vary depending on the environment. You are correct, though; there are optimizations that can be made throughout.
Thanks,
Perry
Dusan.Mandic
Re: API requests causing server load to increase
Thanks Perry, I've instituted a combination of these and it seems like our load is back under 5. Thanks for the help.