API requests causing server load to increase

Post by **Dusan.Mandic** » Wed Oct 20, 2021 5:06 pm

Hello all,

Is there any way to throttle the API endpoint so that it can only use a certain percentage of resources?

We are on 5.8.6, and calls from vRa (Nagios adapter) are causing the load to soar into the high 30s (normally 4-6).

Best,

Dusan

Post by **pbroste** » Thu Oct 21, 2021 11:31 am

Hello Dusan,

Thanks for reaching out and providing the System Profile. On review, we see that the performance data API call (rrdexport) is quite busy. There are options to optimize this. Please reference the following:

Please let us know if you have further questions.

Thanks,
Perry

Post by **Dusan.Mandic** » Thu Oct 21, 2021 3:48 pm

I'm not certain I got the gist of what to optimize from the above links.

Is there a way to limit the API to a certain percentage of processing? Or maybe to stop sending performance data to the API (we still want it in XI)? Anything that helps offset the load would be welcome.

I toggled off the following in performance settings:

Enable Outbound Data Transfers
Enable Listener For Unconfigured Objects

Post by **pbroste** » Fri Oct 22, 2021 9:53 am

Hello @dusan.mandic

There is no way to limit the resources used by the API process, and no 'exclusion' logic is available to keep performance data out of the API rrdexport call.

On review, we see that 'process_perfdata.pl' is timing out. You can increase the timeout by editing this file:

/usr/local/nagios/etc/pnp/process_perfdata.cfg

and changing the setting to:

TIMEOUT = 20

Other options include disabling perfdata processing for individual checks by setting 'process_perf_data=0' on those objects. Additionally, you can write perfdata to a file and import it later, manually or via a script, which helps reduce resource usage.

Thanks,
Perry

Post by **Dusan.Mandic** » Thu Oct 28, 2021 11:09 am

I changed the timeout to 20 but didn't see a big change in performance.

The database just crashed and had to be restarted. Can you look into this? I've uploaded the file.

Post by **pbroste** » Fri Oct 29, 2021 10:25 am

Hello @Dusan.Mandic

Thanks for following up with the details and the 'System Profile'. On review, we see depleted resources. The following issues span the entire environment; no single element is the cause, but rather the accumulated resource usage throughout.

Services are logging: Caught SIGSEGV, shutting down
mysqli::mysqli(): (08004/1040): Too many connections
NPCD: WARN: MAX load reached: load
TIMEOUT: /var/nagiosramdisk/spool/perfdata

We also see that "/opt/ds_agent/ds_am" and "/opt/SumoCollector" are consuming a lot of resources as well, and we recommend considering ways to offload them.

Generally, at 10K total combined host/service checks, we recommend setting up a RAMDisk. At around 20K, we recommend you start adding an additional XI server, because a single server can only process so much. This point may come sooner or later than 20K, depending on what type of checks you are running, how many resources they use, your hardware speed, and what you're doing to mitigate the impact.
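
As a rough illustration of the rule of thumb above, the thresholds can be written as a small Python helper. This is only a sketch: the function name and the example counts are made up for illustration, and the 10K/20K figures are the general guidance quoted above, not hard limits.

```python
# Rule-of-thumb thresholds quoted in this thread (combined host + service checks).
RAMDISK_THRESHOLD = 10_000
EXTRA_SERVER_THRESHOLD = 20_000


def sizing_advice(host_checks, service_checks):
    """Return (total_checks, suggestions) for a single XI server.

    Hypothetical helper: real capacity depends on check types, plugin cost,
    hardware, and whatever mitigation is already in place.
    """
    total = host_checks + service_checks
    suggestions = []
    if total >= RAMDISK_THRESHOLD:
        suggestions.append("set up a RAMDisk")
    if total >= EXTRA_SERVER_THRESHOLD:
        suggestions.append("start planning an additional XI server")
    return total, suggestions


if __name__ == "__main__":
    # Example counts are invented, not taken from the System Profile.
    total, suggestions = sizing_advice(host_checks=1_500, service_checks=12_000)
    print(total, suggestions)
```

Treat the output as a prompt to investigate, since the real limit may arrive earlier or later than these numbers.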

We see that you already have implemented RAMDisk; you can read more about optimizing RAMDisk here to verify it is functioning:

https://assets.nagios.com/downloads/nagiosxi/docs/Utilizing_A_RAM_Disk_In_NagiosXI.pdf

It would help if you run this check profiler script to determine what some of your long-running checks are; they consume resources the whole time they are running, so investigating them and reducing them helps:

https://exchange.nagios.org/directory/Plugins/Network-and-Systems-Management/Nagios/Profiler-to-check-plugin-execution-time/details

Or this component here:

https://exchange.nagios.org/directory/Addons/Components/Check-Profiler/details

It would also help to fix failing or duplicate host/service checks, because they consume resources while holding connections open until they time out or fail. Failing checks are also rescheduled at the retry_interval, which is usually lower than the normal check interval and therefore causes even more checks to run.
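
To put a number on the retry_interval effect, here is a minimal Python sketch. The interval values are assumptions chosen for illustration (5 and 1, read as minutes), not values taken from this environment, and the helper name is hypothetical. In Nagios Core the retry interval applies while a check is being retried in a non-OK soft state, so the higher rate applies to that period.

```python
def runs_per_hour(interval_minutes):
    """How many times a check is scheduled per hour at a given interval."""
    if interval_minutes <= 0:
        raise ValueError("interval must be positive")
    return 60.0 / interval_minutes


# Hypothetical values for illustration only; check your own object definitions.
CHECK_INTERVAL = 5  # minutes between checks while the service is OK
RETRY_INTERVAL = 1  # minutes between rechecks while the check is being retried

normal = runs_per_hour(CHECK_INTERVAL)
failing = runs_per_hour(RETRY_INTERVAL)
print(f"OK: {normal:.0f}/h, retrying: {failing:.0f}/h, ratio: {failing / normal:.0f}x")
```

With many checks failing at once, that multiplier comes on top of each failing check holding a worker until it times out.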

Please read through this doc as well. With the number of checks you are running, I would leave the DB local at this point, because that many checks require a lot of throughput to the DB:

https://assets.nagios.com/downloads/nag ... ios-XI.pdf

You can only do so much on a single XI server. Do what you can to mitigate the impact, but start looking at adding another XI server soon if you continue to experience load, kernel message queue, or performance issues after the mitigation.

Let me know if you have any questions or if I can clarify anything.

Thanks,
Perry

Post by **Dusan.Mandic** » Mon Nov 01, 2021 11:05 am

Thanks Perry.

I spent the majority of the weekend looking into some of the solutions that you proposed and the Nagios Core/XI docs, and have some follow-up questions.

Our DB connections are set to the default MariaDB max of 151. What do you suggest as a value? Would there be any value in tuning the memory buffer capacity as well? We have about 32 GB on the box (virtual).

MariaDB [(none)]> SHOW VARIABLES LIKE 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 151   |
+-----------------+-------+
1 row in set (0.00 sec)

--- How much value would there be in tuning the following performance settings to mitigate load in our current environment? ---

Dashlet Refresh Multiplier - currently 1000. We have these displayed on wall monitors in our office, but if I were to change it to 3000 or even 5000, how much would that offset load by reducing AJAX calls?

Use Unified Tactical Overview - Use Unified Hostgroup Screens - Use Unified Servicegroup Screens
AFAIK, we don't use any custom widgets in our dashboards, but we do have custom dashes (not elements) for user logins. Would toggling these on change the dashes when users log in?

Disabling subsystem logging? What details would be lost by disabling this? How much performance gain?

Setting host check interval to 0 - would the performance saving by caching the host checks/relying on service checks be worth instituting in our environment? Is there any value/load reduction in setting up host/service dependencies for check/notification?

The RAMDISK page doesn't have any optimization content that I can see. It seems that all the directives are the same as those in our environment. I just ran the script provided by Nagios when setting it up. Here's the df -h output:

[htadmin@bnalmnag702 ~]$ df -h
Filesystem                                    Size  Used Avail Use% Mounted on
devtmpfs                                       16G     0   16G   0% /dev
tmpfs                                          16G  4.0K   16G   1% /dev/shm
tmpfs                                          16G  138M   16G   1% /run
tmpfs                                          16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/VolGroup00-vg00_lv_root           242G   58G  174G  25% /
/dev/sda1                                     477M  194M  254M  44% /boot
/dev/mapper/VolGroup00-vg00_lv_tmp            7.9G   18M  7.5G   1% /tmp
/dev/mapper/VolGroup00-vg00_lv_home           8.4G   22M  8.0G   1% /home
/dev/mapper/VolGroup00-vg00_lv_var_log         16G  5.1G  9.8G  35% /var/log
/dev/mapper/VolGroup00-vg00_lv_var_log_audit  3.9G   50M  3.6G   2% /var/log/audit
tmpfs                                        1000M   42M  959M   5% /var/nagiosramdisk

Seems like it's in use, but usage is pretty low.

The reaper settings were configured earlier, per a previous post:
check_result_reaper_frequency=3
max_check_result_reaper_time=10

Would lowering these further have any benefit?

I appreciate your time

Best,

Dusan

Post by **Dusan.Mandic** » Mon Nov 01, 2021 1:05 pm

These are the top "offenders" for service execution time:

Service: CPU Statistics Average Execution Time: 5.122 (sec) NumChecks: 2
Service: CPU Stats Average Execution Time: 5.144 (sec) NumChecks: 846
Service: Yum Updates Average Execution Time: 5.873 (sec) NumChecks: 7
Service: check_LCKW Average Execution Time: 15.728 (sec) NumChecks: 1

Would changing the freshness thresholds/cache for these make a difference? The check_LCKW result needs to be reported in real time. It is written in bash; would compiling it be suggested?
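
One rough way to read the profiler lines above is to weight each average by the number of checks behind it, since a moderately slow check that runs 846 times costs far more in total than a single very slow one. The Python sketch below does that. It assumes NumChecks is the count of checks sharing that service name, and that average time multiplied by count is a fair proxy for relative cost; both are assumptions, and the function name is hypothetical.

```python
import re

# Profiler lines as posted in this thread.
PROFILER_OUTPUT = """\
Service: CPU Statistics Average Execution Time: 5.122 (sec) NumChecks: 2
Service: CPU Stats Average Execution Time: 5.144 (sec) NumChecks: 846
Service: Yum Updates Average Execution Time: 5.873 (sec) NumChecks: 7
Service: check_LCKW Average Execution Time: 15.728 (sec) NumChecks: 1
"""

LINE_RE = re.compile(
    r"^Service: (?P<name>.+?) Average Execution Time: (?P<avg>[\d.]+) \(sec\) "
    r"NumChecks: (?P<count>\d+)$"
)


def rank_by_aggregate_cost(text):
    """Return (name, avg_seconds, num_checks, avg * num_checks), largest first."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a profiler line
        avg = float(match.group("avg"))
        count = int(match.group("count"))
        rows.append((match.group("name"), avg, count, avg * count))
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for name, avg, count, total in rank_by_aggregate_cost(PROFILER_OUTPUT):
        print(f"{name:<16} avg={avg:7.3f}s checks={count:4d} aggregate={total:9.1f}s")
```

By this measure "CPU Stats" dominates the list, so shaving time off that one plugin would likely matter more than speeding up check_LCKW.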

Post by **pbroste** » Mon Nov 01, 2021 3:04 pm

Hello Dusan,

> I spent the majority of the weekend looking into some of the solutions that you proposed and the Nagios Core/XI docs, and have some follow-up questions.
>
> Our DB connections are set to the default MariaDB max of 151. What do you suggest as a value? Would there be any value in tuning the memory buffer capacity as well? We have about 32 GB on the box (virtual).
>
> MariaDB [(none)]> SHOW VARIABLES LIKE 'max_connections';
> +-----------------+-------+
> | Variable_name   | Value |
> +-----------------+-------+
> | max_connections | 151   |
> +-----------------+-------+
> 1 row in set (0.00 sec)

You have the option to temporarily increase the 'max_connections' parameter to see whether performance improves.
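
A minimal sketch of that experiment, assuming you run the resulting statements yourself in the MariaDB client. `SET GLOBAL max_connections` changes the limit at runtime and is lost at the next MariaDB restart unless the value is also set in the server config file, and `Max_used_connections` is the status counter for the peak number of connections since startup. The value 300 and the helper names are hypothetical, not a recommendation.

```python
# Statement to inspect the peak number of simultaneous connections since startup.
CHECK_PEAK_SQL = "SHOW GLOBAL STATUS LIKE 'Max_used_connections';"


def temporary_max_connections_sql(new_limit):
    """Build the statement that raises the limit until the next MariaDB restart."""
    if not isinstance(new_limit, int) or new_limit < 1:
        raise ValueError("new_limit must be a positive integer")
    return f"SET GLOBAL max_connections = {new_limit};"


def limit_was_reached(max_connections, max_used_connections):
    """True if the recorded peak hit the configured ceiling."""
    return max_used_connections >= max_connections


if __name__ == "__main__":
    print(CHECK_PEAK_SQL)
    print(temporary_max_connections_sql(300))  # 300 is an example value only
```

If the peak stays well below the new limit afterwards, the "Too many connections" errors were probably a symptom of overall load and not of the limit itself.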

> --- How much value would there be in tuning the following performance settings to mitigate load in our current environment? ---
>
> Dashlet Refresh Multiplier - currently 1000. We have these displayed on wall monitors in our office, but if I were to change it to 3000 or even 5000, how much would that offset load by reducing AJAX calls?

The current value of 1000 is optimal. Tweaking the number may help, but that depends on the individual environment.

> Use Unified Tactical Overview - Use Unified Hostgroup Screens - Use Unified Servicegroup Screens
> AFAIK, we don't use any custom widgets in our dashboards, but we do have custom dashes (not elements) for user logins. Would toggling these on change the dashes when users log in?

Performance settings noted; disabling (default out of the box) 'Outbound Data Transfers', 'Listening for Unconfigured Objects', and 'Subsystem Logging' will help a little.

> Disabling subsystem logging? What details would be lost by disabling this? How much performance gain?

Subsystem logging will not affect host and service check details; it only provides increased logging for subsystem events.

> Setting host check interval to 0 - would the performance saving by caching the host checks/relying on service checks be worth instituting in our environment? Is there any value/load reduction in setting up host/service dependencies for check/notification?

This is explained in the diagram in this linked doc.

> The RAMDISK page doesn't have any optimization content that I can see. It seems that all the directives are the same as those in our environment. I just ran the script provided by Nagios when setting it up. Here's the df -h output:
>
> [htadmin@bnalmnag702 ~]$ df -h
> Filesystem                                    Size  Used Avail Use% Mounted on
> devtmpfs                                       16G     0   16G   0% /dev
> tmpfs                                          16G  4.0K   16G   1% /dev/shm
> tmpfs                                          16G  138M   16G   1% /run
> tmpfs                                          16G     0   16G   0% /sys/fs/cgroup
> /dev/mapper/VolGroup00-vg00_lv_root           242G   58G  174G  25% /
> /dev/sda1                                     477M  194M  254M  44% /boot
> /dev/mapper/VolGroup00-vg00_lv_tmp            7.9G   18M  7.5G   1% /tmp
> /dev/mapper/VolGroup00-vg00_lv_home           8.4G   22M  8.0G   1% /home
> /dev/mapper/VolGroup00-vg00_lv_var_log         16G  5.1G  9.8G  35% /var/log
> /dev/mapper/VolGroup00-vg00_lv_var_log_audit  3.9G   50M  3.6G   2% /var/log/audit
> tmpfs                                        1000M   42M  959M   5% /var/nagiosramdisk
>
> Seems like it's in use, but usage is pretty low.

You are correct; the 'nagiosramdisk' installation already applies the optimal configs.

> The reaper settings were configured earlier, per a previous post:
> check_result_reaper_frequency=3
> max_check_result_reaper_time=10
>
> Would lowering these further have any benefit?

These settings are optimal.

> These are the top "offenders" for service execution time:
>
> Service: CPU Statistics Average Execution Time: 5.122 (sec) NumChecks: 2
> Service: CPU Stats Average Execution Time: 5.144 (sec) NumChecks: 846
> Service: Yum Updates Average Execution Time: 5.873 (sec) NumChecks: 7
> Service: check_LCKW Average Execution Time: 15.728 (sec) NumChecks: 1
>
> Would changing the freshness thresholds/cache for these make a difference?

https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/freshness.html

> The check_LCKW result needs to be reported in real time. It is written in bash; would compiling it be suggested?

Real-time checks run once a minute. But if you are looking for a plugin that uses fewer resources and runs a bit quicker, then a compiled plugin would be faster.

A lot of variables come into play, so results will vary depending on the environment. You are correct, though: there are optimizations that can be made throughout.

Thanks,
Perry

Post by **Dusan.Mandic** » Wed Nov 03, 2021 11:25 am

Thanks Perry, I've instituted a combination of these, and it seems our load is back under 5. Thanks for the help.

Nagios Support Forum
