Mod_Gearman causing higher CPU load on NagiosXI
Posted: Mon Mar 22, 2021 5:19 am
Dear Support,
We have a Nagios XI 5.7.3 installation running on CentOS 6.10 (still, for now).
This instance has 656 hosts with 19k services.
We recently installed Mod-Gearman with the main goal of reducing the CPU load on the Nagios server.
The worker is a remote worker that takes load off Nagios via a service group configuration.
This service group in Nagios (we called it WORKER_STPGw) holds just 588 services, so for the moment the remote worker handles only these.
Instead of reducing the load on the Nagios server, this seems to have caused higher CPU usage than we usually see.
What could be the cause of the higher load when the worker daemon is running?
In the Nagios configuration (nagios.cfg), the NEB module is:
Code: Select all
# Added by NDO 'make install-broker-line' on Wed Sep 9 10:50:48 CEST 2020
broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
broker_module=/usr/lib64/mod_gearman/mod_gearman_nagios4.o config=/etc/mod_gearman/module.conf eventhandler=no
Attached is the file "/etc/mod_gearman/module.conf" for your convenience.
In the attached graph (NagiosCPU.jpg), you will see that when the gearmand daemon was started, the average CPU usage spiked instead of going down as expected.
When looking at /var/log/gearmand/gearmand.log, I noticed the connection errors shown next. They may be coming from the worker complaining that it cannot connect to the gearman server. The worker is located in a different geographical area, but from the looks of it, it is working well.
Code: Select all
ERROR 2021-02-22 09:47:21.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2021-02-22 09:47:21.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
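To see whether these errors cluster around the time the CPU went up, the log could be bucketed per minute. This is only a minimal sketch: it assumes every log line starts with a level and a timestamp, as in the two lines above, and errors_per_minute is a hypothetical helper name, not something shipped with gearmand or Mod-Gearman.

```python
import re
from collections import Counter

# Assumed line layout, taken from the two log lines quoted above:
#   LEVEL YYYY-MM-DD HH:MM:SS.ffffff [ thread ] message -> source:line
# Only the level and the timestamp (truncated to the minute) are captured.
PREFIX_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}"
)


def errors_per_minute(lines):
    """Count ERROR lines per minute; lines that do not match are ignored."""
    counts = Counter()
    for line in lines:
        match = PREFIX_RE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[match.group("minute")] += 1
    return counts


# The two lines from gearmand.log quoted in this post.
SAMPLE = [
    "ERROR 2021-02-22 09:47:21.000000 [ 2 ] lost connection to client "
    "recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) "
    "-> libgearman-server/io.cc:100",
    "ERROR 2021-02-22 09:47:21.000000 [ 2 ] closing connection due to "
    "previous errno error -> libgearman-server/io.cc:109",
]

if __name__ == "__main__":
    for minute, count in sorted(errors_per_minute(SAMPLE).items()):
        print(minute, count)
```

Run against the full log (for example by passing an open file object instead of SAMPLE), the per-minute counts could be compared with the CPU graph.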
Attached is the worker configuration "/etc/mod_gearman/worker.conf".
The higher CPU usage was proved to be related to the worker: when gearmand was stopped, the load on Nagios returned to what it was before.
If we take a look at gearman_top, I see the following queues, which I suspect are correct.
Code: Select all
[root@am1-sha-nagios2-p etc]# gearman_top -b
2021-03-22 10:59:44 - localhost:4730 - v0.33
Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
check_results             |                1 |            0 |            0
servicegroup_WORKER_STPGw |              200 |            0 |            1
worker_bru-nms-stpgw-p    |                1 |            0 |            0
----------------------------------------------------------------------------
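For anyone who wants to check the queues automatically, the batch output above can be parsed with a few lines of Python. This is a rough sketch, assuming the pipe-separated layout shown above stays the same; parse_gearman_top and suspicious are hypothetical helper names, and the "no worker or jobs waiting" rule is just my own guess at what would look wrong.

```python
def parse_gearman_top(text):
    """Return {queue: (workers_available, jobs_waiting, jobs_running)}.

    Assumes the pipe-separated layout of `gearman_top -b` shown above.
    Header, separator and timestamp lines are skipped.
    """
    queues = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 4:
            continue  # prompt, timestamp or separator line
        name, *numbers = parts
        if not all(number.isdigit() for number in numbers):
            continue  # column header row
        queues[name] = tuple(int(number) for number in numbers)
    return queues


def suspicious(queues):
    """Queues with no worker attached or with jobs piling up (assumed rule)."""
    return [
        name
        for name, (workers, waiting, _running) in queues.items()
        if workers == 0 or waiting > 0
    ]


# The output quoted in this post.
SAMPLE = """\
2021-03-22 10:59:44 - localhost:4730 - v0.33
Queue Name | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
check_results | 1 | 0 | 0
servicegroup_WORKER_STPGw | 200 | 0 | 1
worker_bru-nms-stpgw-p | 1 | 0 | 0
----------------------------------------------------------------------------
"""

if __name__ == "__main__":
    parsed = parse_gearman_top(SAMPLE)
    print(parsed)
    print("suspicious queues:", suspicious(parsed))
```

With the output above, nothing is flagged: every queue has at least one worker and no jobs are waiting.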
Is there something missing, or something we should adjust, in our server/worker configurations?
Please let us know if you need any further information to follow up on this issue.
Rgds,
Matthew