We have a Nagios XI installation, v5.7.3, running on (still, for now) CentOS 6.10.
This instance has 656 hosts with 19k services.
Recently, we installed the Mod-Gearman service, with the main goal of reducing the Nagios CPU load.
The worker is a remote worker that offloads work from Nagios via a service group configuration.
This service group in Nagios holds just 588 services (we called it WORKER_STPGw), so for the moment this remote worker handles only those.
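For reference, the service-group routing is configured on both sides roughly like this (a minimal sketch using the standard mod_gearman "servicegroups" option; only the group name is ours, everything else from our configs is omitted here):
Code: Select all
# /etc/mod_gearman/module.conf (NEB side):
# route checks for this service group into its own gearman queue
# (this is what creates the servicegroup_WORKER_STPGw queue seen in gearman_top)
servicegroups=WORKER_STPGw

# /etc/mod_gearman/worker.conf (remote worker side):
# only consume jobs from that same queue
servicegroups=WORKER_STPGw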
In the Nagios configuration (nagios.cfg), the NEB broker_module lines are:
Code: Select all
# Added by NDO 'make install-broker-line' on Wed Sep 9 10:50:48 CEST 2020
broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
broker_module=/usr/lib64/mod_gearman/mod_gearman_nagios4.o config=/etc/mod_gearman/module.conf eventhandler=no
In the attached graph (NagiosCPU.jpg), you can see that when the gearmand daemon was started, the average CPU spiked instead of dropping as expected.
When looking at /var/log/gearmand/gearmand.log, I noticed these connection errors:
Code: Select all
ERROR 2021-02-22 09:47:21.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2021-02-22 09:47:21.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
Attached is the worker configuration, /etc/mod_gearman/worker.conf.
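In case the attachment does not render inline, here is a sketch of the shape of that file; the option names are standard mod_gearman worker options, but the values below are illustrative placeholders, not our exact settings:
Code: Select all
# /etc/mod_gearman/worker.conf -- illustrative sketch, not the exact attached file
server=<nagios-xi-host>:4730       # gearmand running on the Nagios XI server
servicegroups=WORKER_STPGw         # only take jobs from servicegroup_WORKER_STPGw
hosts=no                           # do not consume the generic host-check queue
services=no                        # do not consume the generic service-check queue
eventhandler=no
min-worker=10                      # placeholder sizing
max-worker=200                     # gearman_top shows 200 workers available on our queue
encryption=yes
key=<shared-secret>                # must match the key used by the NEB module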
The higher CPU usage was confirmed to come from the Mod-Gearman setup: when gearmand was stopped, the load on Nagios returned to what it was before.
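(For completeness, that comparison was made simply by stopping and restarting the daemon; on CentOS 6 this is done with the usual init scripts, assuming the service is registered as "gearmand":)
Code: Select all
# on the Nagios XI server (CentOS 6)
service gearmand stop     # load on the Nagios server drops back to pre-Mod-Gearman levels
service gearmand start    # the CPU spike returns once the NEB module reconnects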
If we take a look at gearman_top, I see the following queues, which I believe are correct:
Code: Select all
[root@am1-sha-nagios2-p etc]# gearman_top -b
2021-03-22 10:59:44 - localhost:4730 - v0.33
Queue Name | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
check_results | 1 | 0 | 0
servicegroup_WORKER_STPGw | 200 | 0 | 1
worker_bru-nms-stpgw-p | 1 | 0 | 0
----------------------------------------------------------------------------
Is something missing, or does anything need adjusting, in our server/worker configurations?
Please let us know if you need anything else from us to follow up on this issue.
Rgds,
Matthew