Gearmand server issue

deby23456

Post by deby23456 »

Hello,

I am having trouble figuring out why my gearmand server keeps throwing the error "error reading from localhost:4730 - Interrupted system call" when I run gearman_top. Some checks (both service and host) are also getting orphaned, with a message saying there is no worker available, even though there always is one, as I have 4 Gearman workers running at the same time for all of my checks.
In total there are 9 workers, working in groups (4 workers for systems, 4 workers for network and 1 worker for external checks). From time to time, completely at random, I get service and host CRITICAL alerts saying that a worker might not be available for the queue. Server resources are fine across the whole infrastructure (server and workers), and I have ~19k checks.
The only change from the "default" configuration is that I have disabled forking on the network workers, in case that would help, but it did not.
While debugging on a worker, I saw the error below, which makes me believe that gearmand cannot accept all the results from the workers and that this is why I get these errors. But resource usage on the server is nowhere near high...

[2023-03-06 20:57:37][63881][ERROR] sending job to gearmand failed: flush(GEARMAN_COULD_NOT_CONNECT) Connection to nagiosxi:4730 failed -> libgearman/connection.cc:724: pid(63881)


Today I upgraded all mod-gearman workers (mod_gearman_worker: version 5.0.1 running on libgearman 1.1.12), but I still get errors in their logs:

[2023-03-07 13:57:11][3606][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:337: pid(3606
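
For anyone comparing notes, here is a rough sketch of how the worker log could be tallied to see whether these errors cluster at particular hours. It assumes the log format matches the two error lines quoted above; the function name `tally_errors` and the `SAMPLE` list are only illustrative and are not part of mod_gearman.

```python
import re
from collections import Counter

# Matches worker log lines shaped like the two errors quoted in this post:
# [date time][pid][ERROR] ... something(GEARMAN_SOME_CODE) ...
LINE_RE = re.compile(
    r"^\[(?P<day>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2}\]"
    r"\[\d+\]\[ERROR\].*?\((?P<code>GEARMAN_[A-Z_]+)\)"
)


def tally_errors(lines):
    """Count ERROR lines per (day, hour, GEARMAN_* code)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m["day"], m["hour"], m["code"])] += 1
    return counts


# The two lines from this post, shortened after the error code.
SAMPLE = [
    "[2023-03-06 20:57:37][63881][ERROR] sending job to gearmand failed: "
    "flush(GEARMAN_COULD_NOT_CONNECT) Connection to nagiosxi:4730 failed",
    "[2023-03-07 13:57:11][3606][ERROR] sending job to gearmand failed: "
    "gearman_wait(GEARMAN_TIMEOUT) timeout reached",
]

for (day, hour, code), n in sorted(tally_errors(SAMPLE).items()):
    print(day, hour + ":00", code, n)
```

To run it against a real log, pass an open file object to `tally_errors` instead of `SAMPLE`. A burst of one error code in a single hour would point at load spikes rather than a permanent connectivity problem.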

Any idea on how to improve the performance or what could be causing this?


Today I added a second gearmand service on the server, in case the issue is that a single gearmand cannot distribute all the jobs and collect the results. When I start the second gearmand service, all jobs go to that one instead of being split between the two; if I stop it, all jobs return to the first gearmand.

How can I load balance jobs between two gearmand servers? My module.conf configuration for the servers is:

server=127.0.0.1:4730,127.0.0.1:4731

and the workers are subscribed to both ports.

I have also tried splitting the workers so that each group talks to a different port, but the jobs still hang on the second gearmand.
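
To see how jobs are actually spread over the two instances, the queue status of each port can be polled directly. Below is a minimal sketch that sends the text admin command `status` of the Gearman protocol and flags queues that have waiting jobs but no registered worker. The helper names (`parse_status`, `fetch_status`, `starved`) are my own, and the sample reply is made-up data, not real output from my server.

```python
import socket


def parse_status(text):
    """Parse a gearmand 'status' reply into {queue: (total, running, workers)}."""
    queues = {}
    for line in text.splitlines():
        if line == ".":  # a lone dot terminates the reply
            break
        parts = line.split("\t")
        if len(parts) == 4:
            name, total, running, workers = parts
            queues[name] = (int(total), int(running), int(workers))
    return queues


def fetch_status(host, port, timeout=5.0):
    """Send 'status' to gearmand and read until the terminating dot line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection early
                break
            data += chunk
    return parse_status(data.decode("ascii", errors="replace"))


def starved(queues):
    """Names of queues with jobs waiting but zero available workers."""
    return sorted(
        name for name, (total, _running, workers) in queues.items()
        if total > 0 and workers == 0
    )


# Made-up sample reply, only to show the shape of the data (no network needed).
sample_reply = "service\t12\t4\t4\nhostgroup_network\t7\t0\t0\ncheck_results\t0\t0\t1\n.\n"
print(starved(parse_status(sample_reply)))

# Against live servers (ports taken from the server= line above):
#   for port in (4730, 4731):
#       print(port, starved(fetch_status("127.0.0.1", port)))
```

If one port shows all the queued jobs and the other shows none, the problem is in how jobs are submitted; if a queue has jobs but zero workers, the workers are not registering on that instance.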

Thanks.


I solved the issue by reducing the min / max worker settings in my workers' configuration, so the problem was not with the gearmand server.
This topic can be closed.