Sorry for getting back so late.
I still experience orphaned checks with mod_gearman. And currently I experience extreme load on one of my collectors (machine runs only mod_gearman_worker to execute checks).
It is very annoying because I did not do any specific, I did not add thousand more services to check to the collector. Quite strange.
Regarding orphaned checks: when the problem struck in I found that - for some reasons - check_results queue is "full" usually nagios or gearman's event broker module
immediately processes the check results. When problem happened the number of waiting checks in the check_results queue were 7323. Only solution was to restart nagios.
On another forum the explanation was the following for the problem above:
"The check_results queue is pilling up usually means that there is a problem with Nagios or the Mod-Gearman NEB module."
And now my second current problem:
As I mentioned earlier one of my collectors load is extremely high, between 40 and 60. I turned on neb debug and worker debug. What I saw the problem is that checks were
sent not the regular time as they were set in nagios config (3 mins) but in every 15-20 seconds. Check it out:
[2015-09-10 17:19:17][7023][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:19:36][18977][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:19:57][31012][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:20:16][18972][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:20:35][7023][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:20:54][18971][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:21:13][8427][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:21:32][12288][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:21:51][8251][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:22:12][14058][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:25:41][18762][DEBUG] got service job: host-name01 - interfaces
and sometimes it restores back to normal scheduling:
[2015-09-10 17:25:41][18762][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:28:42][10092][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:31:11][9525][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:33:28][17713][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:35:28][17714][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:37:20][9525][DEBUG] got service job: host-name01 - interfaces
Configuration of host-name01:
Code: Select all
Host Description Max. Check Attempts Normal Check Interval Retry Check Interal Obsess Over
host-name01 interfaces 3 0h 3m 0s 0h 1m 0s Yes
Do you have any idea why this is happening?
(collector has 16 cores, and configured to check about 200 hosts and 800-1000 services)
Thanks for help.