My colleague left and I have taken over the monitoring project.
Here is my configuration:
nagios 3.5.1-1.el6
Red Hat Enterprise Linux Server release 6.4 (Santiago)
gearmand : 1:0.33-2
mod_gearman : 1.4.14-1.el6
Two master servers and two slaves.
Master A with thruk, nagvis, pnp4nagios (10.100*) and slave A in the same network.
Master B in a different network (10.200*) and slave B in another one (10.128*).
Worker.conf on master A: server=localhost:4730 and dupserver=master_srvB:4730
Worker.conf on master B: server=master_srvA:4730 (same config for slaves A and B)
Worker.conf on master A: hostgroup=A
Worker.conf on master B: hostgroup=B
Worker.conf on slave A: hostgroup=C
Worker.conf on slave B: hostgroup=D
The Nagios configuration is synchronised between all servers (config files, Nagios objects, etc.).
So in nagios.cfg: broker_module=....local_hostgroup=nagios, hostgroups=A,B,C,D
Master server A does its work very well: when I run gearman_top there are no "jobs waiting" and plenty of "workers available".
Master server B suddenly ran out of disk space, in particular on "/". It became completely unresponsive and I had to reboot it.
After the reboot the jobs were handled correctly and everything seemed OK.
But then the workers stopped: when I run gearman_top on master server B, I see several thousand jobs waiting but no available worker.
Do you have a solution? Thank you.
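To make the symptom reproducible without gearman_top, here is a minimal Python 3 sketch that asks gearmand for its queue status and lists the queues that have waiting jobs but no registered worker. It assumes gearmand answers the plain-text `status` admin command on port 4730 with tab-separated lines (queue name, total jobs, running jobs, available workers) ended by a single `.` line. The helper names (`fetch_status`, `parse_status`, `starved`) are my own and not part of mod_gearman.

```python
import socket
from typing import List, NamedTuple


class QueueStatus(NamedTuple):
    name: str
    total: int    # jobs in the queue, running ones included
    running: int  # jobs currently being processed
    workers: int  # workers registered for this queue

    @property
    def waiting(self) -> int:
        # Jobs that are queued but not yet picked up by a worker.
        return self.total - self.running


def parse_status(text: str) -> List[QueueStatus]:
    """Parse the reply to the `status` admin command."""
    queues = []
    for line in text.splitlines():
        line = line.strip()
        if line == ".":  # end-of-reply marker
            break
        if not line:
            continue
        name, total, running, workers = line.split("\t")
        queues.append(QueueStatus(name, int(total), int(running), int(workers)))
    return queues


def fetch_status(host: str = "localhost", port: int = 4730,
                 timeout: float = 5.0) -> str:
    """Send `status` to gearmand and read until the terminating "." line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", errors="replace")


def starved(queues: List[QueueStatus]) -> List[QueueStatus]:
    """Queues with waiting jobs and no worker at all."""
    return [q for q in queues if q.waiting > 0 and q.workers == 0]


if __name__ == "__main__":
    for q in starved(parse_status(fetch_status("localhost", 4730))):
        print(f"{q.name}: {q.waiting} jobs waiting, no worker")
```

Run on master B, this should print one line per queue that is filling up with nobody to serve it, which is what gearman_top shows me interactively.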
Logs gearmand master server A :
ERROR 2017-02-29 10:59:03.000000 [ main ] write(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:213
ERROR 2017-02-29 11:03:04.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-29 11:03:04.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-29 11:03:21.000000 [ main ] write(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:213
ERROR 2017-02-29 11:43:46.000000 [ main ] write(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:213
ERROR 2017-02-29 12:33:47.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-29 12:33:47.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-29 12:34:22.000000 [ 4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-29 12:34:22.000000 [ 4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-29 12:34:25.000000 [ 4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-29 12:34:25.000000 [ 4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109

[2017-03-29 13:09:56][2403][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:09:56][3368][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:10][2401][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:10][2463][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:27][3943][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:28][3932][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:37][3999][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:42][4046][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:45][4087][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:11:01][4131][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:03][4176][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:05][4206][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:06][4215][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:06][4201][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:51][4533][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:52][4501][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:54][4577][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:55][4614][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
Logs gearmand master server B :
ERROR 2017-02-28 15:19:31.000000 [ 3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-28 15:19:31.000000 [ 3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-28 15:27:13.000000 [ 1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-28 15:27:13.000000 [ 1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-28 15:30:31.000000 [ 4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-28 15:30:31.000000 [ 4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-28 15:37:58.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-28 15:37:58.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-28 15:38:04.000000 [ 4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-28 15:38:04.000000 [ 4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-28 19:51:18.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-28 19:51:18.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-28 20:00:06.000000 [ 1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-28 20:00:06.000000 [ 1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-29 09:23:35.000000 [ 4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-29 09:23:35.000000 [ 4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-29 10:15:37.000000 [ 1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-29 10:15:37.000000 [ 1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-29 10:16:23.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-29 10:16:23.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-29 10:16:36.000000 [ 1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-29 10:16:36.000000 [ 1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-29 10:16:37.000000 [ 4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-29 10:16:37.000000 [ 4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-29 11:45:28.000000 [ 4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-29 11:45:28.000000 [ 4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-02-29 12:40:11.000000 [ 4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-02-29 12:40:11.000000 [ 4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
Logs nagios.log master server B :
[1490791434] Warning: The check of service 'Service Gestionaire de Licences Citrix' on host 'srvk3' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
I think there is a misconfiguration. I am thinking about a configuration like this:
mod_worker master A : server=localhost and dupserver=master B
mod_worker master B : server=localhost and dupserver=master A
mod_worker slave A : server=localhost and dupserver=master A
mod_worker slave B : server=localhost and dupserver=master B
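
To compare the two layouts, here is a small Python sketch that groups the workers by the gearmand they fetch jobs from. It rests on an assumption I have not verified on my installation: that a worker only pulls jobs from its `server` entry, while `dupserver` merely receives a copy of the results. The host labels are the ones used in this post; the function names are made up for illustration.

```python
# `server` setting of each worker. "localhost" means the gearmand running on
# the worker's own host. Assumption: jobs are fetched from `server` only.
CURRENT = {
    "master A": "localhost",
    "master B": "master A",
    "slave A": "master A",
    "slave B": "master A",
}

PROPOSED = {
    "master A": "localhost",
    "master B": "localhost",
    "slave A": "localhost",
    "slave B": "localhost",
}


def workers_by_gearmand(layout):
    """Group worker hosts by the gearmand host named in their `server` setting."""
    groups = {}
    for worker_host, server in layout.items():
        gearmand_host = worker_host if server == "localhost" else server
        groups.setdefault(gearmand_host, []).append(worker_host)
    return groups


def gearmands_without_workers(layout, gearmand_hosts):
    """Gearmand hosts that no worker fetches jobs from."""
    groups = workers_by_gearmand(layout)
    return [host for host in gearmand_hosts if host not in groups]


if __name__ == "__main__":
    masters = ["master A", "master B"]
    for label, layout in (("current", CURRENT), ("proposed", PROPOSED)):
        print(label, workers_by_gearmand(layout),
              "without workers:", gearmands_without_workers(layout, masters))
```

Under that assumption, the current layout leaves the gearmand on master B without any worker, which would match the thousands of waiting jobs I see there. In the proposed layout every `server=localhost` entry presupposes a gearmand running on that same host, slaves included.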