
No more worker available

Posted: Thu Mar 30, 2017 4:43 am
by musklor
Hello, I'm new to Nagios and self-taught.

My colleague left and I took over the monitoring project.

Here is my configuration:

nagios 3.5.1-1.el6

Red Hat Enterprise Linux Server release 6.4 (Santiago)

gearmand : 1:0.33-2
mod_gearman : 1.4.14-1.el6

Two master servers and two slaves.

Master A, with Thruk, NagVis and PNP4Nagios (10.100*), and slave A are in the same network.
Master B is in a different network (10.200*) and slave B is in another one (10.128*).

Worker.conf on master A: server=localhost:4730 and dupserver=master_srvB:4730
Worker.conf on master B: server=master_srvA:4730 (same config for slaves A and B)

Worker.conf on master A: hostgroup=A
Worker.conf on master B: hostgroup=B
Worker.conf on slave A: hostgroup=C
Worker.conf on slave B: hostgroup=D

The Nagios configuration is synchronised between all servers (config files, Nagios objects, etc.).

So, in nagios.cfg: broker_module=....local_hostgroup=nagios, hostgroups=A,B,C,D
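To make the layout above easier to reason about, here is a minimal Python sketch that maps each worker to the job server it pulls jobs from. It rests on two assumptions that are not stated in the post: that a mod_gearman worker only fetches jobs from its `server` entries while `dupserver` only receives duplicated results, and that "same config" means both slaves also use server=master_srvA. The `WORKERS` table and both function names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Worker settings as described in the post (assumption: "localhost" on
# master A is the same gearmand that the others reach as master_srvA).
WORKERS: Dict[str, Dict[str, str]] = {
    "master A": {"server": "master_srvA", "dupserver": "master_srvB"},
    "master B": {"server": "master_srvA"},
    "slave A": {"server": "master_srvA"},
    "slave B": {"server": "master_srvA"},
}


def workers_per_job_server(workers: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Group worker names by the job server named in their `server` entry."""
    served: Dict[str, List[str]] = defaultdict(list)
    for worker, conf in workers.items():
        served[conf["server"]].append(worker)
    return dict(served)


def job_servers_without_workers(
    workers: Dict[str, Dict[str, str]], job_servers: List[str]
) -> List[str]:
    """Return the job servers that no worker pulls jobs from."""
    served = workers_per_job_server(workers)
    return [name for name in job_servers if name not in served]


# Example:
# job_servers_without_workers(WORKERS, ["master_srvA", "master_srvB"])
```

Under those assumptions, no worker lists master_srvB as its `server`, which would be worth checking against the real worker.conf files.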

Master server A does its job very well: when I run gearman_top there are no jobs waiting ("jobs waiting") and plenty of available workers ("workers available").

Master server B suddenly ran out of disk space, in particular on "/". It was barely usable, so I had to reboot it.

After the reboot the jobs were handled properly and everything seemed OK.
But then the workers stopped: when I run gearman_top on master server B, I see several thousand jobs waiting but no available workers.

Do you have a solution? Thank you.
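The same numbers that gearman_top displays can also be read directly from gearmand with the text command `status` of the Gearman administrative protocol. The following is a minimal Python sketch, not a replacement for gearman_top: it assumes the reply has one tab-separated line per queue (name, total jobs, running jobs, available workers) terminated by a line containing only ".", and that "jobs waiting" is total minus running. The names `QueueStatus`, `fetch_status`, `parse_status` and `starved_queues` are invented for this example.

```python
import socket
from typing import List, NamedTuple


class QueueStatus(NamedTuple):
    name: str
    total: int    # jobs in the queue (waiting + running)
    running: int  # jobs currently being processed
    workers: int  # workers registered for this queue


def fetch_status(host: str = "localhost", port: int = 4730, timeout: float = 5.0) -> str:
    """Send `status` to gearmand and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        # The reply is terminated by a line that contains only ".".
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("utf-8", errors="replace")


def parse_status(text: str) -> List[QueueStatus]:
    """Parse the tab-separated status reply into QueueStatus records."""
    queues = []
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 4:
            continue  # skips the terminating "." and any blank line
        name, total, running, workers = parts
        queues.append(QueueStatus(name, int(total), int(running), int(workers)))
    return queues


def starved_queues(queues: List[QueueStatus]) -> List[QueueStatus]:
    """Queues that have jobs waiting but no worker registered."""
    return [q for q in queues if q.total - q.running > 0 and q.workers == 0]


# Example (needs a reachable gearmand):
# for q in starved_queues(parse_status(fetch_status("localhost", 4730))):
#     print(q.name, q.total - q.running)
```

Running it against both masters would show which queue names pile up without workers, which is more specific than the totals alone.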


gearmand logs on master server A:

Code: Select all

ERROR 2017-02-29 10:59:03.000000 [  main ] write(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:213
  ERROR 2017-02-29 11:03:04.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 11:03:04.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 11:03:21.000000 [  main ] write(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:213
  ERROR 2017-02-29 11:43:46.000000 [  main ] write(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:213
  ERROR 2017-02-29 12:33:47.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 12:33:47.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 12:34:22.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 12:34:22.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 12:34:25.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 12:34:25.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
mod_gearman_worker logs on master server A:

Code: Select all

[2017-03-29 13:09:56][2403][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:09:56][3368][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:10][2401][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:10][2463][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:27][3943][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:28][3932][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:37][3999][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:42][4046][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:45][4087][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:11:01][4131][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:03][4176][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:05][4206][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:06][4215][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:06][4201][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:51][4533][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:52][4501][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:54][4577][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:55][4614][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
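
A long excerpt like the one above is easier to read once the errors are grouped by cause. Here is a small Python sketch that counts the reason given inside `connect_poll(...)` for each ERROR line. The regular expressions are based only on the line format visible in this excerpt, and the function name `summarize` is made up.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as: [2017-03-29 13:09:56][2403][ERROR] message...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<pid>\d+)\]\[(?P<level>[A-Z]+)\]\s+(?P<msg>.*)$"
)
# Extracts the text between the parentheses of connect_poll(...).
REASON_RE = re.compile(r"connect_poll\((?P<reason>[^)]*)\)")


def summarize(lines: Iterable[str]) -> Counter:
    """Count ERROR lines per connect_poll reason; anything else is 'other'."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("level") != "ERROR":
            continue
        reason = REASON_RE.search(match.group("msg"))
        counts[reason.group("reason") if reason else "other"] += 1
    return counts


# Example (the log path is a placeholder):
# with open("mod_gearman_worker.log") as handle:
#     print(summarize(handle))
```

On the excerpt above this gives nine "Connection refused" errors followed by nine "GEARMAN_TIMEOUT" errors, so the failures change from refused connections to timeouts at about 13:11.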

Gearman logs on master server B:

Code: Select all

  ERROR 2017-02-28 15:19:31.000000 [     3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 15:19:31.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 15:27:13.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 15:27:13.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 15:30:31.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 15:30:31.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 15:37:58.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 15:37:58.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 15:38:04.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 15:38:04.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 19:51:18.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 19:51:18.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 20:00:06.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 20:00:06.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 09:23:35.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 09:23:35.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 10:15:37.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 10:15:37.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 10:16:23.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 10:16:23.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 10:16:36.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 10:16:36.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 10:16:37.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 10:16:37.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 11:45:28.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 11:45:28.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 12:40:11.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 12:40:11.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109

nagios.log on master server B:

[1490791434] Warning: The check of service 'Service Gestionaire de Licences Citrix' on host 'srvk3' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...

I think there is a misconfiguration; maybe this diagram is the solution:

[Image]

I'm thinking about a configuration like this:
mod_worker master A: server=localhost and dupserver=master B
mod_worker master B: server=localhost and dupserver=master A
mod_worker slave A: server=localhost and dupserver=master A
mod_worker slave B: server=localhost and dupserver=master B

Re: No more worker available

Posted: Thu Mar 30, 2017 3:52 pm
by ssax
Let's turn debugging on for the gearmand module on the master server and on one of the workers. Enabling debugging is as easy as changing:

Code: Select all

debug=0

To:

Code: Select all

debug=3

This will quickly fill up the logs though, so keep an eye on them.

So, you should enable debug=3 on the Nagios server for the gearmand daemon and the local worker, and on one of the external workers, and then send us the logs (whatever is specified by the logfile option).
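
If the setting has to be flipped on several machines, the change can be scripted. The following Python sketch rewrites a `key=value` option in the text of a config file. The function name `set_option` and the sample text are invented, and the location of the real config files is not given in this thread, so the path in the example is a placeholder.

```python
import re


def set_option(config_text: str, key: str, value: str) -> str:
    """Set `key=value` in key=value style config text.

    Every uncommented `key=` line is rewritten; if none exists,
    the option is appended at the end.
    """
    pattern = re.compile(rf"^(\s*{re.escape(key)}\s*=).*$", re.MULTILINE)
    # A function replacement avoids backreference clashes with digits in value.
    new_text, count = pattern.subn(lambda m: m.group(1) + value, config_text)
    if count == 0:
        new_text = config_text.rstrip("\n") + f"\n{key}={value}\n"
    return new_text


# Example (placeholder path, adjust to the real config file):
# path = "/path/to/worker.conf"
# with open(path) as handle:
#     updated = set_option(handle.read(), "debug", "3")
# with open(path, "w") as handle:
#     handle.write(updated)
```

The services still need a restart afterwards to pick up the new value, and debug should go back to 0 once the logs have been collected.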

Thank you

Re: No more worker available

Posted: Thu Mar 30, 2017 3:54 pm
by ssax
Does it start working again if you run service gearmand restart?

Re: No more worker available

Posted: Fri Mar 31, 2017 3:06 am
by musklor
ssax wrote: Does it start working again if you run service gearmand restart?

No...

Re: No more worker available

Posted: Fri Mar 31, 2017 3:24 am
by musklor
Edited

Re: No more worker available

Posted: Fri Mar 31, 2017 3:50 am
by musklor
ssax wrote: So, you should enable debug=3 on the Nagios server for the gearmand daemon and the local worker, and on one of the external workers, and then send us the logs (whatever is specified by the logfile option).

Thanks for the quick answer.

Here is a zip of the files, thanks.

Re: No more worker available

Posted: Fri Mar 31, 2017 2:13 pm
by avandemore
From the bad server, what is the output of:

Code: Select all

netstat -ant
Please attach /var/log/messages from it as well.

You may wish to review this as well: https://support.nagios.com/forum/viewto ... 3&start=20
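
When the netstat -ant output is long, it helps to reduce it to the connections that involve the gearmand port. Here is a minimal Python sketch that counts TCP states for port 4730 (the port used in the worker.conf lines earlier in the thread). It assumes the usual Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), and the function name `states_for_port` is made up.

```python
from collections import Counter


def states_for_port(netstat_output: str, port: int = 4730) -> Counter:
    """Count TCP connection states where either end uses the given port."""
    counts: Counter = Counter()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a tcp/tcp6 row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix) or remote.endswith(suffix):
            counts[state] += 1
    return counts


# Example:
# import subprocess
# output = subprocess.run(["netstat", "-ant"], capture_output=True, text=True).stdout
# print(states_for_port(output))
```

A missing LISTEN entry, or many connections stuck in states other than ESTABLISHED, would be the first things to look for in the result.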

Re: No more worker available

Posted: Sat Apr 01, 2017 2:15 am
by musklor
avandemore wrote: From the bad server, what is the output of netstat -ant? Please attach /var/log/messages from it as well. You may wish to review this as well: https://support.nagios.com/forum/viewto ... 3&start=20

Here it is! Thank you.

I'm not authorized to view that topic, though.

Re: No more worker available

Posted: Sun Apr 02, 2017 6:20 pm
by dwhitfield
That's a three page long thread on performance issues, but unfortunately it's in the customer forums. If you are a customer and you don't have access, please email [email protected] and they can get you set up.

TL;DR on the thread is you want to check out https://assets.nagios.com/downloads/nag ... giosXI.pdf

Re: No more worker available

Posted: Mon Apr 03, 2017 4:33 am
by musklor
Hello, you can close the thread or delete all my posts.
Thanks.