No more worker available

Locked
musklor
Posts: 6
Joined: Tue Mar 28, 2017 9:21 am

Post by musklor »

Hello, I'm new to Nagios and self-taught.

My colleague left and I took over the monitoring project.

Here is my configuration:

nagios 3.5.1-1.el6

Red Hat Enterprise Linux Server release 6.4 (Santiago)

gearmand : 1:0.33-2
mod_gearman : 1.4.14-1.el6

Two master servers and two slaves.

Master A runs Thruk, NagVis, and PNP4Nagios (10.100*), with slave A in the same network.
Master B is in a different network (10.200*) and slave B is in yet another one (10.128*).

Worker.conf on master A: server=localhost:4730 and dupserver=master_srvB:4730
Worker.conf on master B: server=master_srvA:4730 (same config for slaves A and B)

Worker.conf on master A: hostgroup=A
Worker.conf on master B: hostgroup=B
Worker.conf on slave A: hostgroup=C
Worker.conf on slave B: hostgroup=D

The Nagios configuration is synchronized between all servers (conf, Nagios objects, etc.).

So, in nagios.cfg: broker_module=....local_hostgroup=nagios, hostgroups=A,B,C,D

Master server A does its job very well: when I run gearman_top I see no jobs waiting ("jobs waiting") and many available workers ("workers available").

Master server B suddenly ran out of disk space, in particular on "/". It became completely unresponsive and I had to reboot it.

After the reboot the jobs were handled correctly and everything seemed OK.
But then the workers stopped: when I run gearman_top on master server B I see several thousand jobs waiting but no available workers.
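
For anyone who wants to reproduce the gearman_top check from a script: gearmand answers a plain-text "status" command on its port with one tab-separated line per queue (name, total jobs, running jobs, available workers) and a lone "." at the end. The Python sketch below parses such a reply and lists queues that have jobs waiting but no workers, which is the symptom described above. The helper names are made up for illustration, and port 4730 is assumed from the Worker.conf entries.

```python
import socket
from typing import List, NamedTuple


class QueueStatus(NamedTuple):
    name: str
    total: int    # jobs in the queue (waiting + running)
    running: int  # jobs currently being executed
    workers: int  # workers registered for this queue

    @property
    def waiting(self) -> int:
        return self.total - self.running


def parse_status(text: str) -> List[QueueStatus]:
    """Parse the reply to gearmand's "status" admin command."""
    queues = []
    for line in text.splitlines():
        if line == ".":  # a lone dot terminates the reply
            break
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skip anything that is not a queue line
        name, total, running, workers = parts
        queues.append(QueueStatus(name, int(total), int(running), int(workers)))
    return queues


def stalled(queues: List[QueueStatus]) -> List[QueueStatus]:
    """Queues with jobs waiting and no worker to take them."""
    return [q for q in queues if q.waiting > 0 and q.workers == 0]


def fetch_status(host: str = "localhost", port: int = 4730,
                 timeout: float = 5.0) -> str:
    """Send "status" to gearmand and read until the terminating dot."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode()
```

Calling stalled(parse_status(fetch_status("localhost"))) on master B should then name the affected queues, provided gearmand is reachable at all.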

Do you have a solution? Thank you.


gearmand logs on master server A:

ERROR 2017-02-29 10:59:03.000000 [  main ] write(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:213
  ERROR 2017-02-29 11:03:04.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 11:03:04.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 11:03:21.000000 [  main ] write(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:213
  ERROR 2017-02-29 11:43:46.000000 [  main ] write(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:213
  ERROR 2017-02-29 12:33:47.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 12:33:47.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 12:34:22.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 12:34:22.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 12:34:25.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 12:34:25.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
mod_gearman_worker logs on master server A:

[2017-03-29 13:09:56][2403][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:09:56][3368][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:10][2401][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:10][2463][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:27][3943][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:28][3932][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:37][3999][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:42][4046][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:10:45][4087][ERROR] sending job to gearmand failed: connect_poll(Connection refused) getsockopt() failed -> libgearman/connection.cc:104
[2017-03-29 13:11:01][4131][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:03][4176][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:05][4206][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:06][4215][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:06][4201][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:51][4533][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:52][4501][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:54][4577][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
[2017-03-29 13:11:55][4614][ERROR] sending job to gearmand failed: connect_poll(GEARMAN_TIMEOUT) timeout occurred while trying to connect -> libgearman/connection.cc:109
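
The worker log above switches from "Connection refused" to "GEARMAN_TIMEOUT" at about 13:11. A small Python sketch along the following lines can count the error reasons in such a log and note when each one first appeared. It assumes only the [timestamp][pid][LEVEL] message layout visible in these lines; the function name is illustrative.

```python
import re

# Layout as seen in the log excerpt: [timestamp][pid][LEVEL] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<pid>\d+)\]\[(?P<level>\w+)\]\s*(?P<msg>.*)$"
)
# The reason is the text inside connect_poll(...)
REASON_RE = re.compile(r"connect_poll\(([^)]*)\)")


def summarize(lines):
    """Map each error reason to (count, timestamp of first occurrence)."""
    summary = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "ERROR":
            continue
        found = REASON_RE.search(m.group("msg"))
        reason = found.group(1) if found else "other"
        count, first_seen = summary.get(reason, (0, m.group("ts")))
        summary[reason] = (count + 1, first_seen)
    return summary
```

Run over the whole file, for example summarize(open(path)) with path set to whatever the logfile option points to, this shows at a glance whether gearmand was refusing connections (not listening) or simply not answering in time.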

gearmand logs on master server B:

  ERROR 2017-02-28 15:19:31.000000 [     3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 15:19:31.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 15:27:13.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 15:27:13.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 15:30:31.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 15:30:31.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 15:37:58.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 15:37:58.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 15:38:04.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 15:38:04.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 19:51:18.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 19:51:18.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-28 20:00:06.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-28 20:00:06.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 09:23:35.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 09:23:35.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 10:15:37.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 10:15:37.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 10:16:23.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 10:16:23.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 10:16:36.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 10:16:36.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 10:16:37.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 10:16:37.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 11:45:28.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 11:45:28.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-02-29 12:40:11.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-02-29 12:40:11.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109

nagios.log on master server B:

[1490791434] Warning: The check of service 'Service Gestionaire de Licences Citrix' on host 'srvk3' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...

I think there is a misconfiguration; maybe this diagram is the solution:

Image

I'm thinking about a configuration like this:
mod_worker master A : server=localhost and dupserver=master B
mod_worker master B : server=localhost and dupserver=master A
mod_worker slave A : server=localhost and dupserver=master A
mod_worker slave B : server=localhost and dupserver=master B
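
Before changing the topology, it may be worth confirming that each worker host can actually open a TCP connection to the gearmand addresses in its server and dupserver entries, since the worker log shows both refused connections and timeouts. A minimal Python sketch, assuming the host:port form used in Worker.conf above and the default port 4730 when none is given; the function names are made up for illustration.

```python
import socket


def split_hostport(entry, default_port=4730):
    """Split 'host:port' into (host, port); fall back to the default port."""
    host, sep, port = entry.rpartition(":")
    if not sep:
        return entry, default_port
    return host, int(port)


def probe(host, port, timeout=3.0):
    """Try a TCP connection and report 'open', 'refused', 'timeout' or an error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host reachable, nothing listening on the port
    except socket.timeout:
        return "timeout"   # no answer, e.g. filtered or host down
    except OSError as exc:
        return f"error: {exc}"  # name resolution failure, no route, ...


def probe_entries(entries, timeout=3.0):
    """Probe every server/dupserver entry and return {entry: result}."""
    return {e: probe(*split_hostport(e), timeout=timeout) for e in entries}
```

For example, probe_entries(["localhost:4730", "master_srvB:4730"]) run on master A covers its two entries. A "refused" result points at gearmand not running or not listening, while "timeout" points more towards the network between the sites.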
Last edited by musklor on Sat Apr 01, 2017 2:37 am, edited 7 times in total.
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: No more worker available

Post by ssax »

Let's turn debugging on for the gearmand module on the master server and on one of the workers. Enabling debugging is as easy as changing:

debug=0
To:

debug=3
This will quickly fill up the logs though, so keep an eye on them.

So, you should enable debug=3 on the Nagios server for the gearmand daemon and the local worker, and on one of the external workers, and then send us the logs (whatever is specified by the logfile option).

Thank you
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: No more worker available

Post by ssax »

Does it start working again if you run service gearmand restart?
musklor
Posts: 6
Joined: Tue Mar 28, 2017 9:21 am

Re: No more worker available

Post by musklor »

ssax wrote: Does it start working again if you run service gearmand restart?
No...
musklor
Posts: 6
Joined: Tue Mar 28, 2017 9:21 am

Re: No more worker available

Post by musklor »

Edited
Last edited by musklor on Fri Mar 31, 2017 8:45 am, edited 1 time in total.
musklor
Posts: 6
Joined: Tue Mar 28, 2017 9:21 am

Re: No more worker available

Post by musklor »

ssax wrote: Let's turn debugging on for the gearmand module on the master server and on one of the workers. Enabling debugging is as easy as changing:

debug=0
To:

debug=3
This will quickly fill up the logs though, so keep an eye on them.

So, you should enable debug=3 on the Nagios server for the gearmand daemon and the local worker, and on one of the external workers, and then send us the logs (whatever is specified by the logfile option).

Thank you
Thanks for the quick answer.

Here is a zip archive of the files, thanks.
Last edited by musklor on Mon Apr 03, 2017 4:28 am, edited 1 time in total.
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: No more worker available

Post by avandemore »

From the bad server, what is the output of:

netstat -ant
Please attach /var/log/messages from it as well.
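
When reading the netstat -ant output, the lines that matter here are the ones involving the gearmand port. As a rough aid, this Python sketch tallies TCP states for connections whose local or remote address ends in :4730. The function name is illustrative, and the column layout assumed is the usual Linux one (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State).

```python
from collections import Counter


def states_for_port(netstat_output, port=4730):
    """Count TCP connection states for sockets that involve the given port."""
    counts = Counter()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP socket line
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix) or remote.endswith(suffix):
            counts[state] += 1
    return counts
```

Fed with the saved command output, it gives a quick picture of how many gearmand connections are ESTABLISHED compared with half-closed or pending states.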

You may wish to review this as well: https://support.nagios.com/forum/viewto ... 3&start=20
Previous Nagios employee
musklor
Posts: 6
Joined: Tue Mar 28, 2017 9:21 am

Re: No more worker available

Post by musklor »

avandemore wrote: From the bad server, what is the output of:

netstat -ant
Please attach /var/log/messages from it as well.

You may wish to review this as well: https://support.nagios.com/forum/viewto ... 3&start=20
Here it is! Thank you.

I'm not authorized to view that topic.
Last edited by musklor on Mon Apr 03, 2017 4:30 am, edited 2 times in total.
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: No more worker available

Post by dwhitfield »

That's a three page long thread on performance issues, but unfortunately it's in the customer forums. If you are a customer and you don't have access, please email [email protected] and they can get you set up.

TL;DR on the thread is you want to check out https://assets.nagios.com/downloads/nag ... giosXI.pdf
musklor
Posts: 6
Joined: Tue Mar 28, 2017 9:21 am

Re: No more worker available

Post by musklor »

Hello, you can close or delete all my posts.
Thanks.