Worker shows up in Unhandled hosts "mod_gearman_worker down"

Posted: Mon Jun 08, 2015 1:35 am
by litsupport.box
Hello,

Today (not for the first time) my worker showed up under "Unhandled hosts". The host responded to ping, so it was up. I logged into the worker over SSH and ran "service mod_gearman_worker status", which reported that the service was not running.

How can I check why my mod_gearman worker stopped working?

Please advise on how to investigate/repair this.

Re: Worker shows up in Unhandled hosts "mod_gearman_worker d

Posted: Mon Jun 08, 2015 9:22 am
by lmiltchev
Is the "mod_gearman_worker" set up to start automatically on reboot? What is the output of the following command?

Code:

chkconfig --list | grep mod_gearman_worker

Re: Worker shows up in Unhandled hosts "mod_gearman_worker d

Posted: Tue Jun 09, 2015 1:56 am
by litsupport.box

Code:

]# chkconfig --list | grep mod_gearman_worker
mod_gearman_worker      0:off   1:off   2:on    3:on    4:on    5:on    6:off

I guess it is.

Re: Worker shows up in Unhandled hosts "mod_gearman_worker d

Posted: Tue Jun 09, 2015 10:01 am
by jdalrymple
A few things:

By default there is a log at /var/log/mod_gearman/mod_gearman_worker.log

Look there for clues; I bet there will be some. If not, raise the debug level in /etc/mod_gearman/mod_gearman_worker.conf to 2:

Code:

# use debug to increase the verbosity of the module.
# Possible values are:
#     0 = only errors
#     1 = debug messages
#     2 = trace messages
#     3 = trace and all gearman related logs are going to stdout.
# Default is 0.
debug=2

Is the worker on the same host as Nagios? I assume not.
Does the Linux box hosting the worker have any monitoring enabled (load, memory, etc.)?
Do you have multiple workers, and if so, did any others exhibit weirdness?
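
As a rough sketch of the log check suggested above, the helper below prints only the most recent error lines from a worker log. It assumes that error lines carry a literal [ERROR] tag, which you should confirm against your own log; recent_errors is a made-up name, not something mod_gearman ships.

```shell
#!/bin/sh
# Print the last N lines of a worker log that are tagged [ERROR].
# recent_errors is a hypothetical helper name, not part of mod_gearman.
# Assumption: error lines contain the literal string "[ERROR]".
recent_errors() {
    logfile=$1
    count=${2:-20}   # default to the last 20 matching lines
    # -F treats the pattern as a fixed string, so the brackets are literal.
    grep -F '[ERROR]' "$logfile" | tail -n "$count"
}

# Example, using the default log path mentioned above:
# recent_errors /var/log/mod_gearman/mod_gearman_worker.log 50
```

If this prints nothing, raising the debug level as described above is the next step.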

Re: Worker shows up in Unhandled hosts "mod_gearman_worker d

Posted: Thu Jun 11, 2015 2:05 am
by litsupport.box
This is what I see in the log:

There are a great many of these:
[3423][INFO ] no checks in 2minutes, restarting all workers
[1431][INFO ] no checks in 2minutes, restarting all workers

Earlier that day:
[2015-06-08 00:13:52][1432][INFO ] no checks in 2minutes, restarting all workers
[2015-06-08 00:18:54][1432][INFO ] no checks in 2minutes, restarting all workers
[2015-06-08 00:23:54][1432][INFO ] no checks in 2minutes, restarting all workers
[2015-06-08 00:32:32][1432][INFO ] mod_gearman worker daemon started with pid 1432
[2015-06-08 00:32:32][1432][INFO ] found pid file for: 1432
[2015-06-08 00:32:32][1432][INFO ] pidfile already exists, cannot start!
[2015-06-08 08:36:23][3423][INFO ] mod_gearman worker daemon started with pid 3423
[2015-06-08 08:36:23][3423][INFO ] found pid file for: 1432
[2015-06-08 08:36:23][3423][INFO ] removed stale pidfile
[2015-06-08 08:36:24][3423][INFO ] no checks in 2minutes, restarting all workers

Later that day:
[2015-06-08 09:05:52][4091][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:05:54][4092][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:00][3423][INFO ] no checks in 2minutes, restarting all workers
[2015-06-08 09:06:14][4092][ERROR] client error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:14][4092][ERROR] cannot start client
[2015-06-08 09:06:24][4102][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:24][4100][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:24][4101][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:24][4099][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:24][4103][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:35][4104][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:44][4102][ERROR] client error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:44][4102][ERROR] cannot start client
[2015-06-08 09:06:44][4101][ERROR] client error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:44][4101][ERROR] cannot start client
[2015-06-08 09:06:44][4099][ERROR] client error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:44][4099][ERROR] cannot start client
[2015-06-08 09:06:44][4100][ERROR] client error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:44][4100][ERROR] cannot start client
[2015-06-08 09:06:44][4103][ERROR] client error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:44][4103][ERROR] cannot start client
[2015-06-08 09:06:55][4104][ERROR] client error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:06:55][4104][ERROR] cannot start client
[2015-06-08 09:07:05][4108][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:07:05][4110][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:07:05][4107][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:07:05][4111][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211
[2015-06-08 09:07:05][4109][ERROR] worker error: gearman_connection_create_args(GEARMAN_GETADDRINFO) -> libgearman/connection.cc:211

At 11:07 I restarted the mod_gearman_worker service.
Around the same time I received the email saying that my worker was unreachable:

[2015-06-08 11:02:56][3423][INFO ] no checks in 2minutes, restarting all workers
[2015-06-08 11:12:36][1431][INFO ] mod_gearman worker daemon started with pid 1431
[2015-06-08 11:12:36][1431][INFO ] found pid file for: 3423
[2015-06-08 11:12:36][1431][INFO ] removed stale pidfile
[2015-06-08 11:12:38][1431][INFO ] no checks in 2minutes, restarting all workers
------------------------------------------------------------------------------------------------------------------------------------

Is the worker on the same host as Nagios?
No. This worker runs on a completely different host from the Nagios XI server, as do 15 others.

Does the Linux box hosting the worker have any monitoring enabled, load, memory, etc?
No. Do you have any tips?

Do you have multiple workers and if so did any others exhibit weirdness?
Yes, there are 15+ others around the world. This is maybe the second or third time this has happened to one of my workers. Every time, restarting the service or the machine fixes it, but that is only a workaround for me. I did not investigate the earlier occurrences.

Re: Worker shows up in Unhandled hosts "mod_gearman_worker d

Posted: Thu Jun 11, 2015 2:45 pm
by tgriep
I found this description for the GEARMAN_GETADDRINFO error: "Name resolution failed for a host."
Was there a DNS issue on your network at the time of the failure?
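
One way to test name resolution from the worker itself is sketched below. It assumes the worker config lists its gearman server as "server=host:port" lines, which you should verify in your own file, and it only looks at the first host on each line. Both function names are made up for illustration.

```shell
#!/bin/sh
# List the host part of every "server=" line in a worker config.
# Assumption: entries look like "server=host:port". Only the first
# host on each line is extracted; commented-out lines are skipped.
gearman_hosts() {
    sed -n 's/^[[:space:]]*server[[:space:]]*=[[:space:]]*\([^:,[:space:]]*\).*/\1/p' "$1"
}

# Try to resolve each host with getent and report the result.
check_resolution() {
    for host in $(gearman_hosts "$1"); do
        if getent hosts "$host" >/dev/null; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
        fi
    done
}

# Example:
# check_resolution /etc/mod_gearman/mod_gearman_worker.conf
```

Running this from cron every minute and logging the output would show whether resolution fails intermittently, even if DNS looked fine when checked by hand.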

Re: Worker shows up in Unhandled hosts "mod_gearman_worker d

Posted: Mon Jun 15, 2015 4:57 am
by litsupport.box
There were no DNS issues at that moment.

Today it happened again. I checked the log to see what happened:

Code:

[2015-06-15 11:10:24][1420][DEBUG] --------------------------------
[2015-06-15 11:10:24][1431][INFO ] mod_gearman worker daemon started with pid 1431
[2015-06-15 11:10:24][1431][DEBUG] Version 1.5.0b1
[2015-06-15 11:10:24][1431][DEBUG] running on libgearman 1.1.8
[2015-06-15 11:10:24][1431][INFO ] found pid file for: 1431
[2015-06-15 11:10:24][1431][INFO ] pidfile already exists, cannot start!
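
In this excerpt the daemon starts with pid 1431 and finds a pid file that also names 1431. One possible reading, which is a guess and not something confirmed in the log, is that the pid file was left over from an earlier run and the new process happened to get the same PID. A minimal sketch for checking whether a pid file is stale follows; pidfile_state is a hypothetical helper, and the pid file path is whatever the init script on the worker uses.

```shell
#!/bin/sh
# Report whether a pid file points at a running mod_gearman process.
# pidfile_state is a hypothetical helper; pass the pid file path that
# your init script actually uses (not shown in this thread).
pidfile_state() {
    pidfile=$1
    [ -f "$pidfile" ] || { echo "missing"; return; }
    pid=$(cat "$pidfile")
    # On Linux, /proc/<pid>/comm holds the (truncated) process name.
    if [ -n "$pid" ] && [ -r "/proc/$pid/comm" ] &&
       grep -q 'mod_gearman' "/proc/$pid/comm"; then
        echo "running"
    else
        echo "stale"
    fi
}

# Example (replace the argument with the real pid file path):
# pidfile_state /path/to/worker.pid
```

A result of "stale" while the service is down would mean the pid file can be removed before starting the worker again.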

Re: Worker shows up in Unhandled hosts "mod_gearman_worker d

Posted: Mon Jun 15, 2015 10:22 am
by tgriep
Could you upload the following files so we can review them?

Code:

/var/log/mod_gearman/mod_gearman_worker.log
/etc/mod_gearman/mod_gearman_worker.conf

Re: Worker shows up in Unhandled hosts "mod_gearman_worker d

Posted: Tue Jun 16, 2015 1:38 am
by litsupport.box
mod_gearman_worker.conf
mod_gearman_workeredited.txt
I had to cut a lot out of the log because it was 22 MB. I left only the day it happened again.

Re: Worker shows up in Unhandled hosts "mod_gearman_worker d

Posted: Tue Jun 16, 2015 10:07 am
by tgriep
In the log file, it looks like the worker hit its max-jobs limit, which your config file sets to 1000. Here is the description of that option:
max-jobs
Controls the amount of jobs a worker will do before he exits.
Use this to control how fast the amount of workers will go down after high load times. Disabled when set to 0. Default: 1000

It looks like the worker is restarted when this limit is hit. According to the log file, that happened on your system, but the worker did not come back up.
You may need to look in other log files to see why it did not restart. Check the messages or debug logs for more information.

You can try setting max-jobs to 0 to disable the limit. That should prevent these restarts and avoid the issue.
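
A minimal sketch of that change is below. It assumes the config already contains an uncommented "max-jobs=" line; a commented-out line would be left untouched. disable_max_jobs is a made-up helper name.

```shell
#!/bin/sh
# Set an existing "max-jobs=" line to 0, keeping a .bak copy of the
# original file. disable_max_jobs is a hypothetical helper name.
# Assumption: the option appears uncommented as "max-jobs=<number>".
disable_max_jobs() {
    conf=$1
    cp -p "$conf" "$conf.bak" &&
    sed 's/^[[:space:]]*max-jobs[[:space:]]*=.*/max-jobs=0/' "$conf.bak" > "$conf"
}

# Example, followed by a restart so the worker rereads its config:
# disable_max_jobs /etc/mod_gearman/mod_gearman_worker.conf
# service mod_gearman_worker restart
```

The .bak copy makes it easy to revert if disabling the limit has side effects under high load.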