rkennedy wrote:Can you show us the full output of gearman_top2 from the gearman machine?
Code: Select all
2017-01-26 14:56:50 - localhost:4730 - v0.33
Queue Name | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------------------------------
check_results | 1 | 0 | 0
eventhandler | 112 | 0 | 0
host | 82 | 0 | 0
hostgroup_BPI_HG | 17 | 0 | 0
hostgroup_NetDevCeragonHG | 7 | 0 | 0
hostgroup_NetDevHG | 7 | 0 | 1
hostgroup_mgmon002WorkerHostGroup | 19 | 0 | 2
service | 29 | 0 | 3
servicegroup_CallCenter_SG | 0 | 7 | 0
servicegroup_MSSQL_Service-Group | 11 | 0 | 0
servicegroup_MySQL_Service-Group | 11 | 0 | 0
servicegroup_NetDevS-Scada-Ceragon | 7 | 0 | 0
servicegroup_NetDevS-Scada-R100 | 7 | 0 | 0
servicegroup_NetDevS-Scada-Radio | 7 | 0 | 0
servicegroup_NetDevSG | 7 | 0 | 0
servicegroup_ORACLE_Service-Group | 11 | 0 | 3
servicegroup_Scada_SG | 0 | 10 | 0
servicegroup_Unix-Linux_Infra_Service-Group | 47 | 0 | 14
servicegroup_Unix-Linux_Services_Service-Group | 6 | 0 | 1
servicegroup_mg_service-test | 0 | 6 | 0
worker_mgmon001 | 1 | 0 | 0
worker_mgmon002 | 1 | 0 | 0
worker_mgmon003 | 1 | 0 | 0
worker_mgmon004 | 1 | 0 | 0
worker_mgmon005 | 1 | 0 | 0
worker_mgmon006 | 1 | 0 | 0
worker_mgmon010 | 1 | 0 | 0
-------------------------------------------------------------------------------------------------
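In that output, a few queues have jobs waiting but no worker available (servicegroup_CallCenter_SG, servicegroup_Scada_SG and servicegroup_mg_service-test). A rough way to pick those out automatically is sketched below. It only assumes the pipe-separated layout shown above; the function names `parse_gearman_top` and `starved_queues` are made up for illustration and are not part of mod_gearman.

```python
def parse_gearman_top(text):
    """Return (queue, workers, waiting, running) tuples from gearman_top2-style text."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # timestamp line, separator lines, blank lines
        name, *counts = parts
        if not all(c.isdigit() for c in counts):
            continue  # header row
        rows.append((name, *map(int, counts)))
    return rows


def starved_queues(rows):
    """Queues that have jobs waiting but no worker available."""
    return [(name, waiting)
            for name, workers, waiting, running in rows
            if workers == 0 and waiting > 0]


# A few rows copied from the output above.
sample = """\
Queue Name | Worker Available | Jobs Waiting | Jobs Running
check_results | 1 | 0 | 0
servicegroup_CallCenter_SG | 0 | 7 | 0
servicegroup_Scada_SG | 0 | 10 | 0
servicegroup_mg_service-test | 0 | 6 | 0
service | 29 | 0 | 3
"""

for name, waiting in starved_queues(parse_gearman_top(sample)):
    print(f"{name}: {waiting} jobs waiting, no worker available")
```

Any queue this prints has no worker registered for it, so its checks will sit in the queue until a worker for that hostgroup/servicegroup connects.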
What I have also seen are the following errors:
Code: Select all
[root@mgmon001 gearmand]# tailf /var/log/gearmand/gearmand.log
ERROR 2017-00-26 15:22:19.000000 [ 1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-00-26 15:22:19.000000 [ 1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-00-26 15:47:19.000000 [ 3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-00-26 15:47:19.000000 [ 3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-00-26 16:13:05.000000 [ 3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-00-26 16:13:05.000000 [ 3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-00-26 16:28:55.000000 [ 1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-00-26 16:28:55.000000 [ 1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-00-26 16:53:55.000000 [ 1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-00-26 16:53:55.000000 [ 1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
What is very curious about this log is the timestamp. From what I've been reading on Google, there is a bug in the gearmand log that makes it write the previous month (00 instead of 01 here). It is also using UTC time rather than local time, and I don't know why it does this.
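To line these log entries up with local time, something like the following could be used. It is only a sketch and rests on two assumptions: that the logged month is simply one less than the real month (as the reported bug suggests), and that the logged time is UTC. The fixed -3 hour offset is the ART offset reported by timedatectl on this machine, and `gearmand_to_local` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local zone is ART (UTC-3, no DST), as timedatectl reports.
ART = timezone(timedelta(hours=-3), "ART")


def gearmand_to_local(stamp, local_tz=ART):
    """Convert a gearmand log timestamp to local time.

    Assumes the month field is one less than the real month and
    that the time is UTC.
    """
    date_part, time_part = stamp.split(" ")
    year, month, day = (int(x) for x in date_part.split("-"))
    clock = datetime.strptime(time_part, "%H:%M:%S.%f").time()
    utc = datetime(year, month + 1, day,
                   clock.hour, clock.minute, clock.second, clock.microsecond,
                   tzinfo=timezone.utc)
    return utc.astimezone(local_tz)


print(gearmand_to_local("2017-00-26 15:22:19.000000"))
```

Under those assumptions, the first error above (logged as 2017-00-26 15:22:19) corresponds to 12:22:19 local time on 2017-01-26.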
Code: Select all
[root@ gearmand]# timedatectl
Local time: Thu 2017-01-26 12:56:12 ART
Universal time: Thu 2017-01-26 15:56:12 UTC
RTC time: Thu 2017-01-26 15:56:12
Time zone: America/Argentina/Buenos_Aires (ART, -0300)
NTP enabled: yes
NTP synchronized: yes
RTC in local TZ: no
DST active: n/a
Another weird thing: in the image I am uploading you can see that there are 4 or 5 orphan hosts, all of them on their 2/4 attempt. However, the duration shows they have been in that state for hours. This is not good, since the other 2 attempts should have run by now to reach a hard state.
Note: If I force the orphan checks to run immediately, they return an OK state.
------------------------------------