gearman - a lot JOBs waiting

Posted: Thu Mar 03, 2016 8:48 am
by bosecorp
I've noticed that my gearman server is falling behind on jobs.

It seems to be having a hard time keeping up:

2016-03-03 21:47:28 - 139.68.12.15:4730 - v1.1.8

Queue Name | Worker Available | Jobs Waiting | Jobs Running
------------------------------------------------------------------------
check_results | 1 | 1255 | 0
hostgroup_gearman_hk1 | 24 | 0 | 9
worker_gearmanhk1 | 1 | 0 | 0
------------------------------------------------------------------------

Re: gearman - a lot JOBs waiting

Posted: Thu Mar 03, 2016 3:43 pm
by rkennedy
How many CPUs do you have allocated to the gearman machine? Additionally, what's the output of top -b -n 1 | head -25?
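
If it is easier to collect this from a script, here is a minimal Python sketch that reports the core count and the load average per core. The helper name `load_per_core` is made up for illustration, and `os.getloadavg()` is only available on Unix-like systems. As a rough rule of thumb, a sustained value above 1.0 per core suggests the CPU is saturated.

```python
import os


def load_per_core(load_avg, cores):
    """Return the load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


if __name__ == "__main__":
    cores = os.cpu_count() or 1            # cpu_count() can return None
    one, five, fifteen = os.getloadavg()   # 1, 5 and 15 minute averages
    print(f"cores: {cores}")
    for label, value in (("1m", one), ("5m", five), ("15m", fifteen)):
        per_core = load_per_core(value, cores)
        print(f"load {label}: {value:.2f} ({per_core:.2f} per core)")
```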

Re: gearman - a lot JOBs waiting

Posted: Thu Mar 03, 2016 4:08 pm
by bosecorp
here you go


root@hk-ngmon01:(03-04 05:07): /root
# cat /proc/cpuinfo | grep processor | wc -l
1



top - 05:07:16 up 27 days, 14:54, 1 user, load average: 1.43, 1.09, 0.91
Tasks: 223 total, 3 running, 220 sleeping, 0 stopped, 0 zombie
Cpu(s): 37.1%us, 4.2%sy, 0.0%ni, 58.2%id, 0.0%wa, 0.3%hi, 0.2%si, 0.0%st
Mem: 3916052k total, 3185824k used, 730228k free, 431896k buffers
Swap: 15724540k total, 0k used, 15724540k free, 2110000k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3445 gdm 20 0 392m 63m 9952 R 50.3 1.7 51:37.83 gnome-settings-
5203 nagios 20 0 144m 11m 2128 R 38.7 0.3 0:00.20 check_wmi_plus.
195 root 20 0 0 0 0 S 1.9 0.0 45:54.06 scsi_eh_1
5090 nagios 20 0 71836 3660 3056 S 1.9 0.1 0:00.01 wmic
1 root 20 0 19356 1536 1228 S 0.0 0.0 0:16.62 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:11.28 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/0
6 root RT 0 0 0 0 S 0.0 0.0 0:03.69 watchdog/0
7 root 20 0 0 0 0 S 0.0 0.0 1:59.11 events/0
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 events/0
9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 events_long/0
10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 events_power_ef
11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cgroup
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 netns
14 root 20 0 0 0 0 S 0.0 0.0 0:00.00 async/mgr

Re: gearman - a lot JOBs waiting

Posted: Thu Mar 03, 2016 5:06 pm
by jolson
check_results | 1 | 1255 | 0
There's only one worker for your 1255 checks to run.
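
To make that imbalance easy to spot, here is a rough Python sketch that parses rows in the pipe-separated layout pasted above and lists the queues with jobs waiting. The function names are hypothetical and the parser assumes exactly four columns per row.

```python
def parse_gearman_top_row(line):
    """Split one 'name | workers | waiting | running' row into a dict."""
    name, workers, waiting, running = [field.strip() for field in line.split("|")]
    return {
        "queue": name,
        "workers": int(workers),
        "waiting": int(waiting),
        "running": int(running),
    }


def backlogged(rows):
    """Return the names of queues that have jobs waiting."""
    return [row["queue"] for row in rows if row["waiting"] > 0]


# Rows copied from the first post in this thread.
sample = """\
check_results | 1 | 1255 | 0
hostgroup_gearman_hk1 | 24 | 0 | 9
worker_gearmanhk1 | 1 | 0 | 0
"""

rows = [parse_gearman_top_row(line) for line in sample.splitlines() if line.strip()]
print(backlogged(rows))  # ['check_results']
```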

Is this a gearman_worker node or is this your gearman server?

Either way, I'd like to see your worker config file:

Code:

cat /etc/mod_gearman2/worker.conf

Once you get that back, we'll see if there's anything wrong with the configuration file.

I'm also interested in your waiting jobs:

Code:

ps -ef | grep gearman | wc -l
ps -ef | grep gearman | tail -n20
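
Keep in mind that the `wc -l` count above also includes the `grep` process itself and the `gearmand` daemon. As a rough sketch, the following Python counts only the worker processes from `ps -ef` style lines. The function name `count_workers` and the sample lines are illustrative, and the parser assumes the standard eight-column `ps -ef` layout.

```python
def count_workers(ps_ef_lines, binary="mod_gearman_worker"):
    """Count 'ps -ef' lines whose executable is the worker binary."""
    count = 0
    for line in ps_ef_lines:
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        executable = fields[7].split()[0]
        # Compare the base name so 'grep gearman' and 'gearmand' are skipped.
        if executable.rsplit("/", 1)[-1] == binary:
            count += 1
    return count


# Illustrative lines in 'ps -ef' layout.
sample = [
    "nagios   36122 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d",
    "root     36601 34571  0 08:32 pts/0    00:00:00 grep gearman",
    "gearmand 60583     1  0 Mar03 ?        00:04:12 /usr/sbin/gearmand -d -q builtin",
]
print(count_workers(sample))  # 1
```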

Re: gearman - a lot JOBs waiting

Posted: Thu Mar 03, 2016 7:32 pm
by bosecorp
My worker runs on the same server as gearmand.

Code:

root@hk-ngmon01:(03-04 08:29): /root
# ps -ef | grep gearman | wc -l
43


root@hk-ngmon01:(03-04 08:29): /root
# ps -ef | grep gearman | tail -n20
nagios   36122 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36134 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36135 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36136 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36146 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36311 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36317 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36318 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36319 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36344 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36345 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36346 38099  0 08:31 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36418 38099  0 08:32 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36445 38099  0 08:32 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36446 38099  0 08:32 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36529 38099  0 08:32 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   36547 38099  0 08:32 ?        00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
root     36601 34571  0 08:32 pts/0    00:00:00 grep gearman
nagios   38099     1  0 04:25 ?        00:00:02 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
gearmand 60583     1  0 Mar03 ?        00:04:12 /usr/sbin/gearmand -d --log-file=/var/log/gearmand/gearmand.log -q builtin --verbose WARNING
root@hk-ngmon01:(03-04 08:29): /root
#



# cat /etc/mod_gearman/mod_gearman_worker.conf
###############################################################################
#
#  Mod-Gearman - distribute checks with gearman
#
#  Copyright (c) 2010 Sven Nierlein
#
#  Worker Module Config
#
###############################################################################

# Identifier, hostname will be used if undefined
identifier=gearmanhk1

# use debug to increase the verbosity of the module.
# Possible values are:
#     0 = only errors
#     1 = debug messages
#     2 = trace messages
#     3 = trace and all gearman related logs are going to stdout.
# Default is 0.
debug=1

# Path to the logfile.
logfile=/var/log/mod_gearman/mod_gearman_worker.log

# sets the address of your gearman job server. Can be specified
# more than once to add more servers.
#server=10.100.30.113:4730
server=localhost:4730

# sets the address of your 2nd (duplicate) gearman job server. Can
# be specified more than once to add more servers.
#dupserver=<host>:<port>


# defines if the module should distribute execution of
# eventhandlers.
eventhandler=no


# defines if the module should distribute execution of
# service checks.
services=no


# defines if the module should distribute execution of
# host checks.
hosts=no


# sets a list of hostgroups which will go into separate
# queues. Either specify a comma separated list or use
# multiple lines.
hostgroups=gearman_hk1
#hostgroups=name2,name3


# sets a list of servicegroups which will go into separate
# queues.

# Set this to 'no' if you want Mod-Gearman to only take care of
# servicechecks. No hostchecks will be processed by Mod-Gearman. Use
# this option to disable hostchecks and still have the possibility to
# use hostgroups for easy configuration of your services.
# If set to yes, you still have to define which hostchecks should be
# processed by either using 'hosts' or the 'hostgroups' option.
# Default is Yes.
do_hostchecks=yes

# enables or disables encryption. It is strongly
# advised to not disable encryption. Anybody will be
# able to inject packages to your worker.
# Encryption is enabled by default and you have to
# explicitly disable it.
# When using encryption, you will either have to
# specify a shared password with key=... or a
# keyfile with keyfile=...
# Default is On.
encryption=yes


# A shared password which will be used for
# encryption of data packets. Should be at least 8
# bytes long. Maximum length is 32 characters.
key=Qb86Fg93



# The shared password will be read from this file.
# Use either key or keyfile. Only the first 32
# characters will be used.
#keyfile=/path/to/secret.file


###############################################################################
#
# Worker Config
#
# the following settings are for the worker only and
# will be ignored by the neb module.
#
###############################################################################

# Path to the pidfile. Usually set by the init script
#pidfile=/var/mod_gearman/mod_gearman_worker.pid

# Default job timeout in seconds. Currently this value is only used for
# eventhandler. The worker will use the values from the core for host and
# service checks.
job_timeout=60

# Minimum number of worker processes which should
# run at any time.
min-worker=39

# Maximum number of worker processes which should
# run at any time. You may set this equal to
# min-worker setting to disable dynamic starting of
# workers. When setting this to 1, all services from
# this worker will be executed one after another.
max-worker=100

# Time after which an idling worker exits
# This parameter controls how fast your waiting workers will
# exit if there are no jobs waiting.
idle-timeout=30

# Controls the amount of jobs a worker will do before he exits
# Use this to control how fast the amount of workers will go down
# after high load times
max-jobs=1000

# max-age is the threshold for discarding too old jobs. When a new job is older
# than this amount of seconds it will not be executed and just discarded. Set to
# zero to disable this check.
#max-age=0

# defines the rate of spawned worker per second as long
# as there are jobs waiting
spawn-rate=1

# Use this option to disable an extra fork for each plugin execution. Disabling
# this option will reduce the load on the worker host but can lead to problems with
# unclean plugin. Default: yes
fork_on_exec=no

# Use this option to show stderr output of plugins too.
# Default: yes
show_error_output=yes

# Use dup_results_are_passive to set if the duplicate result send to the dupserver
# will be passive or active.
# Default is yes (passive).
#dup_results_are_passive=yes

# When embedded perl has been compiled in, you can use this
# switch to enable or disable the embedded perl interpreter.
enable_embedded_perl=on

# Default value used when the perl script does not have a
# "nagios: +epn" or "nagios: -epn" set.
# Perl scripts not written for epn support usually fail with epn,
# so it's better to set the default to off.
use_embedded_perl_implicitly=off

# Cache compiled perl scripts. This makes the worker process a little
# bit bigger but makes execution of perl scripts even faster.
# When turned off, Mod-Gearman will still use the embedded perl
# interpreter, but will not cache the compiled script.
use_perl_cache=on

# path to p1 file which is used to execute and cache the
# perl scripts run by the embedded perl interpreter
p1_file=/usr/share/mod_gearman/mod_gearman_p1.pl


# Workarounds

# workaround for rc 25 bug
# duplicate jobs from gearmand result in exit code 25 of plugins
# because they are executed twice and get killed because of using
# the same resource.
# Sending results (when exit code is 25) will be skipped with this
# enabled.
workaround_rc_25=off

Re: gearman - a lot JOBs waiting

Posted: Thu Mar 03, 2016 7:51 pm
by gormank
I tried running my nagios servers (gearman on the same host) on a VM w/ a single core and they were not happy. I went back to 4 cores and now load averages are around .25 or so.

I'm too dim to know how to show the jobs waiting as shown in post 1. Please enlighten me. I did google it, but not much help... I also looked at the output of gearman -H and check_gearman2

Re: gearman - a lot JOBs waiting

Posted: Fri Mar 04, 2016 10:26 am
by jolson
bosecorp,

Could you try allocating one or two more CPUs to the gearman machine?


gormank,

The application you're looking for is either called gearman_top or gearman_top2.
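
If neither tool is at hand, the same numbers can be read from gearmand's text admin interface by sending `status` to the job server port. The Python below is only a sketch under a few assumptions: the server listens on localhost:4730, each reply line is tab-separated as function, total, running, available workers, a line with a single `.` ends the listing, and the total column includes running jobs (so waiting is total minus running). The function names are made up for illustration.

```python
import socket


def parse_status(text):
    """Parse gearmand 'status' output into a list of queue dicts.

    Each line is: function <TAB> total <TAB> running <TAB> available_workers.
    A line containing only '.' ends the listing.
    """
    queues = []
    for line in text.splitlines():
        if line == ".":
            break
        name, total, running, workers = line.split("\t")
        queues.append({
            "queue": name,
            "workers": int(workers),
            "waiting": int(total) - int(running),
            "running": int(running),
        })
    return queues


def fetch_status(host="localhost", port=4730, timeout=5.0):
    """Send 'status' to gearmand and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"status\n")
        data = b""
        # Read until the terminating '.' line or until the server closes.
        while not data.endswith(b".\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode()
```

On the gearman host, `for q in parse_status(fetch_status()): print(q)` would print one dict per queue.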

Re: gearman - a lot JOBs waiting

Posted: Fri Mar 04, 2016 10:41 am
by bosecorp
Did that.

The situation improved for the first few hours, but the backlog built up again over time.

I now have 8 CPUs with 8GB, but I still see 800 jobs waiting:

2016-03-04 23:43:28 - 139.68.12.15:4730 - v1.1.8

Queue Name | Worker Available | Jobs Waiting | Jobs Running
------------------------------------------------------------------------
check_results | 1 | 941 | 1
hostgroup_gearman_hk1 | 39 | 0 | 18
worker_gearmanhk1 | 1 | 0 | 0
------------------------------------------------------------------------

Re: gearman - a lot JOBs waiting

Posted: Fri Mar 04, 2016 2:01 pm
by ssax
Try adjusting your worker configuration to disable debug logging (increases load) and reduce your min-worker to 5 (it will automatically start as many as it needs up to max-worker, no need to consume the extra resources on startup).

Code:

debug=0
min-worker=5
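
To check which values a worker config currently has before editing it, a quick Python sketch like the one below can help. It assumes the simple key=value format with `#` comments seen in the config posted earlier. The function name and the sample text are illustrative only.

```python
def parse_worker_conf(text):
    """Parse key=value lines, skipping blanks and '#' comments.

    Keys that appear more than once (for example 'server') keep every value.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings.setdefault(key.strip(), []).append(value.strip())
    return settings


# Sample values taken from the config posted earlier in this thread.
sample = """\
# use debug to increase the verbosity of the module.
debug=1
min-worker=39
max-worker=100
"""

conf = parse_worker_conf(sample)
if conf.get("debug", ["0"])[-1] != "0":
    print("debug logging is enabled")
if int(conf["min-worker"][-1]) > 5:
    print("min-worker is above the suggested 5")
```

To run it against the real file, read the file's contents and pass them to `parse_worker_conf` instead of `sample`.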

Re: gearman - a lot JOBs waiting

Posted: Fri Mar 04, 2016 2:31 pm
by bosecorp
just implemented that...still doesn't help


2016-03-05 03:31:06 - 139.68.12.15:4730 - v1.1.8

Queue Name | Worker Available | Jobs Waiting | Jobs Running
------------------------------------------------------------------------
check_results | 1 | 1065 | 1
hostgroup_gearman_hk1 | 33 | 0 | 9
worker_gearmanhk1 | 1 | 0 | 0
------------------------------------------------------------------------