host checks are being done by different gearman

Posted: Wed Jul 29, 2015 1:55 pm
by bosecorp
I have a few hosts whose checks are, for some reason, being run by a new gearman worker that I recently set up. I looked at some of the devices being checked by this new worker, and they are not configured to run on it.

We have set up gearman workers in the past and never had an issue.

Re: host checks are being done by different gearman

Posted: Wed Jul 29, 2015 2:06 pm
by jdalrymple
I've never seen gearman do this before either. How did you verify? What is the output of gearman_top on your new server? It should only show the host and service queues, right?

Re: host checks are being done by different gearman

Posted: Wed Jul 29, 2015 2:33 pm
by bosecorp
These are the logs from my new gearman worker:

5:23:01][26946][DEBUG] child started with pid: 26946
[2015-07-29 15:23:01][26943][DEBUG] child started with pid: 26943
[2015-07-29 15:23:32][27086][DEBUG] child started with pid: 27086
[2015-07-29 15:23:32][27091][DEBUG] child started with pid: 27091
[2015-07-29 15:23:32][27089][DEBUG] child started with pid: 27089
[2015-07-29 15:23:33][27090][DEBUG] child started with pid: 27090
[2015-07-29 15:23:33][27088][DEBUG] child started with pid: 27088
[2015-07-29 15:23:33][27087][DEBUG] child started with pid: 27087
[2015-07-29 15:24:03][27228][DEBUG] child started with pid: 27228
[2015-07-29 15:24:04][27233][DEBUG] child started with pid: 27233
[2015-07-29 15:24:04][27231][DEBUG] child started with pid: 27231
[2015-07-29 15:24:04][27232][DEBUG] child started with pid: 27232
[2015-07-29 15:24:04][27230][DEBUG] child started with pid: 27230
[2015-07-29 15:24:04][27229][DEBUG] child started with pid: 27229
[2015-07-29 15:24:27][5071][INFO ] no checks in 2minutes, restarting all workers
[2015-07-29 15:24:31][27353][DEBUG] child started with pid: 27353
[2015-07-29 15:24:31][27356][DEBUG] child started with pid: 27356
[2015-07-29 15:24:31][27357][DEBUG] child started with pid: 27357
[2015-07-29 15:24:31][27354][DEBUG] child started with pid: 27354
[2015-07-29 15:24:31][27355][DEBUG] child started with pid: 27355
[2015-07-29 15:24:35][27377][DEBUG] child started with pid: 27377
[2015-07-29 15:25:02][27586][DEBUG] child started with pid: 27586
[2015-07-29 15:25:02][27589][DEBUG] child started with pid: 27589
[2015-07-29 15:25:02][27590][DEBUG] child started with pid: 27590
[2015-07-29 15:25:02][27587][DEBUG] child started with pid: 27587
[2015-07-29 15:25:02][27588][DEBUG] child started with pid: 27588
[2015-07-29 15:25:06][27621][DEBUG] child started with pid: 27621
[2015-07-29 15:25:33][27770][DEBUG] child started with pid: 27770
[2015-07-29 15:25:33][27774][DEBUG] child started with pid: 27774
[2015-07-29 15:25:33][27773][DEBUG] child started with pid: 27773
[2015-07-29 15:25:33][27772][DEBUG] child started with pid: 27772
[2015-07-29 15:25:33][27771][DEBUG] child started with pid: 27771
[2015-07-29 15:25:37][27795][DEBUG] child started with pid: 27795
[2015-07-29 15:25:55][27774][DEBUG] got host job: usfrcc-cave-2960.bose.com
[2015-07-29 15:26:04][27916][DEBUG] child started with pid: 27916
[2015-07-29 15:26:04][27919][DEBUG] child started with pid: 27919
[2015-07-29 15:26:04][27918][DEBUG] child started with pid: 27918
[2015-07-29 15:26:04][27917][DEBUG] child started with pid: 27917
[2015-07-29 15:26:08][27938][DEBUG] child started with pid: 27938


And this is the gearman_top output:

2015-07-29 15:33:34 - 10.100.30.111:4730 - v0.33

Queue Name             | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------
check_results          |                1 |            0 |            0
eventhandler           |               50 |            0 |            0
host                   |              165 |            0 |            1
hostgroup_gearman_dca1 |                5 |            0 |            0
hostgroup_gearman_dce1 |               58 |            0 |           22
hostgroup_gearman_dcn1 |               41 |            0 |            9
hostgroup_gearman_fdc  |                6 |            0 |            0
hostgroup_gearman_mi1  |                5 |            0 |            0
service                |               50 |            0 |            4
worker_gearmandca1     |                1 |            0 |            0
worker_gearmandce1     |                1 |            0 |            0
worker_gearmandcn1     |                1 |            0 |            0
worker_gearmanmi1      |                1 |            0 |            0
worker_nagmonus1       |                1 |            0 |            0
worker_nagmonus2       |                1 |            0 |            0
-------------------------------------------------------------------------
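
For anyone who wants to compare queues without reading the table by eye, here is a minimal Python sketch that parses gearman_top rows like the ones above and lists the queues that currently have running jobs. The function name parse_gearman_top is made up for this example, and the sketch assumes the four-column, pipe-separated layout shown in this post; it is not part of Mod-Gearman.

```python
def parse_gearman_top(text):
    """Parse gearman_top rows into {queue: (workers, waiting, running)}."""
    queues = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows have four pipe-separated fields, the last three numeric.
        # Header, separator, and timestamp lines fail this test and are skipped.
        if len(parts) == 4 and all(p.isdigit() for p in parts[1:]):
            queues[parts[0]] = tuple(int(p) for p in parts[1:])
    return queues


# A few rows copied from the output above.
sample = """\
Queue Name | Worker Available | Jobs Waiting | Jobs Running
host | 165 | 0 | 1
hostgroup_gearman_dca1 | 5 | 0 | 0
hostgroup_gearman_dce1 | 58 | 0 | 22
service | 50 | 0 | 4
"""

queues = parse_gearman_top(sample)
busy = sorted(name for name, (_, _, running) in queues.items() if running > 0)
print(busy)
```

Running the same parser against the output captured on each worker makes it easier to spot which queues are actually doing work at a given moment.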

Re: host checks are being done by different gearman

Posted: Wed Jul 29, 2015 2:36 pm
by jdalrymple
Let's take a peek at /etc/mod_gearman/mod_gearman_worker.conf on the new server, if you don't mind?

Re: host checks are being done by different gearman

Posted: Wed Jul 29, 2015 3:00 pm
by bosecorp
##############################################################################
#
# Mod-Gearman - distribute checks with gearman
#
# Copyright (c) 2010 Sven Nierlein
#
# Worker Module Config
#
###############################################################################

# Identifier, hostname will be used if undefined
identifier=gearmandca1

# use debug to increase the verbosity of the module.
# Possible values are:
# 0 = only errors
# 1 = debug messages
# 2 = trace messages
# 3 = trace and all gearman related logs are going to stdout.
# Default is 0.
debug=1

# Path to the logfile.
logfile=/var/log/mod_gearman/mod_gearman_worker.log

# sets the address of your gearman job server. Can be specified
# more than once to add more servers.
server=mynagiosxiserver:4730


# sets the address of your 2nd (duplicate) gearman job server. Can
# be specified more than once to add more servers.
#dupserver=<host>:<port>


# defines if the worker should execute eventhandlers.
eventhandler=no


# defines if the worker should execute
# service checks.
services=no


# defines if the worker should execute
# host checks.
hosts=yes


# sets a list of hostgroups which this worker will work
# on. Either specify a comma-separated list or use
# multiple lines.
#hostgroups=name1
#hostgroups=name2,name3
hostgroups=gearman_dca1

# sets a list of servicegroups which this worker will
# work on.
#servicegroups=gearman_dce1

# enables or disables encryption. It is strongly
# advised to not disable encryption. Anybody will be
# able to inject packages to your worker.
# Encryption is enabled by default and you have to
# explicitly disable it.
# When using encryption, you will either have to
# specify a shared password with key=... or a
# keyfile with keyfile=...
# Default is On.
encryption=yes


# A shared password which will be used for
# encryption of data packets. Should be at least 8
# bytes long. Maximum length is 32 characters.
key=mykey


# The shared password will be read from this file.
# Use either key or keyfile. Only the first 32
# characters will be used.
#keyfile=/path/to/secret.file

# Path to the pidfile. Usually set by the init script
#pidfile=/var/mod_gearman/mod_gearman_worker.pid

# Default job timeout in seconds. Currently this value is only used for
# eventhandler. The worker will use the values from the core for host and
# service checks.
job_timeout=60

# Minimum number of worker processes which should
# run at any time.
min-worker=5

# Maximum number of worker processes which should
# run at any time. You may set this equal to
# min-worker setting to disable dynamic starting of
# workers. When setting this to 1, all services from
# this worker will be executed one after another.
max-worker=200

# Time after which an idling worker exits
# This parameter controls how fast your waiting workers will
# exit if there are no jobs waiting.
idle-timeout=30

# Controls the amount of jobs a worker will do before he exits
# Use this to control how fast the amount of workers will go down
# after high load times
max-jobs=1000

# max-age is the threshold for discarding too old jobs. When a new job is older
# than this amount of seconds it will not be executed and just discarded. Set to
# zero to disable this check.
#max-age=0

# defines the rate of spawned worker per second as long
# as there are jobs waiting
spawn-rate=1

# Use this option to disable an extra fork for each plugin execution. Disabling
# this option will reduce the load on the worker host but can lead to problems with
# unclean plugin. Default: yes
fork_on_exec=no

# Set a limit based on the 1min load average. When exceeding the load limit,
# no new worker will be started until the current load is below the limit.
# No limit will be used when set to 0.
load_limit1=0

# Same as load_limit1 but for the 5min load average.
load_limit5=0

# Same as load_limit1 but for the 15min load average.
load_limit15=0

# Use this option to show stderr output of plugins too.
# Default: yes
show_error_output=yes

# Use dup_results_are_passive to set if the duplicate result send to the dupserver
# will be passive or active.
# Default is yes (passive).
#dup_results_are_passive=yes

# When embedded perl has been compiled in, you can use this
# switch to enable or disable the embedded perl interpreter.
enable_embedded_perl=on

# Default value used when the perl script does not have a
# "nagios: +epn" or "nagios: -epn" set.
# Perl scripts not written for epn support usually fail with epn,
# so its better to set the default to off.
use_embedded_perl_implicitly=off

# Cache compiled perl scripts. This makes the worker process a little
# bit bigger but makes execution of perl scripts even faster.
# When turned off, Mod-Gearman will still use the embedded perl
# interpreter, but will not cache the compiled script.
use_perl_cache=on

# path to p1 file which is used to execute and cache the
# perl scripts run by the embedded perl interpreter
p1_file=/usr/share/mod_gearman/mod_gearman_p1.pl


# Workarounds

# workaround for rc 25 bug
# duplicate jobs from gearmand result in exit code 25 of plugins
# because they are executed twice and get killed because of using
# the same resource.
# Sending results (when exit code is 25 ) will be skipped with this
# enabled.
workaround_rc_25=on

Re: host checks are being done by different gearman

Posted: Wed Jul 29, 2015 3:11 pm
by jdalrymple
So everything looks to be pretty much in order - I guess the question is how do you know this worker is picking up jobs it's not supposed to? It is configured for the main host queue and also the dca1 queue.

Is usfrcc-cave-2960.bose.com in one of the alternative queues?
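
To make the point about queues concrete, here is a rough Python sketch of how the settings in that config map to queue names. It is only a model for illustration: the function worker_queues is hypothetical, the queue names follow what gearman_top shows earlier in the thread (host, service, eventhandler, hostgroup_<name>), the servicegroup_<name> form is assumed by analogy, and treating a missing setting as "no" is an assumption of the sketch rather than documented worker behavior.

```python
def worker_queues(conf_text):
    """Return the queue names a worker config subscribes to (assumed naming)."""
    # Assumption for this sketch: an unset switch counts as "no".
    switches = {"hosts": "no", "services": "no", "eventhandler": "no"}
    hostgroups, servicegroups = [], []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything that is not key=value
        key, value = (part.strip() for part in line.split("=", 1))
        names = [g.strip() for g in value.split(",") if g.strip()]
        if key == "hostgroups":
            hostgroups += names
        elif key == "servicegroups":
            servicegroups += names
        elif key in switches:
            switches[key] = value.lower()

    queues = []
    if switches["hosts"] == "yes":
        queues.append("host")
    if switches["services"] == "yes":
        queues.append("service")
    if switches["eventhandler"] == "yes":
        queues.append("eventhandler")
    queues += ["hostgroup_" + g for g in hostgroups]
    queues += ["servicegroup_" + g for g in servicegroups]
    return queues


# The relevant lines from the posted mod_gearman_worker.conf.
sample_conf = """\
eventhandler=no
services=no
hosts=yes
hostgroups=gearman_dca1
#servicegroups=gearman_dce1
"""

print(worker_queues(sample_conf))
```

Under this model the posted config yields two queues, the generic host queue and hostgroup_gearman_dca1, which is why the worker can receive checks for hosts outside the dca1 hostgroup.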

Re: host checks are being done by different gearman

Posted: Wed Jul 29, 2015 3:24 pm
by bosecorp
Because the new worker does not have the new plugins yet. We are still in the process of completing the configuration of this new worker. Sometimes a check fails because the plugin doesn't exist on this worker yet.



The other way I can tell is by looking at the gearman logs on the new worker server:

[2015-07-29 16:21:25][12342][DEBUG] got host job: dcn-fab-a-117.bose.com
[2015-07-29 16:21:44][12474][DEBUG] child started with pid: 12474
[2015-07-29 16:21:45][12484][DEBUG] child started with pid: 12484
[2015-07-29 16:21:45][12485][DEBUG] child started with pid: 12485
[2015-07-29 16:21:50][12504][DEBUG] child started with pid: 12504
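
A small Python sketch for pulling the host names out of a worker log like the one above, so the list can be compared against hostgroup membership. It assumes the debug=1 line format shown in this thread ("got host job: <hostname>"); the names JOB_LINE and hosts_checked are invented for this example.

```python
import re
from collections import Counter

# Matches lines such as:
# [2015-07-29 16:21:25][12342][DEBUG] got host job: dcn-fab-a-117.bose.com
JOB_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\[(?P<pid>\d+)\]\[DEBUG\s*\] got host job: (?P<host>\S+)"
)


def hosts_checked(log_text):
    """Count how many host jobs the worker picked up, per host name."""
    return Counter(m.group("host") for m in JOB_LINE.finditer(log_text))


# Lines copied from the logs posted in this thread.
sample_log = """\
[2015-07-29 15:25:55][27774][DEBUG] got host job: usfrcc-cave-2960.bose.com
[2015-07-29 16:21:25][12342][DEBUG] got host job: dcn-fab-a-117.bose.com
[2015-07-29 16:21:44][12474][DEBUG] child started with pid: 12474
"""

counts = hosts_checked(sample_log)
for host, n in sorted(counts.items()):
    print(host, n)
```

Feeding it the full /var/log/mod_gearman/mod_gearman_worker.log gives a per-host tally of the checks this worker has taken.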

Re: host checks are being done by different gearman

Posted: Wed Jul 29, 2015 3:28 pm
by jdalrymple
I'm a bit confused. Is the problem that the worker is doing checks for the wrong hosts, or is the problem that it's doing checks for hosts at all?

Based upon your configuration, this new worker should be doing host checks and also should be doing host checks for that dca1 hostgroup. If you want it to stop doing work, you have to stop the worker or reconfigure it to not perform ANY host or service checks. Easier to just stop the worker.

Re: host checks are being done by different gearman

Posted: Wed Jul 29, 2015 3:34 pm
by bosecorp
The problem is that the worker is doing checks for the wrong hosts.

Re: host checks are being done by different gearman

Posted: Wed Jul 29, 2015 3:45 pm
by jdalrymple
According to the last profile I have of yours, the dcn host is in only one hostgroup, and that hostgroup is not in a worker queue.
The other one is in no hostgroups at all so it would get caught by the default queue.

So unless your hostgroups have changed since that profile was sent, the worker is behaving as expected. If you have made some adjustments you think we should look at I suggest moving this over to a ticket so we can review your profile.zip.
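
The routing described above can be sketched roughly like this in Python. This is a simplified model of the behavior as explained in this thread, not the module's actual code: route_host_check is a hypothetical helper, the set of queued hostgroups is taken from the hostgroup_* queues in the gearman_top output earlier, and "network_switches" is a made-up hostgroup name standing in for a hostgroup that has no worker queue.

```python
def route_host_check(host_hostgroups, queued_hostgroups):
    """Pick the queue for a host check under the assumed routing rule.

    A host in a hostgroup that has its own queue goes to hostgroup_<name>;
    any other host falls through to the default "host" queue.
    """
    for group in host_hostgroups:
        if group in queued_hostgroups:
            return "hostgroup_" + group
    return "host"


# Hostgroups that have their own queue, per the gearman_top output in this thread.
queued = {"gearman_dca1", "gearman_dce1", "gearman_dcn1", "gearman_fdc", "gearman_mi1"}

# A host whose only hostgroup has no worker queue ("network_switches" is hypothetical).
print(route_host_check(["network_switches"], queued))
# A host in no hostgroups at all.
print(route_host_check([], queued))
# A host in a hostgroup that does have its own queue.
print(route_host_check(["gearman_dca1"], queued))
```

In this model the first two hosts land in the default host queue, so any worker with hosts=yes, including the new one, can pick them up.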