Re: V5.2 Issue: Nagios service dies during Apply Configurati

Posted: Mon Oct 19, 2015 6:24 pm
by rajasegar
ssax wrote:rajasegar, please post a sanitized copy of your mod_gearman_neb.conf and your mod_gearman_worker.conf files.

Thank you
Is there some way to force the Nagios service to start automatically as a stopgap until this problem is solved?

Code:

[nagios@nagiosprodxi1 mod_gearman]$ cat mod_gearman_neb.conf
###############################################################################
#
#  Mod-Gearman - distribute checks with gearman
#
#  Copyright (c) 2010 Sven Nierlein
#
#  Mod-Gearman NEB Module Config
#
###############################################################################

# use debug to increase the verbosity of the module.
# Possible values are:
#     0 = only errors
#     1 = debug messages
#     2 = trace messages
#     3 = trace and all gearman related logs are going to stdout.
# Default is 0.
debug=0

# Path to the logfile.
logfile=/var/log/mod_gearman/mod_gearman_neb.log

# sets the address of your gearman job server. Can be specified
# more than once to add more servers.
server=localhost:4730


# sets the address of your 2nd (duplicate) gearman job server. Can
# be specified more than once to add more servers.
#dupserver=<host>:<port>


# defines if the module should distribute execution of
# eventhandlers.
eventhandler=yes


# defines if the module should distribute execution of
# service checks.
services=yes


# defines if the module should distribute execution of
# host checks.
hosts=yes


# sets a list of hostgroups which will go into separate
# queues. Either specify a comma separated list or use
# multiple lines.
hostgroups=LOAD_BALANCER_MSB
#hostgroups=name2,name3


# sets a list of servicegroups which will go into separate
# queues.
#servicegroups=name1,name2,name3

# Set this to 'no' if you want Mod-Gearman to only take care of
# servicechecks. No hostchecks will be processed by Mod-Gearman. Use
# this option to disable hostchecks and still have the possibility to
# use hostgroups for easy configuration of your services.
# If set to yes, you still have to define which hostchecks should be
# processed by either using 'hosts' or the 'hostgroups' option.
# Default is Yes.
do_hostchecks=yes

# This setting determines if all eventhandlers go into a single
# 'eventhandlers' queue or into the same queue like normal checks
# would do.
route_eventhandler_like_checks=no

# enables or disables encryption. It is strongly
# advised to not disable encryption. Anybody will be
# able to inject packages to your worker.
# Encryption is enabled by default and you have to
# explicitly disable it.
# When using encryption, you will either have to
# specify a shared password with key=... or a
# keyfile with keyfile=...
# Default is On.
encryption=yes


# A shared password which will be used for
# encryption of data packets. Should be at least 8
# bytes long. Maximum length is 32 characters.
key=should_be_changed


# The shared password will be read from this file.
# Use either key or keyfile. Only the first 32
# characters will be used.
#keyfile=/path/to/secret.file


# use_uniq_jobs
# Using uniq keys prevents the gearman queues from filling up when there
# is no worker. However, gearmand seems to have problems with the uniq
# key and sometimes jobs get stuck in the queue. Set this option to 'off'
# when you run into problems with stuck jobs but make sure your worker
# are running.
use_uniq_jobs=on



###############################################################################
#
# NEB Module Config
#
# the following settings are for the neb module only and
# will be ignored by the worker.
#
###############################################################################

# sets a list of hostgroups which will not be executed
# by gearman. They are just passed through.
# Default is none
localhostgroups=


# sets a list of servicegroups which will not be executed
# by gearman. They are just passed through.
# Default is none
localservicegroups=

# The queue_custom_variable can be used to define the target queue
# by a custom variable in addition to host/servicegroups. When set
# for ex. to 'WORKER' you then could define a '_WORKER' custom
# variable for your hosts and services to directly set the worker
# queue. The host queue is inherited unless overwritten
# by a service custom variable. Set the value of your custom
# variable to 'local' to bypass Mod-Gearman (Same behaviour as in
# localhostgroups/localservicegroups).
#queue_custom_variable=WORKER

# Number of result worker threads. Usually one is
# enough. You may increase the value if your
# result queue is not processed fast enough.
# Default: 1
result_workers=2


# defines if the module should distribute perfdata
# to gearman.
# Note: processing of perfdata is not part of
# mod_gearman. You will need additional worker for
# handling performance data. For example: pnp4nagios
# Performance data is just written to the gearman
# queue.
# Default: no
perfdata=no

# perfdata mode overwrite helps prevent the perfdata queue from getting too big
# 1 = overwrite
# 2 = append
perfdata_mode=1

# The Mod-Gearman NEB module will submit a fake result for orphaned host
# checks with a message saying there is no worker running for this
# queue. Use this option to get better reporting results, otherwise your
# hosts will keep their last state as long as there is no worker
# running.
# Default: yes
orphan_host_checks=yes

# Same as 'orphan_host_checks' but for services.
# Default: yes
orphan_service_checks=yes

# When accept_clear_results is enabled, the NEB module will accept unencrypted
# results too. This is quite useful if you have lots of passive checks and make
# use of send_gearman/send_multi where you would have to spread the shared key to
# all clients using these tools.
# Default is no.
accept_clear_results=no

[nagios@nagiosprodxi1 mod_gearman]$

Code:

[nagios@nagiosprodxi1 mod_gearman]$ cat mod_gearman_worker.conf
###############################################################################
#
#  Mod-Gearman - distribute checks with gearman
#
#  Copyright (c) 2010 Sven Nierlein
#
#  Worker Module Config
#
###############################################################################

# Identifier, hostname will be used if undefined
#identifier=hostname

# use debug to increase the verbosity of the module.
# Possible values are:
#     0 = only errors
#     1 = debug messages
#     2 = trace messages
#     3 = trace and all gearman related logs are going to stdout.
# Default is 0.
debug=0

# Path to the logfile.
logfile=/var/log/mod_gearman/mod_gearman_worker.log

# sets the address of your gearman job server. Can be specified
# more than once to add more servers.
server=localhost:4730


# sets the address of your 2nd (duplicate) gearman job server. Can
# be specified more than once to add more servers.
#dupserver=<host>:<port>


# defines if the module should distribute execution of
# eventhandlers.
eventhandler=yes


# defines if the module should distribute execution of
# service checks.
services=yes


# defines if the module should distribute execution of
# host checks.
hosts=yes


# sets a list of hostgroups which will go into separate
# queues. Either specify a comma separated list or use
# multiple lines.
hostgroups=LOAD_BALANCER_MSB
#hostgroups=name2,name3


# sets a list of servicegroups which will go into separate
# queues.
#servicegroups=name1,name2,name3

# Set this to 'no' if you want Mod-Gearman to only take care of
# servicechecks. No hostchecks will be processed by Mod-Gearman. Use
# this option to disable hostchecks and still have the possibility to
# use hostgroups for easy configuration of your services.
# If set to yes, you still have to define which hostchecks should be
# processed by either using 'hosts' or the 'hostgroups' option.
# Default is Yes.
do_hostchecks=yes

# enables or disables encryption. It is strongly
# advised to not disable encryption. Anybody will be
# able to inject packages to your worker.
# Encryption is enabled by default and you have to
# explicitly disable it.
# When using encryption, you will either have to
# specify a shared password with key=... or a
# keyfile with keyfile=...
# Default is On.
encryption=yes


# A shared password which will be used for
# encryption of data packets. Should be at least 8
# bytes long. Maximum length is 32 characters.
key=should_be_changed


# The shared password will be read from this file.
# Use either key or keyfile. Only the first 32
# characters will be used.
#keyfile=/path/to/secret.file


###############################################################################
#
# Worker Config
#
# the following settings are for the worker only and
# will be ignored by the neb module.
#
###############################################################################

# Path to the pidfile. Usually set by the init script
#pidfile=/var/mod_gearman/mod_gearman_worker.pid

# Default job timeout in seconds. Currently this value is only used for
# eventhandler. The worker will use the values from the core for host and
# service checks.
job_timeout=60

# Minimum number of worker processes which should
# run at any time.
min-worker=60

# Maximum number of worker processes which should
# run at any time. You may set this equal to
# min-worker setting to disable dynamic starting of
# workers. When setting this to 1, all services from
# this worker will be executed one after another.
max-worker=400

# Time after which an idling worker exits
# This parameter controls how fast your waiting workers will
# exit if there are no jobs waiting.
idle-timeout=30

# Controls the number of jobs a worker will do before it exits
# Use this to control how fast the amount of workers will go down
# after high load times
max-jobs=10000

# max-age is the threshold for discarding too old jobs. When a new job is older
# than this amount of seconds it will not be executed and just discarded. Set to
# zero to disable this check.
#max-age=0

# defines the rate of spawned worker per second as long
# as there are jobs waiting
spawn-rate=60

# Use this option to disable an extra fork for each plugin execution. Disabling
# this option will reduce the load on the worker host but can lead to problems with
# unclean plugins. Default: yes
fork_on_exec=no

# Use this option to show stderr output of plugins too.
# Default: yes
show_error_output=yes

# Use dup_results_are_passive to set if the duplicate result send to the dupserver
# will be passive or active.
# Default is yes (passive).
#dup_results_are_passive=yes

# When embedded perl has been compiled in, you can use this
# switch to enable or disable the embedded perl interpreter.
enable_embedded_perl=on

# Default value used when the perl script does not have a
# "nagios: +epn" or "nagios: -epn" set.
# Perl scripts not written for epn support usually fail with epn,
# so it's better to set the default to off.
use_embedded_perl_implicitly=off

# Cache compiled perl scripts. This makes the worker process a little
# bit bigger but makes execution of perl scripts even faster.
# When turned off, Mod-Gearman will still use the embedded perl
# interpreter, but will not cache the compiled script.
use_perl_cache=on

# path to p1 file which is used to execute and cache the
# perl scripts run by the embedded perl interpreter
p1_file=/usr/share/mod_gearman/mod_gearman_p1.pl


# Workarounds

# workaround for rc 25 bug
# duplicate jobs from gearmand result in exit code 25 of plugins
# because they are executed twice and get killed because of using
# the same resource.
# Sending results (when exit code is 25 ) will be skipped with this
# enabled.
workaround_rc_25=off
[nagios@nagiosprodxi1 mod_gearman]$


Re: V5.2 Issue: Nagios service dies during Apply Configurati

Posted: Tue Oct 20, 2015 1:30 pm
by ssax
Try changing result_workers=2 to result_workers=1 in your mod_gearman_neb.conf. Another customer made this change and it resolved the problem for them.
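
If you would rather script the edit than change the file by hand, here is a minimal sketch in Python. The helper name set_option is made up for illustration and it only works on the text of the file, so you would still need to read and write mod_gearman_neb.conf yourself (its full path is not shown in this thread). Nagios would then need a restart for the NEB module to pick up the new value.

```python
import re


def set_option(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with the first uncommented `key=...` line set to value.

    Lines that start with '#' are left untouched. If no active line for the
    key exists, `key=value` is appended at the end. (Hypothetical helper,
    not part of Mod-Gearman.)
    """
    # Match an active setting line such as "result_workers=2".
    pattern = re.compile(rf"^([ \t]*){re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{key}={value}", conf_text, count=1
    )
    if count == 0:
        # Key not found: append it, making sure it starts on its own line.
        sep = "" if (not conf_text or conf_text.endswith("\n")) else "\n"
        new_text = f"{conf_text}{sep}{key}={value}\n"
    return new_text


if __name__ == "__main__":
    sample = "# Default: 1\nresult_workers=2\nperfdata=no\n"
    print(set_option(sample, "result_workers", "1"), end="")
```

Only the first active result_workers line is rewritten, so the explanatory comments in the file stay as they are.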

Re: V5.2 Issue: Nagios service dies during Apply Configurati

Posted: Tue Oct 20, 2015 6:22 pm
by rajasegar
ssax wrote:Try changing result_workers=2 to result_workers=1 in your mod_gearman_neb.conf. Another customer made this change and it resolved the problem for them.
It seems to be OK now.

My other problem has also mysteriously disappeared:
https://support.nagios.com/forum/viewto ... 81#p157681

There must be some other root cause, as this was working fine before.
I will monitor it for a few days.

Re: V5.2 Issue: Nagios service dies during Apply Configurati

Posted: Wed Oct 21, 2015 9:03 am
by hsmith
Let us know what you come up with. Thanks!

Re: V5.2 Issue: Nagios service dies during Apply Configurati

Posted: Wed Oct 21, 2015 6:37 pm
by rajasegar
hsmith wrote:Let us know what you come up with. Thanks!
The problem is gone for now.
However, what will happen when we actually need to set result_workers=2 because of the load?

We will need a permanent solution to this.
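
One way to judge whether a single result worker keeps up is to watch how many jobs are waiting in the result queue. The sketch below parses the text that gearmand returns for its admin "status" command (tab-separated: function name, total jobs, running jobs, available workers, ending with a line containing only a dot). It assumes the Mod-Gearman result queue is named check_results and that you capture the status text yourself, for example from the job server on localhost:4730. The function names are made up for illustration.

```python
from typing import Dict, NamedTuple


class QueueStatus(NamedTuple):
    total: int    # jobs in the queue, including running ones
    running: int  # jobs currently being processed
    workers: int  # workers registered for this queue


def parse_gearman_status(text: str) -> Dict[str, QueueStatus]:
    """Parse gearmand 'status' output into a dict keyed by queue name."""
    queues: Dict[str, QueueStatus] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == ".":  # "." marks the end of the listing
            continue
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skip anything that is not a status row
        name, total, running, workers = parts
        queues[name] = QueueStatus(int(total), int(running), int(workers))
    return queues


def waiting_jobs(status: Dict[str, QueueStatus], queue: str = "check_results") -> int:
    """Jobs queued but not yet running; 0 if the queue is not listed."""
    q = status.get(queue)
    return 0 if q is None else max(q.total - q.running, 0)


if __name__ == "__main__":
    # Made-up sample data, not taken from the system in this thread.
    sample = "check_results\t12\t2\t2\nservice\t0\t0\t60\n.\n"
    print(waiting_jobs(parse_gearman_status(sample)))
```

If the waiting count for the result queue stays near zero with result_workers=1, the extra result worker is probably not needed yet.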

Re: V5.2 Issue: Nagios service dies during Apply Configurati

Posted: Thu Oct 22, 2015 2:43 pm
by tmcdonald
It's hard to say. It is almost certainly mod_gearman causing the crashes; I don't think there is any doubt about that. That puts us in a tricky spot because, from our perspective, Nagios is working. If the mod_gearman devs come up with a solution, we might count that as a win. However, since I wouldn't call Nagios "broken", all we could do on our side is try to implement some basic sandboxing for brokers (which is a good idea in and of itself), but that's a very complex topic and would not be a quick fix.

The other thing we could do is to take a look, version by version, and see where the crashes begin, then take a look at the code that changed between those versions and see if something can be done better to avoid a crash (or at the very least fail gracefully).
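
As a rough illustration of that version-by-version search, here is a sketch that uses a binary search instead of a strictly linear walk. It assumes the versions are ordered oldest to newest and that once the crash appears it stays in every later version. The version labels and the crashes callback are placeholders; in practice the callback would install a version, run Apply Configuration, and report whether Nagios died.

```python
from typing import Callable, Optional, Sequence


def first_crashing_version(
    versions: Sequence[str],
    crashes: Callable[[str], bool],
) -> Optional[str]:
    """Return the earliest version for which crashes(version) is True.

    Returns None if the list is empty or the newest version does not crash.
    """
    if not versions or not crashes(versions[-1]):
        return None
    lo, hi = 0, len(versions) - 1  # invariant: versions[hi] crashes
    while lo < hi:
        mid = (lo + hi) // 2
        if crashes(versions[mid]):
            hi = mid       # crash already present here, look earlier
        else:
            lo = mid + 1   # still fine here, the crash starts later
    return versions[lo]


if __name__ == "__main__":
    # Hypothetical labels; pretend the crash first shows up in "v4".
    known_bad = {"v4", "v5"}
    print(first_crashing_version(["v1", "v2", "v3", "v4", "v5"], known_bad.__contains__))
```

The diff between the returned version and the one just before it is then the code to review.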

At any rate, our devs are definitely aware of the issue, and mod_gearman's devs have been in (somewhat limited) contact with us in response to our reaching out to them.

Re: V5.2 Issue: Nagios service dies during Apply Configurati

Posted: Thu Oct 22, 2015 6:25 pm
by rajasegar
Thanks for the update. Mod-Gearman is critical for large installations, and this needs to be sorted out.
Keep us updated if there is any progress on this issue.

Please close the case.