Lots of rescheduling errors in nagios.log
Posted: Tue Nov 20, 2018 12:39 pm
by sigmainformatique
Hi Nagios Team,
We have an issue with our Nagios installation.
After a while, the checks stay locked.
There are lots of messages like this in nagios.log:
Code:
[1542669091] Warning: The check of service 'ora-xxxxx-instance' on host 'xxxxxxxxx' looks like it was orphaned (results never came back; last_check=1542652651; next_check=1542668370). I'm scheduling an immediate check of the service...
(There are no "fork()" error messages.)
We are using Gearmand.
Consequences: big gaps in our graphs, and statuses are not updated for hours.
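To gauge how stale a check is when one of these warnings is logged, the timestamps in the message can be compared directly. The following is a minimal sketch, assuming the warning always has the shape shown in the log line above; the function name `parse_orphan_line` and the regular expression are illustrative and not part of Nagios.

```python
import re
from datetime import datetime, timezone

# Pattern modelled on the single warning quoted in this thread (an assumption:
# other Nagios versions may word the message differently).
ORPHAN_RE = re.compile(
    r"^\[(?P<logged>\d+)\] Warning: The check of service '(?P<service>[^']*)' "
    r"on host '(?P<host>[^']*)' looks like it was orphaned "
    r"\(results never came back; last_check=(?P<last>\d+); next_check=(?P<next>\d+)\)"
)


def parse_orphan_line(line):
    """Return details of an orphaned-service warning, or None if no match."""
    match = ORPHAN_RE.match(line)
    if match is None:
        return None
    logged = int(match.group("logged"))
    last = int(match.group("last"))
    return {
        "host": match.group("host"),
        "service": match.group("service"),
        "logged_utc": datetime.fromtimestamp(logged, tz=timezone.utc),
        # Time between the last executed check and the warning being logged.
        "seconds_since_last_check": logged - last,
    }


sample = (
    "[1542669091] Warning: The check of service 'ora-xxxxx-instance' "
    "on host 'xxxxxxxxx' looks like it was orphaned "
    "(results never came back; last_check=1542652651; next_check=1542668370). "
    "I'm scheduling an immediate check of the service..."
)
info = parse_orphan_line(sample)
print(info["service"], round(info["seconds_since_last_check"] / 3600, 1), "hours")
```

For the quoted line, the gap between `last_check` and the log timestamp is 16440 seconds, roughly 4.6 hours, which matches the "not updated for hours" symptom.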
We have applied the solutions from this document:
https://support.nagios.com/kb/article.php?id=19
Our limits.conf on the Nagios server:
Code:
# End of file
* hard nofile 10000
* soft nofile 10000
root hard nofile 10000
root soft nofile 10000
#locked memory
* hard memlock 128
* soft memlock 128
#open files
* soft nofile 4096
* hard nofile 4096
#max user processes
* hard nproc 4096
* soft nproc 4096
#stack size
* hard stack 20480
* soft stack 20480
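The file above sets `nofile` for `*` twice, once to 10000 and once to 4096. Which value ends up effective depends on how the limits are applied on the system, so it is worth verifying with `ulimit -n` or `/proc/<pid>/limits` for the running Nagios process. A minimal sketch for spotting such repeated entries follows; the helper name `find_repeated_limits` is hypothetical and the parser assumes the simple four-column layout shown here.

```python
from collections import defaultdict


def find_repeated_limits(text):
    """Map (domain, type, item) -> values, for entries set more than once."""
    seen = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        fields = line.split()
        if len(fields) != 4:
            continue  # not a "<domain> <type> <item> <value>" line
        domain, kind, item, value = fields
        seen[(domain, kind, item)].append(value)
    return {key: values for key, values in seen.items() if len(values) > 1}


# The entries from the limits.conf posted above.
LIMITS = """\
* hard nofile 10000
* soft nofile 10000
root hard nofile 10000
root soft nofile 10000
* hard memlock 128
* soft memlock 128
* soft nofile 4096
* hard nofile 4096
* hard nproc 4096
* soft nproc 4096
* hard stack 20480
* soft stack 20480
"""

repeated = find_repeated_limits(LIMITS)
for key, values in repeated.items():
    print(key, values)
```

Only the two `*` `nofile` entries are reported, each with the values 10000 and 4096.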
Could you help?
Thank you in advance.
Re: Lots of rescheduling errors in nagios.log
Posted: Tue Nov 20, 2018 12:40 pm
by sigmainformatique
Here is an example of one of our worker.conf files:
Code:
debug=0
logfile=/var/log/mod_gearman2/mod_gearman_worker.log
server=xxxxxxxxx:4730
#dupserver=<host>:<port>
eventhandler=yes
services=yes
hosts=yes
#hostgroups=name2,name3
#servicegroups=name1,name2,name3
encryption=yes
key=XXXXXXXXX
#keyfile=/path/to/secret.file
#pidfile=/var/mod_gearman2/mod_gearman_worker.pid
job_timeout=60
min-worker=10
max-worker=50
idle-timeout=300
max-jobs=1000
#max-age=0
spawn-rate=1
fork_on_exec=no
load_limit1=0
load_limit5=0
load_limit15=0
show_error_output=yes
#dup_results_are_passive=yes
enable_embedded_perl=on
use_embedded_perl_implicitly=off
use_perl_cache=on
p1_file=/usr/share/mod_gearman2/mod_gearman_p1.pl
#restrict_path=/usr/local/plugins/
workaround_rc_25=off
Re: Lots of rescheduling errors in nagios.log
Posted: Tue Nov 20, 2018 12:46 pm
by sigmainformatique
Our module.conf:
Code:
# use debug to increase the verbosity of the module.
# Possible values are:
# 0 = only errors
# 1 = debug messages
# 2 = trace messages
# 3 = trace and all gearman related logs are going to stdout.
# Default is 0.
debug=0
# Path to the logfile.
logfile=/var/log/mod_gearman2/mod_gearman_neb.log
# sets the address of your gearman job server. Can be specified
# more than once to add more servers.
server=localhost:4730
# sets the address of your 2nd (duplicate) gearman job server. Can
# be specified more than once to add more servers.
#dupserver=<host>:<port>
# defines if the module should distribute execution of
# eventhandlers.
eventhandler=no
# defines if the module should distribute execution of
# service checks.
services=yes
# defines if the module should distribute execution of
# host checks.
hosts=yes
# sets a list of hostgroups which will go into separate
# queues. Either specify a comma separated list or use
# multiple lines.
#hostgroups=name1
#hostgroups=name2,name3
hostgroups=wk_xxxx
# sets a list of servicegroups which will go into separate
# queues.
#servicegroups=name1,name2,name3
# Set this to 'no' if you want Mod-Gearman to only take care of
# servicechecks. No hostchecks will be processed by Mod-Gearman. Use
# this option to disable hostchecks and still have the possibility to
# use hostgroups for easy configuration of your services.
# If set to yes, you still have to define which hostchecks should be
# processed by either using 'hosts' or the 'hostgroups' option.
# Default is Yes.
do_hostchecks=yes
# This settings determines if all eventhandlers go into a single
# 'eventhandlers' queue or into the same queue like normal checks
# would do.
route_eventhandler_like_checks=no
# enables or disables encryption. It is strongly
# advised to not disable encryption. Anybody will be
# able to inject packages to your worker.
# Encryption is enabled by default and you have to
# explicitly disable it.
# When using encryption, you will either have to
# specify a shared password with key=... or a
# keyfile with keyfile=...
# Default is On.
encryption=yes
# A shared password which will be used for
# encryption of data packets. Should be at least 8
# bytes long. Maximum length is 32 characters.
key=!@NAGIOSXI2018@!
# The shared password will be read from this file.
# Use either key or keyfile. Only the first 32
# characters will be used.
#keyfile=/path/to/secret.file
# use_uniq_jobs
# Using uniq keys prevents the gearman queues from filling up when there
# is no worker. However, gearmand seems to have problems with the uniq
# key and sometimes jobs get stuck in the queue. Set this option to 'off'
# when you run into problems with stuck jobs but make sure your workers
# are running.
use_uniq_jobs=off
###############################################################################
#
# NEB Module Config
#
# the following settings are for the neb module only and
# will be ignored by the worker.
#
###############################################################################
# sets a list of hostgroups which will not be executed
# by gearman. They are just passed through.
# Default is none
localhostgroups=
# sets a list of servicegroups which will not be executed
# by gearman. They are just passed through.
# Default is none
localservicegroups=
# The queue_custom_variable can be used to define the target queue
# by a custom variable in addition to host/servicegroups. When set
# for ex. to 'WORKER' you then could define a '_WORKER' custom
# variable for your hosts and services to directly set the worker
# queue. The host queue is inherited unless overwritten
# by a service custom variable. Set the value of your custom
# variable to 'local' to bypass Mod-Gearman (Same behaviour as in
# localhostgroups/localservicegroups).
#queue_custom_variable=WORKER
# Number of result worker threads. Usually one is
# enough. You may increase the value if your
# result queue is not processed fast enough.
# Default: 1
result_workers=1
# defines if the module should distribute perfdata
# to gearman.
# Note: processing of perfdata is not part of
# mod_gearman. You will need additional workers for
# handling performance data. For example: pnp4nagios
# Performance data is just written to the gearman
# queue.
# Default: no
perfdata=no
# perfdata mode overwrite helps prevent the perfdata queue from getting too big
# 1 = overwrite
# 2 = append
perfdata_mode=1
# The Mod-Gearman NEB module will submit a fake result for orphaned host
# checks with a message saying there is no worker running for this
# queue. Use this option to get better reporting results, otherwise your
# hosts will keep their last state as long as there is no worker
# running.
# Default: yes
orphan_host_checks=no
# Same as 'orphan_host_checks' but for services.
# Default: yes
orphan_service_checks=no
# When accept_clear_results is enabled, the NEB module will accept unencrypted
# results too. This is quite useful if you have lots of passive checks and make
# use of send_gearman/send_multi where you would have to spread the shared key to
# all clients using these tools.
# Default is no.
accept_clear_results=no
Re: Lots of rescheduling errors in nagios.log
Posted: Tue Nov 20, 2018 1:22 pm
by scottwilkerson
Do you have a worker set up to process checks for the wk_xxxx hostgroup?
In your module config you are routing them to a separate queue.
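The question above amounts to comparing the queues the NEB module writes to with the queues each worker reads from. Below is a minimal sketch of that comparison. It assumes Mod-Gearman's usual queue naming (`host`, `service`, `eventhandler`, `hostgroup_<name>`, `servicegroup_<name>`); the helper names `parse_conf` and `queues` are hypothetical, and the two config excerpts are trimmed from the files posted earlier in this thread.

```python
def parse_conf(text):
    """Parse key=value lines; repeated keys and comma lists accumulate."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf.setdefault(key.strip(), []).extend(
            part.strip() for part in value.split(",") if part.strip()
        )
    return conf


def queues(conf):
    """Queue names implied by a config, under the assumed naming scheme."""
    def enabled(key):
        values = conf.get(key) or ["no"]
        return values[-1].lower() in ("yes", "on", "1")

    names = set()
    if enabled("hosts"):
        names.add("host")
    if enabled("services"):
        names.add("service")
    if enabled("eventhandler"):
        names.add("eventhandler")
    names.update("hostgroup_" + group for group in conf.get("hostgroups", []))
    names.update("servicegroup_" + group for group in conf.get("servicegroups", []))
    return names


# Excerpt of the module.conf posted above.
MODULE = """
eventhandler=no
services=yes
hosts=yes
hostgroups=wk_xxxx
"""

# Excerpt of the general worker.conf posted above.
GENERAL_WORKER = """
eventhandler=yes
services=yes
hosts=yes
#hostgroups=name2,name3
"""

# Queues the module fills that this worker never reads.
uncovered = queues(parse_conf(MODULE)) - queues(parse_conf(GENERAL_WORKER))
print(sorted(uncovered))
```

With only the general worker, the `hostgroup_wk_xxxx` queue has no consumer, so a dedicated worker listing that hostgroup is needed. The same function also shows the reverse case: a dedicated worker that additionally has `hosts=yes` or `services=yes` would read from the generic `host` and `service` queues too. On a live system, the `gearman_top` tool shipped with Mod-Gearman shows queue names together with waiting jobs and available workers.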
Re: Lots of rescheduling errors in nagios.log
Posted: Fri Nov 23, 2018 11:34 am
by sigmainformatique
Yes!
I have a specific worker for this hostgroup, but the issue was that the following parameters were set to true:
The difference is that many plugins and networks are not available to this specific worker. I want it to run only certain checks in a specific network.
It works like a charm after changing to:
You can mark this thread solved, thank you!
Re: Lots of rescheduling errors in nagios.log
Posted: Mon Nov 26, 2018 7:56 am
by scottwilkerson
Great! Marking resolved.