Lots of rescheduling errors in nagios.log

sigmainformatique
Posts: 74
Joined: Mon Apr 23, 2018 8:11 am

Post by sigmainformatique »

Hi Nagios Team,

We have an issue with our Nagios installation.

After a while, the checks stay locked.

There are lots of messages like this in nagios.log:

Code:

[1542669091] Warning: The check of service 'ora-xxxxx-instance' on host 'xxxxxxxxx' looks like it was orphaned (results never came back; last_check=1542652651; next_check=1542668370).  I'm scheduling an immediate check of the service...

(There are no "fork()" error messages.)
We are using Gearmand.
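
For readers trying to judge how stale an orphaned check is, the epoch timestamps in the warning above can be decoded. The following is only a minimal sketch: the regular expression is written against the single log line quoted in this post, and the function name `parse_orphan_warning` is made up for illustration.

```python
import re
from datetime import datetime, timezone

# Log line copied from the post above (host and service names are already masked).
LINE = (
    "[1542669091] Warning: The check of service 'ora-xxxxx-instance' on host "
    "'xxxxxxxxx' looks like it was orphaned (results never came back; "
    "last_check=1542652651; next_check=1542668370).  I'm scheduling an "
    "immediate check of the service..."
)

# Pattern derived from the quoted line only; other Nagios versions may word it differently.
PATTERN = re.compile(
    r"^\[(?P<logged>\d+)\] Warning: The check of service '(?P<service>[^']*)' "
    r"on host '(?P<host>[^']*)' looks like it was orphaned "
    r"\(results never came back; last_check=(?P<last>\d+); next_check=(?P<next>\d+)\)"
)


def parse_orphan_warning(line):
    """Return timing details for an orphaned-service warning, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    logged, last, nxt = int(m["logged"]), int(m["last"]), int(m["next"])
    return {
        "service": m["service"],
        "host": m["host"],
        "logged_utc": datetime.fromtimestamp(logged, tz=timezone.utc),
        "seconds_since_last_check": logged - last,
        "seconds_past_next_check": logged - nxt,
    }


info = parse_orphan_warning(LINE)
print(info["logged_utc"].isoformat())
print(info["seconds_since_last_check"])  # 16440 seconds, i.e. 4 h 34 min
print(info["seconds_past_next_check"])   # 721 seconds
```

For this line, the last result was more than four and a half hours old when the warning was written, which matches the "status not updated for hours" symptom.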

Consequences: big gaps in our graphs, and statuses not updated for hours.



We have applied the solutions in this document:
https://support.nagios.com/kb/article.php?id=19


Our limits.conf on the Nagios server:

Code:

# End of file
* hard nofile 10000
* soft nofile 10000

root hard nofile 10000
root soft nofile 10000

#locked memory
* hard memlock 128
* soft memlock 128

#open files
* soft nofile 4096
* hard nofile 4096

#max user processes
* hard nproc 4096
* soft nproc 4096

#stack size
* hard stack 20480
* soft stack 20480

Could you help?
Thank you in advance.
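
A side note on the limits.conf above: the `*` domain sets `nofile` twice, once to 10000 and once to 4096. Which value ends up effective is not established in this thread, so it may be worth confirming with `ulimit -n` as the user that runs Nagios. The sketch below only flags such repeated entries; the helper name `find_conflicts` is hypothetical, and the parsing assumes the simple four-column layout shown in the post.

```python
from collections import defaultdict

# Entries copied from the limits.conf in the post above.
LIMITS_CONF = """\
* hard nofile 10000
* soft nofile 10000

root hard nofile 10000
root soft nofile 10000

#locked memory
* hard memlock 128
* soft memlock 128

#open files
* soft nofile 4096
* hard nofile 4096

#max user processes
* hard nproc 4096
* soft nproc 4096

#stack size
* hard stack 20480
* soft stack 20480
"""


def find_conflicts(text):
    """Return {(domain, type, item): [values]} for entries set to different values."""
    seen = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        fields = line.split()
        if len(fields) != 4:
            continue
        domain, kind, item, value = fields
        seen[(domain, kind, item)].append(value)
    return {key: values for key, values in seen.items() if len(set(values)) > 1}


conflicts = find_conflicts(LIMITS_CONF)
for key, values in sorted(conflicts.items()):
    print(key, values)
```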
Last edited by sigmainformatique on Tue Nov 20, 2018 12:42 pm, edited 1 time in total.

Post by sigmainformatique »

Here is an example of one of our worker.conf files:

Code:

debug=0
logfile=/var/log/mod_gearman2/mod_gearman_worker.log
server=xxxxxxxxx:4730
#dupserver=<host>:<port>
eventhandler=yes
services=yes
hosts=yes
#hostgroups=name2,name3
#servicegroups=name1,name2,name3
encryption=yes
key=XXXXXXXXX
#keyfile=/path/to/secret.file
#pidfile=/var/mod_gearman2/mod_gearman_worker.pid
job_timeout=60
min-worker=10
max-worker=50
idle-timeout=300
max-jobs=1000
#max-age=0
spawn-rate=1
fork_on_exec=no
load_limit1=0
load_limit5=0
load_limit15=0
show_error_output=yes
#dup_results_are_passive=yes
enable_embedded_perl=on
use_embedded_perl_implicitly=off
use_perl_cache=on
p1_file=/usr/share/mod_gearman2/mod_gearman_p1.pl
#restrict_path=/usr/local/plugins/
workaround_rc_25=off

Post by sigmainformatique »

Our module.conf:

Code:

# use debug to increase the verbosity of the module.
# Possible values are:
#     0 = only errors
#     1 = debug messages
#     2 = trace messages
#     3 = trace and all gearman related logs are going to stdout.
# Default is 0.
debug=0

# Path to the logfile.
logfile=/var/log/mod_gearman2/mod_gearman_neb.log

# sets the address of your gearman job server. Can be specified
# more than once to add more servers.
server=localhost:4730


# sets the address of your 2nd (duplicate) gearman job server. Can
# be specified more than once to add more servers.
#dupserver=<host>:<port>


# defines if the module should distribute execution of
# eventhandlers.
eventhandler=no


# defines if the module should distribute execution of
# service checks.
services=yes


# defines if the module should distribute execution of
# host checks.
hosts=yes


# sets a list of hostgroups which will go into separate
# queues. Either specify a comma separated list or use
# multiple lines.
#hostgroups=name1
#hostgroups=name2,name3
hostgroups=wk_xxxx


# sets a list of servicegroups which will go into separate
# queues.
#servicegroups=name1,name2,name3

# Set this to 'no' if you want Mod-Gearman to only take care of
# servicechecks. No hostchecks will be processed by Mod-Gearman. Use
# this option to disable hostchecks and still have the possibility to
# use hostgroups for easy configuration of your services.
# If set to yes, you still have to define which hostchecks should be
# processed by either using 'hosts' or the 'hostgroups' option.
# Default is Yes.
do_hostchecks=yes

# This setting determines if all eventhandlers go into a single
# 'eventhandlers' queue or into the same queue as normal checks
# would.
route_eventhandler_like_checks=no

# enables or disables encryption. It is strongly
# advised to not disable encryption. Anybody will be
# able to inject packages to your worker.
# Encryption is enabled by default and you have to
# explicitly disable it.
# When using encryption, you will either have to
# specify a shared password with key=... or a
# keyfile with keyfile=...
# Default is On.
encryption=yes


# A shared password which will be used for
# encryption of data packets. Should be at least 8
# bytes long. Maximum length is 32 characters.
key=!@NAGIOSXI2018@!


# The shared password will be read from this file.
# Use either key or keyfile. Only the first 32
# characters will be used.
#keyfile=/path/to/secret.file


# use_uniq_jobs
# Using uniq keys prevents the gearman queues from filling up when there
# is no worker. However, gearmand seems to have problems with the uniq
# key and sometimes jobs get stuck in the queue. Set this option to 'off'
# when you run into problems with stuck jobs but make sure your workers
# are running.
use_uniq_jobs=off



###############################################################################
#
# NEB Module Config
#
# the following settings are for the neb module only and
# will be ignored by the worker.
#
###############################################################################

# sets a list of hostgroups which will not be executed
# by gearman. They are just passed through.
# Default is none
localhostgroups=


# sets a list of servicegroups which will not be executed
# by gearman. They are just passed through.
# Default is none
localservicegroups=

# The queue_custom_variable can be used to define the target queue
# by a custom variable in addition to host/servicegroups. When set
# for ex. to 'WORKER' you then could define a '_WORKER' custom
# variable for your hosts and services to directly set the worker
# queue. The host queue is inherited unless overwritten
# by a service custom variable. Set the value of your custom
# variable to 'local' to bypass Mod-Gearman (Same behaviour as in
# localhostgroups/localservicegroups).
#queue_custom_variable=WORKER

# Number of result worker threads. Usually one is
# enough. You may increase the value if your
# result queue is not processed fast enough.
# Default: 1
result_workers=1


# defines if the module should distribute perfdata
# to gearman.
# Note: processing of perfdata is not part of
# mod_gearman. You will need additional worker for
# handling performance data. For example: pnp4nagios
# Performance data is just written to the gearman
# queue.
# Default: no
perfdata=no

# perfdata mode overwrite helps prevent the perfdata queue from getting too big
# 1 = overwrite
# 2 = append
perfdata_mode=1

# The Mod-Gearman NEB module will submit a fake result for orphaned host
# checks with a message saying there is no worker running for this
# queue. Use this option to get better reporting results, otherwise your
# hosts will keep their last state as long as there is no worker
# running.
# Default: yes
orphan_host_checks=no

# Same as 'orphan_host_checks' but for services.
# Default: yes
orphan_service_checks=no

# When accept_clear_results is enabled, the NEB module will accept unencrypted
# results too. This is quite useful if you have lots of passive checks and make
# use of send_gearman/send_multi where you would have to spread the shared key to
# all clients using these tools.
# Default is no.
accept_clear_results=no

scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Post by scottwilkerson »

Do you have a worker set up to process checks for the wk_xxxx hostgroup?

In your module config you are separating them out into a separate queue:

Code:

hostgroups=wk_xxxx

Post by sigmainformatique »

Yes!

I have a specific worker for this hostgroup, but the issue was that the following parameters were enabled:

Code:

services=yes
hosts=yes

The difference is that many plugins and networks are not available to this specific worker. I only want it to run certain checks in a specific network.

It works like a charm after changing to:

Code:

services=no
hosts=no

You can mark this thread solved, thank you!
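
To illustrate why this change helps, here is a rough sketch of which Gearman queues a worker would pull jobs from under each setting. It assumes the Mod-Gearman queue naming as I understand it from its documentation (`host`, `service`, `eventhandler`, `hostgroup_<name>`, `servicegroup_<name>`). The dedicated worker's full config was not posted, so the `hostgroups=wk_xxxx` line in the samples is an assumption, and both function names are hypothetical.

```python
def parse_worker_conf(text):
    """Parse key=value lines, ignoring comments; repeated keys keep every value."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf.setdefault(key.strip(), []).append(value.strip())
    return conf


def worker_queues(conf):
    """List the queues a worker would subscribe to (assumed naming, see lead-in)."""
    def enabled(key):
        # Assumption: an option that is not set is treated as disabled.
        return conf.get(key, ["no"])[-1].lower() in ("yes", "on", "1", "true")

    queues = []
    if enabled("eventhandler"):
        queues.append("eventhandler")
    if enabled("hosts"):
        queues.append("host")
    if enabled("services"):
        queues.append("service")
    for prefix, key in (("hostgroup_", "hostgroups"), ("servicegroup_", "servicegroups")):
        for entry in conf.get(key, []):
            queues.extend(prefix + name.strip() for name in entry.split(",") if name.strip())
    return queues


# Assumed config of the dedicated worker before and after the change.
BEFORE = "services=yes\nhosts=yes\nhostgroups=wk_xxxx\n"
AFTER = "services=no\nhosts=no\nhostgroups=wk_xxxx\n"

print(worker_queues(parse_worker_conf(BEFORE)))  # ['host', 'service', 'hostgroup_wk_xxxx']
print(worker_queues(parse_worker_conf(AFTER)))   # ['hostgroup_wk_xxxx']
```

With `services=yes` and `hosts=yes`, the dedicated worker would presumably also take jobs from the generic `host` and `service` queues, including checks whose plugins or networks it cannot reach, which is consistent with the orphaned checks described at the start of the thread.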

Post by scottwilkerson »

Great! Marking this as resolved.