Nagios 2014R1.2 with check_gearman: version 1.4_nagios4 running on libgearman 0.25
Everything was scheduling fine.
After upgrade
Nagios 2014R2.6 with check_gearman: version 1.5.0b1 running on libgearman 1.1.8
Serious scheduling issues howering most of the time around 60 and rare bursts around 800.
CPU seems OK, Memory seems ok so it must be some other bottleneck.
NDOUtils ok, no crashed tables in MySQL.
Can someone advice if there is any specific tuning that needs to be done to get the scheduling back to normal?
Monitoring setup details
- All active checks
- Large number of perl based checks
- 1779 hosts
- 14034 services
- Most checks 5 minute interval.
I have changed all the CPU and Memory checks to 10 minutes interval which resulted with some improvement.
Not sure how long this will last. Based on whats been happening recently, the next apply configuration will bring the scheduling down to about 60 again.
After a few Apply Configurations, it will be back to ok for a while ????
Code: Select all
2015-03-24 19:55:11 - localhost:4730 - v1.1.8
Queue Name | Worker Available | Jobs Waiting | Jobs Running
-----------------------------------------------------------------------
check_results | 2 | 0 | 0
eventhandler | 75 | 0 | 0
host | 75 | 0 | 1
service | 75 | 0 | 14
worker_nagiosprodxi1 | 1 | 0 | 0
-----------------------------------------------------------------------
Code: Select all
[nagios@nagiosprodxi1 mod_gearman]$ cat mod_gearman_neb.conf
###############################################################################
#
# Mod-Gearman - distribute checks with gearman
#
# Copyright (c) 2010 Sven Nierlein
#
# Mod-Gearman NEB Module Config
#
###############################################################################
# use debug to increase the verbosity of the module.
# Possible values are:
# 0 = only errors
# 1 = debug messages
# 2 = trace messages
# 3 = trace and all gearman related logs are going to stdout.
# Default is 0.
debug=0
# Path to the logfile.
logfile=/var/log/mod_gearman/mod_gearman_neb.log
# sets the addess of your gearman job server. Can be specified
# more than once to add more server.
server=localhost:4730
# sets the address of your 2nd (duplicate) gearman job server. Can
# be specified more than once o add more servers.
#dupserver=<host>:<port>
# defines if the module should distribute execution of
# eventhandlers.
eventhandler=yes
# defines if the module should distribute execution of
# service checks.
services=yes
# defines if the module should distribute execution of
# host checks.
hosts=yes
# sets a list of hostgroups which will go into seperate
# queues. Either specify a comma seperated list or use
# multiple lines.
#hostgroups=LOAD_BALANCER_MSB
#hostgroups=name2,name3
# sets a list of servicegroups which will go into seperate
# queues.
#servicegroups=name1,name2,name3
# Set this to 'no' if you want Mod-Gearman to only take care of
# servicechecks. No hostchecks will be processed by Mod-Gearman. Use
# this option to disable hostchecks and still have the possibility to
# use hostgroups for easy configuration of your services.
# If set to yes, you still have to define which hostchecks should be
# processed by either using 'hosts' or the 'hostgroups' option.
# Default is Yes.
do_hostchecks=yes
# This settings determines if all eventhandlers go into a single
# 'eventhandlers' queue or into the same queue like normal checks
# would do.
route_eventhandler_like_checks=no
# enables or disables encryption. It is strongly
# advised to not disable encryption. Anybody will be
# able to inject packages to your worker.
# Encryption is enabled by default and you have to
# explicitly disable it.
# When using encryption, you will either have to
# specify a shared password with key=... or a
# keyfile with keyfile=...
# Default is On.
encryption=yes
# A shared password which will be used for
# encryption of data pakets. Should be at least 8
# bytes long. Maximum length is 32 characters.
key=should_be_changed
# The shared password will be read from this file.
# Use either key or keyfile. Only the first 32
# characters will be used.
#keyfile=/path/to/secret.file
# use_uniq_jobs
# Using uniq keys prevents the gearman queues from filling up when there
# is no worker. However, gearmand seems to have problems with the uniq
# key and sometimes jobs get stuck in the queue. Set this option to 'off'
# when you run into problems with stuck jobs but make sure your worker
# are running.
use_uniq_jobs=on
###############################################################################
#
# NEB Module Config
#
# the following settings are for the neb module only and
# will be ignored by the worker.
#
###############################################################################
# sets a list of hostgroups which will not be executed
# by gearman. They are just passed through.
# Default is none
localhostgroups=
# sets a list of servicegroups which will not be executed
# by gearman. They are just passed through.
# Default is none
localservicegroups=
# The queue_custom_variable can be used to define the target queue
# by a custom variable in addition to host/servicegroups. When set
# for ex. to 'WORKER' you then could define a '_WORKER' custom
# variable for your hosts and services to directly set the worker
# queue. The host queue is inherited unless overwritten
# by a service custom variable. Set the value of your custom
# variable to 'local' to bypass Mod-Gearman (Same behaviour as in
# localhostgroups/localservicegroups).
#queue_custom_variable=WORKER
# Number of result worker threads. Usually one is
# enough. You may increase the value if your
# result queue is not processed fast enough.
# Default: 1
result_workers=2
# defines if the module should distribute perfdata
# to gearman.
# Note: processing of perfdata is not part of
# mod_gearman. You will need additional worker for
# handling performance data. For example: pnp4nagios
# Performance data is just written to the gearman
# queue.
# Default: no
perfdata=no
# perfdata mode overwrite helps preventing the perdata queue getting to big
# 1 = overwrote
# 2 = append
perfdata_mode=1
# The Mod-Gearman NEB module will submit a fake result for orphaned host
# checks with a message saying there is no worker running for this
# queue. Use this option to get better reporting results, otherwise your
# hosts will keep their last state as long as there is no worker
# running.
# Default: yes
orphan_host_checks=yes
# Same like 'orphan_host_checks' but for services.
# Default: yes
orphan_service_checks=yes
# When accept_clear_results is enabled, the NEB module will accept unencrypted
# results too. This is quite useful if you have lots of passive checks and make
# use of send_gearman/send_multi where you would have to spread the shared key to
# all clients using these tools.
# Default is no.
accept_clear_results=no
Code: Select all
[nagios@nagiosprodxi1 mod_gearman]$ cat mod_gearman_worker.conf
###############################################################################
#
# Mod-Gearman - distribute checks with gearman
#
# Copyright (c) 2010 Sven Nierlein
#
# Worker Module Config
#
###############################################################################
# Identifier, hostname will be used if undefined
#identifier=hostname
# use debug to increase the verbosity of the module.
# Possible values are:
# 0 = only errors
# 1 = debug messages
# 2 = trace messages
# 3 = trace and all gearman related logs are going to stdout.
# Default is 0.
debug=0
# Path to the logfile.
logfile=/var/log/mod_gearman/mod_gearman_worker.log
# sets the addess of your gearman job server. Can be specified
# more than once to add more server.
server=localhost:4730
# sets the address of your 2nd (duplicate) gearman job server. Can
# be specified more than once o add more servers.
#dupserver=<host>:<port>
# defines if the module should distribute execution of
# eventhandlers.
eventhandler=yes
# defines if the module should distribute execution of
# service checks.
services=yes
# defines if the module should distribute execution of
# host checks.
hosts=yes
# sets a list of hostgroups which will go into seperate
# queues. Either specify a comma seperated list or use
# multiple lines.
#hostgroups=name1
#hostgroups=name2,name3
# sets a list of servicegroups which will go into seperate
# queues.
#servicegroups=name1,name2,name3
# Set this to 'no' if you want Mod-Gearman to only take care of
# servicechecks. No hostchecks will be processed by Mod-Gearman. Use
# this option to disable hostchecks and still have the possibility to
# use hostgroups for easy configuration of your services.
# If set to yes, you still have to define which hostchecks should be
# processed by either using 'hosts' or the 'hostgroups' option.
# Default is Yes.
do_hostchecks=yes
# enables or disables encryption. It is strongly
# advised to not disable encryption. Anybody will be
# able to inject packages to your worker.
# Encryption is enabled by default and you have to
# explicitly disable it.
# When using encryption, you will either have to
# specify a shared password with key=... or a
# keyfile with keyfile=...
# Default is On.
encryption=yes
# A shared password which will be used for
# encryption of data pakets. Should be at least 8
# bytes long. Maximum length is 32 characters.
key=should_be_changed
# The shared password will be read from this file.
# Use either key or keyfile. Only the first 32
# characters will be used.
#keyfile=/path/to/secret.file
###############################################################################
#
# Worker Config
#
# the following settings are for the worker only and
# will be ignored by the neb module.
#
###############################################################################
# Path to the pidfile. Usually set by the init script
#pidfile=/var/mod_gearman/mod_gearman_worker.pid
# Default job timeout in seconds. Currently this value is only used for
# eventhandler. The worker will use the values from the core for host and
# service checks.
job_timeout=60
# Minimum number of worker processes which should
# run at any time.
min-worker=10
# Maximum number of worker processes which should
# run at any time. You may set this equal to
# min-worker setting to disable dynamic starting of
# workers. When setting this to 1, all services from
# this worker will be executed one after another.
max-worker=400
# Time after which an idling worker exists
# This parameter controls how fast your waiting workers will
# exit if there are no jobs waiting.
idle-timeout=20
# Controls the amount of jobs a worker will do before he exits
# Use this to control how fast the amount of workers will go down
# after high load times
max-jobs=10000
# max-age is the threshold for discarding too old jobs. When a new job is older
# than this amount of seconds it will not be executed and just discarded. Set to
# zero to disable this check.
#max-age=0
# defines the rate of spawned worker per second as long
# as there are jobs waiting
spawn-rate=50
# Use this option to disable an extra fork for each plugin execution. Disabling
# this option will reduce the load on the worker host but can lead to problems with
# unclean plugin. Default: yes
fork_on_exec=no
# Use this option to show stderr output of plugins too.
# Default: yes
show_error_output=yes
# Use dup_results_are_passive to set if the duplicate result send to the dupserver
# will be passive or active.
# Default is yes (passive).
#dup_results_are_passive=yes
# When embedded perl has been compiled in, you can use this
# switch to enable or disable the embedded perl interpreter.
enable_embedded_perl=on
# Default value used when the perl script does not have a
# "nagios: +epn" or "nagios: -epn" set.
# Perl scripts not written for epn support usually fail with epn,
# so its better to set the default to off.
use_embedded_perl_implicitly=off
# Cache compiled perl scripts. This makes the worker process a little
# bit bigger but makes execution of perl scripts even faster.
# When turned off, Mod-Gearman will still use the embedded perl
# interpreter, but will not cache the compiled script.
use_perl_cache=on
# path to p1 file which is used to execute and cache the
# perl scripts run by the embedded perl interpreter
p1_file=/usr/share/mod_gearman/mod_gearman_p1.pl
# Workarounds
# workaround for rc 25 bug
# duplicate jobs from gearmand result in exit code 25 of plugins
# because they are executed twice and get killed because of using
# the same ressource.
# Sending results (when exit code is 25 ) will be skipped with this
# enabled.
workaround_rc_25=off
Need to revert back to 200 and 1000.
Code: Select all
[nagios@nagiosprodxi1 etc]$ cat nagios.cfg
# MODIFIED
admin_email=root@localhost
admin_pager=root@localhost
translate_passive_host_checks=1
log_event_handlers=0
use_large_installation_tweaks=1
enable_environment_macros=0
# NDOUtils module
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
# PNP settings - bulk mode with NCPD
process_performance_data=1
# service performance data
service_perfdata_file=/usr/local/nagios/var/service-perfdata
##service_perfdata_file=/var/nagiosramdisk/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file-bulk
# host performance data
host_perfdata_file=/usr/local/nagios/var/host-perfdata
##host_perfdata_file=/var/nagiosramdisk/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata-file-bulk
# OBJECTS - UNMODIFIED
#cfg_file=/usr/local/nagios/etc/objects/commands.cfg
#cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
#cfg_file=/usr/local/nagios/etc/objects/templates.cfg
#cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
# STATIC OBJECT DEFINITIONS (THESE DON'T GET EXPORTED/IMPORTED BY NAGIOSQL)
cfg_dir=/usr/local/nagios/etc/static
# OBJECTS EXPORTED FROM NAGIOSQL
cfg_file=/usr/local/nagios/etc/contacttemplates.cfg
cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_file=/usr/local/nagios/etc/servicegroups.cfg
cfg_file=/usr/local/nagios/etc/hosttemplates.cfg
cfg_file=/usr/local/nagios/etc/servicetemplates.cfg
cfg_file=/usr/local/nagios/etc/servicedependencies.cfg
cfg_file=/usr/local/nagios/etc/serviceescalations.cfg
cfg_file=/usr/local/nagios/etc/hostdependencies.cfg
cfg_file=/usr/local/nagios/etc/hostescalations.cfg
cfg_file=/usr/local/nagios/etc/hostextinfo.cfg
cfg_file=/usr/local/nagios/etc/serviceextinfo.cfg
cfg_dir=/usr/local/nagios/etc/hosts
cfg_dir=/usr/local/nagios/etc/services
# GLOBAL EVENT HANDLERS
global_host_event_handler=xi_host_event_handler
global_service_event_handler=xi_service_event_handler
# UNMODIFIED
accept_passive_host_checks=1
accept_passive_service_checks=1
additional_freshness_latency=15
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45
bare_update_check=0
cached_host_check_horizon=15
cached_service_check_horizon=15
check_external_commands=1
check_for_orphaned_hosts=1
check_for_orphaned_services=1
check_for_updates=0
check_host_freshness=0
check_result_path=/usr/local/nagios/var/spool/checkresults
check_result_reaper_frequency=10
check_service_freshness=1
#command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
daemon_dumps_core=0
date_format=euro
debug_file=/usr/local/nagios/var/nagios.debug
debug_level=0
debug_verbosity=1
#enable_embedded_perl=0
enable_event_handlers=1
enable_flap_detection=1
enable_notifications=1
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
event_broker_options=-1
event_handler_timeout=60
execute_host_checks=1
execute_service_checks=1
#external_command_buffer_slots=4096
high_host_flap_threshold=20.0
high_service_flap_threshold=20.0
host_check_timeout=60
host_freshness_check_interval=60
host_inter_check_delay_method=s
illegal_macro_output_chars=`~$&|'"<>
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
interval_length=60
lock_file=/usr/local/nagios/var/nagios.lock
log_archive_path=/usr/local/nagios/var/archives
log_external_commands=0
log_file=/usr/local/nagios/var/nagios.log
log_host_retries=1
log_initial_states=0
log_notifications=1
log_passive_checks=0
log_rotation_method=d
log_service_retries=1
low_host_flap_threshold=5.0
low_service_flap_threshold=5.0
max_check_result_file_age=3600
max_check_result_reaper_time=30
max_concurrent_checks=0
max_debug_file_size=1000000
max_host_check_spread=30
max_service_check_spread=30
nagios_group=nagios
nagios_user=nagios
notification_timeout=30
object_cache_file=/usr/local/nagios/var/objects.cache
#object_cache_file=/var/nagiosramdisk/objects.cache
obsess_over_hosts=0
obsess_over_services=0
ocsp_timeout=5
#p1_file=/usr/local/nagios/bin/p1.pl
passive_host_checks_are_soft=0
perfdata_timeout=10
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retain_state_information=1
retention_update_interval=60
service_check_timeout=240
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
#sleep_time=0.25
soft_state_dependencies=0
state_retention_file=/usr/local/nagios/var/retention.dat
status_file=/usr/local/nagios/var/status.dat
#status_file=/var/nagiosramdisk/status.dat
status_update_interval=10
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/tmp
#temp_path=/var/nagiosramdisk/tmp
use_aggressive_host_checking=0
##use_embedded_perl_implicitly=0
use_regexp_matching=0
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=0
use_true_regexp_matching=0
broker_module=/usr/lib64/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf eventhandler=no
http://support.nagios.com/forum/viewtop ... 16&t=31675