Strange issue with Nagios stopping
Posted: Fri Sep 08, 2017 5:12 am
Over the past few weeks we have experienced times when Nagios just stops running.
I can't tie it down to anything; nothing on the Nagios server has been updated or installed.
The only thing that we have done is add new hosts and service checks.
When it stops, it can take a few attempts to get it back up and running.
We start it and it runs for maybe a minute or so and then dies again.
There is nothing in the log file to indicate why it has stopped.
Has anyone ever seen this?
Is there any logging I can put in place to see why it does it?
Specs are:
Nagios 4.3.2 core
gearmand-0.25-1
NagVis 1.8.5
Livestatus 1.2.7i3p2
Running on a VM, which it has been running on for over 2 years now.
64G RAM
6 x CPU
I have checked the config files using the -v switch with nagios and there are no errors and no warnings.
###
Running pre-flight check on configuration data...
Checking objects...
Checked 18167 services.
Checked 2554 hosts.
Checked 359 host groups.
Checked 63 service groups.
Checked 140 contacts.
Checked 40 contact groups.
Checked 345 commands.
Checked 33 time periods.
Checked 0 host escalations.
Checked 0 service escalations.
Checking for circular paths...
Checked 2554 hosts
Checked 0 service dependencies
Checked 0 host dependencies
Checked 33 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 0
Total Errors: 0
###
Nagios config file:
##########
log_file=/usr/local/nagios/var/nagios.log
cfg_dir=/usr/local/nagios/etc/objects
object_cache_file=/usr/local/nagios/var/objects.cache
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
status_file=/usr/local/nagios/var/ramdisk/status.dat
status_update_interval=30
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
command_file=/usr/local/nagios/var/rw/nagios.cmd
lock_file=/usr/local/nagios/var/nagios.lock
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/usr/local/nagios/var/ramdisk/tmp
event_broker_options=-1
log_rotation_method=d
log_archive_path=/usr/local/nagios/var/archives/naglogs
use_syslog=0
log_notifications=1
log_service_retries=0
log_host_retries=0
log_event_handlers=0
log_initial_states=0
log_current_states=0
log_external_commands=0
log_passive_checks=0
global_host_event_handler=log_host_state_changes
service_inter_check_delay_method=s
max_service_check_spread=40
service_interleave_factor=s
host_inter_check_delay_method=s
max_host_check_spread=40
max_concurrent_checks=0
check_result_reaper_frequency=10
max_check_result_reaper_time=30
check_result_path=/usr/local/nagios/var/ramdisk/spool/checkresults
max_check_result_file_age=3600
cached_host_check_horizon=15
cached_service_check_horizon=15
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
soft_state_dependencies=0
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
service_check_timeout=90
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/usr/local/nagios/var/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=0
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
interval_length=60
check_for_updates=0
bare_update_check=0
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=1
obsess_over_services=1
ocsp_command=global_service_event
obsess_over_hosts=0
translate_passive_host_checks=0
passive_host_checks_are_soft=0
check_for_orphaned_services=1
check_for_orphaned_hosts=1
check_service_freshness=0
service_freshness_check_interval=60
service_check_timeout_state=u
check_host_freshness=0
host_freshness_check_interval=60
additional_freshness_latency=15
enable_flap_detection=0
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=euro
illegal_object_name_chars=`~!$%^*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
use_regexp_matching=0
use_true_regexp_matching=0
admin_email=nagios@localhost
admin_pager=pagenagios@localhost
daemon_dumps_core=0
use_large_installation_tweaks=1
enable_environment_macros=1
debug_level=0
debug_verbosity=0
debug_file=/usr/local/nagios/var/nagios.debug
max_debug_file_size=1000000
allow_empty_hostgroup_assignment=0
host_down_disable_service_checks=1
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /usr/local/nagios/var/rw/live log_file=/usr/local/nagios/var/archives/naglogs/livestatus.log
broker_module=/usr/lib64/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf
##############
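On the logging question: the debug, syslog and core dump directives in the config above are all switched off at the moment. Below is a rough sketch of what turning them on might look like. The directive names are the standard nagios.cfg ones, but the values are assumptions and have not been tried on this server. With 18167 services the debug file will probably grow very quickly, so this would only be for a short capture around a crash.
##########
# Assumed values, untested here. -1 should log every debug category;
# a narrower bitmask (e.g. 16 for host/service checks, 64 for event broker) may be enough.
debug_level=-1
# 0 = brief, 1 = more detail, 2 = very detailed
debug_verbosity=1
debug_file=/usr/local/nagios/var/nagios.debug
# Currently 1000000 (about 1 MB); the larger size is a guess
max_debug_file_size=100000000
# Let the daemon write a core file if it crashes (OS core limits may also need raising)
daemon_dumps_core=1
# Also send log messages to syslog, in case nagios.log is missing the last lines
use_syslog=1
##########
If the process is being killed from outside rather than crashing on its own, none of this may show anything, so the system logs for that time would also be worth a look.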
Thanks in advance.
Tony