Nagios Core 4.0.8 / CentOS 6 - State Retention issues?

akio_outori · Post by **akio_outori** » Fri May 01, 2015 1:52 pm

We're seeing some fairly odd behavior from our Nagios installation. Whenever a reload or restart of the Nagios daemon is triggered, notifications are re-sent for all down services and hosts, regardless of whether state retention is enabled. By default we use an alert-once scheme (no re-notification), and I can see that retained states are being picked up by the monitoring system on Nagios start. Am I missing something, or is this expected behavior?

I'm including a copy of our nagios.cfg below. Thanks for any insight you can provide.

Code: Select all

log_file=/var/log/nagios/nagios.log
cfg_dir=/etc/nagios/objects
cfg_dir=/etc/nagios/clients
cfg_dir=/etc/nagios/hosts
object_cache_file=/dev/shm/nagios/objects.cache
precached_object_file=/dev/shm/nagios/objects.precache
resource_file=/etc/nagios/resource.cfg
status_file=/dev/shm/nagios/status.dat
status_update_interval=10
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
command_file=/dev/shm/nagios/rw/nagios.cmd
lock_file=/dev/shm/nagios/nagios.lock
temp_file=/dev/shm/nagios/nagios.tmp
temp_path=/dev/shm/nagios/tmp
event_broker_options=-1
broker_module=/usr/libexec/merlin/merlin.so /etc/merlin/merlin.conf
broker_module=/usr/libexec/livestatus/livestatus.o hidden_custom_var_prefix=OP5SECRET__ pnp_path=/opt/monitor/op5/pnp/perfdata /dev/shm/nagios/rw/live
log_rotation_method=d
log_archive_path=/var/log/nagios/archives
use_syslog=0
log_notifications=1
log_service_retries=0
log_host_retries=0
log_event_handlers=1
log_initial_states=0
log_current_states=0
log_external_commands=1
log_passive_checks=0
service_inter_check_delay_method=.10
max_service_check_spread=60
service_interleave_factor=s
host_inter_check_delay_method=s
max_host_check_spread=60
max_concurrent_checks=100
check_result_path=/dev/shm/nagios/spool/checkresults
max_check_result_file_age=3600
cached_host_check_horizon=60
cached_service_check_horizon=60
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
soft_state_dependencies=0
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=300
service_check_timeout=120
host_check_timeout=5
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/var/log/nagios/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=1
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
interval_length=60
check_for_updates=1
bare_update_check=0
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=0
enable_event_handlers=1
process_performance_data=0
obsess_over_services=0
obsess_over_hosts=0
translate_passive_host_checks=0
passive_host_checks_are_soft=0
check_for_orphaned_services=1
check_for_orphaned_hosts=1
check_service_freshness=0
service_freshness_check_interval=60
service_check_timeout_state=c
check_host_freshness=0
host_freshness_check_interval=60
additional_freshness_latency=15
enable_flap_detection=1
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=us
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
use_regexp_matching=0
use_true_regexp_matching=0
admin_email=nagios@localhost
admin_pager=pagenagios@localhost
daemon_dumps_core=0
use_large_installation_tweaks=0
free_child_process_memory=0
child_processes_fork_twice=0
debug_level=-1
debug_verbosity=1
debug_file=/var/log/nagios/nagios.debug
max_debug_file_size=1000000
allow_empty_hostgroup_assignment=0
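
For anyone following along, here is a minimal sketch of how the retention-related directives could be pulled out of a nagios.cfg like the one above, to make them easier to review in one place. It is plain `key=value` parsing in Python and does not try to replicate how Nagios itself reads the file. The helper names `parse_main_config` and `retention_settings` are hypothetical, chosen only for this illustration.

```python
def parse_main_config(text):
    """Parse key=value lines from nagios.cfg-style text into a dict.

    Directives that may repeat (cfg_dir, broker_module) are overwritten
    by their last occurrence; that is acceptable here because the
    retention directives are single-valued.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def retention_settings(settings):
    """Return only the directives whose names mention retention."""
    return {
        key: value
        for key, value in settings.items()
        if "retain" in key or "retention" in key
    }


if __name__ == "__main__":
    # A few lines copied from the config posted above.
    sample = """\
retain_state_information=1
state_retention_file=/var/log/nagios/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=1
enable_notifications=0
"""
    for key, value in sorted(retention_settings(parse_main_config(sample)).items()):
        print(f"{key}={value}")
```

The filter matches on both "retain" and "retention" because `state_retention_file` and `retention_update_interval` do not contain the substring "retain".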

abrist · Post by **abrist** » Fri May 01, 2015 2:22 pm

This is odd, as state retention is enabled in your config. How are you sending the notifications: with the notification handler, an event handler, or through an escalation?
If you acknowledge an issue, does it re-alert after a nagios start?

ssax · Post by **ssax** » Fri May 01, 2015 2:23 pm

Please post the output of these commands:

Code: Select all

ls -l /dev/shm/nagios/status.dat
ls -l /var/log/nagios/retention.dat

Also, post a relevant host/service from your /usr/local/nagios/var/objects.cache

akio_outori · Post by **akio_outori** » Fri May 01, 2015 3:02 pm

Notifications are sent via the standard notification handler, i.e. via the notification_commands options specified on contacts. I've included a sample of one of those templates below.

Code: Select all

define contact{
        use                             base-contact
        name                            email-contact
        service_notification_commands   service-email
        host_notification_commands      host-email
        register                        0
        service_notification_options    c,r
        host_notification_options       d,r
        }

service-email and host-email are Perl notification scripts that generate some pretty HTML notifications for us, but have no special logic beyond that.

Permissions of the requested files:

Code: Select all

# ls -l /dev/shm/nagios/status.dat
-rw-rw-r-- 1 nagios nagios 6926043 May  1 12:47 /dev/shm/nagios/status.dat

# ls -l /var/log/nagios/retention.dat
-rw-r--r-- 1 nagios nagios 6958858 May  1 12:37 /var/log/nagios/retention.dat

Also, notifications DO still get sent out for acknowledged issues after a start, even with state retention enabled. A relevant host definition (with confidential info removed) would be:

Code: Select all

define host {
	host_name	app26
	alias	app26
	address	x.x.x.x
	parents	sw02
	check_period	24x7
	check_command	check-host-alive
	contact_groups	on-call
	notification_period	24x7
	initial_state	o
	importance	0
	check_interval	0.000000
	retry_interval	1.000000
	max_check_attempts	4
	active_checks_enabled	1
	passive_checks_enabled	1
	obsess	1
	event_handler_enabled	1
	low_flap_threshold	0.000000
	high_flap_threshold	0.000000
	flap_detection_enabled	0
	flap_detection_options	a
	freshness_threshold	120
	check_freshness	0
	notification_options	r,d,f,s
	notifications_enabled	1
	notification_interval	0.000000
	first_notification_delay	0.000000
	stalking_options	n
	process_perf_data	1
	retain_status_information	1
	retain_nonstatus_information	1
	}

jdalrymple · Post by **jdalrymple** » Mon May 04, 2015 11:12 am

I tried to reproduce your problem in a simple lab, but couldn't.

One thought:

Code: Select all

retention_update_interval=60

I don't think this number matters much on a well-behaved system, since I believe retention.dat is written out during a clean Nagios exit anyway. Is it possible that your Nagios isn't exiting properly during the start/restart? One thing to try is changing that number from 60 to 1 or 2 and seeing whether it affects the behavior.
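
One way to test that theory is to stop Nagios and then look at what actually landed in retention.dat. Below is a rough Python sketch for that. It assumes the usual `type { key=value ... }` block layout of retention.dat, and that host and service blocks carry `current_state` and `current_notification_number` fields. Those field names are from memory, so check them against your own file first. The helper names are hypothetical.

```python
def parse_retention_blocks(text):
    """Yield (block_type, fields) for each 'type { ... }' block."""
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if block_type is None:
            # Outside a block: only a line ending in '{' opens one.
            if line.endswith("{"):
                block_type, fields = line[:-1].strip(), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            key, value = line.split("=", 1)
            fields[key] = value


def problems_without_notification_history(text):
    """List problem hosts/services whose retained notification counter is 0.

    Assumption: an object that is in a non-OK state but has a retained
    counter of 0 might look 'never notified' to Nagios after a restart.
    """
    suspects = []
    for block_type, fields in parse_retention_blocks(text):
        if block_type not in ("host", "service"):
            continue
        if fields.get("current_state", "0") == "0":
            continue  # UP / OK, not interesting here
        if fields.get("current_notification_number", "0") == "0":
            name = fields.get("host_name", "?")
            if block_type == "service":
                name += "/" + fields.get("service_description", "?")
            suspects.append(name)
    return suspects
```

To use it, read `/var/log/nagios/retention.dat` (the path from the config posted earlier) as a user that can access it, and pass the contents to `problems_without_notification_history`. If hosts that have already alerted show up in the result after a stop, that would suggest the notification history is not making it into the retention file.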

Nagios Support Forum
