Downtime but Notifications still sent
So, check out these three images: a big downtime was scheduled, yet every service marked as in downtime still had notifications sent.
Any idea why they were still sent? The first image is just a sample; 100+ services were affected.
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
Re: Downtime but Notifications still sent
Can you show us the service definition of one of the "problem" services? Can you also make sure the system time/timezone is correct on the server?
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Downtime but Notifications still sent
Code: Select all
define service {
    host_name rbns-365-dap01
    service_description DEV - Admin - Apps Listener
    use rbns_generic-service-5
    servicegroups rbns_dev
    check_command check_by_ssh_procs!1:1!1:1!-a APPS_DEV -C tnslsnr -u appldev!!!!!
    check_period xi_timeperiod_24x7
    notification_period xi_timeperiod_24x7
    register 1
}
Code: Select all
define service {
    name rbns_generic-service-5
    service_description Generic Robins Service(5 Min)
    is_volatile 0
    max_check_attempts 3
    check_interval 5
    retry_interval 2
    active_checks_enabled 1
    passive_checks_enabled 1
    check_period 24x7
    parallelize_check 1
    obsess_over_service 1
    check_freshness 0
    event_handler_enabled 1
    flap_detection_enabled 1
    process_perf_data 1
    retain_status_information 1
    retain_nonstatus_information 1
    notification_interval 0
    notification_period 24x7
    notification_options w,c,u,r,
    notifications_enabled 1
    register 0
}
Code: Select all
[root@iss-chi-nag05 ~]# cat /etc/php.ini|grep "timezone"
; Defines the default timezone used by the date functions
date.timezone = US/Central
[root@iss-chi-nag05 ~]# date
Fri Feb 20 10:22:45 CST 2015
Re: Downtime but Notifications still sent
Can you also post the contact definition of sdnagios contact (+ any relevant templates that it is using), and the nagios.cfg file? Is your XI server in a distributed environment? Do you have notifications configured as event handlers?
Re: Downtime but Notifications still sent
lmiltchev wrote:Can you also post the contact definition of sdnagios contact (+ any relevant templates that it is using), and the nagios.cfg file? Is your XI server in a distributed environment? Do you have notifications configured as event handlers?
Umm, everything is on one server (except DB and NDO, which are offloaded), no gearman, nothing else special. Could it have been some odd NDO timing issue where the service alert was communicated to the XI server before the downtime, and that allowed the notification to be sent... especially since it was a flex downtime?
Here is the contact and template detail:
Code: Select all
define contact {
    contact_name sdnagios
    alias ITC Service Desk (Nagios)
    host_notifications_enabled 1
    service_notifications_enabled 1
    host_notification_period sdnagios_notification_times
    service_notification_period sdnagios_notification_times
    host_notification_options d,u,r,f
    service_notification_options w,u,c,r,f
    email [email protected]
    host_notifications_enabled 1
    service_notifications_enabled 1
    use xi_contact_generic
}
Code: Select all
define contact {
    name xi_contact_generic
    contactgroups xi_contactgroup_all
    host_notification_period xi_timeperiod_24x7
    service_notification_period xi_timeperiod_24x7
    host_notification_options d,u,r,f,s
    service_notification_options w,u,c,r,f,s
    host_notification_commands xi_host_notification_handler
    service_notification_commands xi_service_notification_handler
    register 0
}
Re: Downtime but Notifications still sent
"Could it have been some odd NDO timing issue where the service alert was communicated to the XI server before the downtime, and that allowed the notification to be sent... especially since it was a flex downtime?"
It is possible... I'm not sure.
BTW, you forgot to post the nagios.cfg. Let's take a look at it.
Also, run the following command and show us the output in code wraps:
Code: Select all
grep "DEV - Admin - Apps Listener" /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e'
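For readers without Perl at hand, the same filtering and timestamp conversion can be sketched in Python. This is only an illustrative equivalent of the one-liner above, not anything shipped with Nagios: the function names `humanize` and `matching_lines` are made up for this example, and it assumes each log line starts with an epoch timestamp in square brackets, which is what the one-liner relies on as well.

```python
import re
import time

# The first run of digits on a nagios.log line is the epoch timestamp.
EPOCH = re.compile(r"\d+")


def humanize(line: str) -> str:
    """Replace only the first run of digits with a local-time string."""
    return EPOCH.sub(lambda m: time.ctime(int(m.group(0))), line, count=1)


def matching_lines(lines, *needles):
    """Yield humanized lines that contain every search string in `needles`."""
    for line in lines:
        if all(needle in line for needle in needles):
            yield humanize(line)


if __name__ == "__main__":
    # Log path taken from the command above; adjust for your install.
    with open("/usr/local/nagios/var/nagios.log") as log:
        for out in matching_lines(log, "DEV - Admin - Apps Listener"):
            print(out, end="")
```

Passing a host name as a second search string narrows the output to one host, which is the same effect as piping through an extra `grep`.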
Re: Downtime but Notifications still sent
I modified your command to limit it to just the host in question... otherwise it was 3 times the data.
Dangit, it always annoys me when you guys don't read entire posts, and now I missed something myself... here is my nagios.cfg. I'll be away for 5 minutes beating myself up!
Code: Select all
[root@iss-chi-nag05 ~]# grep "DEV - Admin - Apps Listener" /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e'|grep "rbns-365-dap01"
[Thu Feb 19 00:00:00 2015] CURRENT SERVICE STATE: rbns-365-dap01;DEV - Admin - Apps Listener;OK;HARD;1;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:05:11 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;SOFT;1;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:07:10 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;SOFT;2;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;HARD;3;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: rbns_nagios_all;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: sdnagios;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STARTED; Service has entered a period of scheduled downtime
[Fri Feb 20 00:38:29 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STARTED; Service has entered a period of scheduled downtime
[Fri Feb 20 00:40:27 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;OK;HARD;3;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Fri Feb 20 00:40:27 2015] SERVICE NOTIFICATION: rbns_nagios_all;rbns-365-dap01;DEV - Admin - Apps Listener;OK;xi_service_notification_handler;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Fri Feb 20 00:40:27 2015] SERVICE NOTIFICATION: sdnagios;rbns-365-dap01;DEV - Admin - Apps Listener;OK;xi_service_notification_handler;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Fri Feb 20 01:09:08 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STOPPED; Service has exited from a period of scheduled downtime
[root@iss-chi-nag05 ~]#
Code: Select all
# MODIFIED
admin_email=root@localhost
admin_pager=root@localhost
translate_passive_host_checks=1
log_event_handlers=0
use_large_installation_tweaks=1
enable_environment_macros=0
# NDOUtils module
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
# PNP settings - bulk mode with NCPD
process_performance_data=1
# service performance data
service_perfdata_file=/var/nagiosramdisk/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file-bulk
# host performance data
host_perfdata_file=/var/nagiosramdisk/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata-file-bulk
# OBJECTS - UNMODIFIED
#cfg_file=/usr/local/nagios/etc/objects/commands.cfg
#cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
#cfg_file=/usr/local/nagios/etc/objects/templates.cfg
#cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
# STATIC OBJECT DEFINITIONS (THESE DON'T GET EXPORTED/IMPORTED BY NAGIOSQL)
cfg_dir=/usr/local/nagios/etc/static
# OBJECTS EXPORTED FROM NAGIOSQL
cfg_file=/usr/local/nagios/etc/contacttemplates.cfg
cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_file=/usr/local/nagios/etc/servicegroups.cfg
cfg_file=/usr/local/nagios/etc/hosttemplates.cfg
cfg_file=/usr/local/nagios/etc/servicetemplates.cfg
cfg_file=/usr/local/nagios/etc/servicedependencies.cfg
cfg_file=/usr/local/nagios/etc/serviceescalations.cfg
cfg_file=/usr/local/nagios/etc/hostdependencies.cfg
cfg_file=/usr/local/nagios/etc/hostescalations.cfg
cfg_file=/usr/local/nagios/etc/hostextinfo.cfg
cfg_file=/usr/local/nagios/etc/serviceextinfo.cfg
cfg_dir=/usr/local/nagios/etc/hosts
cfg_dir=/usr/local/nagios/etc/services
# GLOBAL EVENT HANDLERS
global_host_event_handler=xi_host_event_handler
global_service_event_handler=xi_service_event_handler
# UNMODIFIED
accept_passive_host_checks=1
accept_passive_service_checks=1
additional_freshness_latency=15
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45
bare_update_check=0
cached_host_check_horizon=15
cached_service_check_horizon=15
check_external_commands=1
check_for_orphaned_hosts=1
check_for_orphaned_services=1
check_for_updates=1
check_host_freshness=0
check_result_path=/var/nagiosramdisk/spool/checkresults
check_result_reaper_frequency=10
check_service_freshness=1
check_workers=16
#command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
daemon_dumps_core=0
date_format=us
debug_file=/usr/local/nagios/var/nagios.debug
debug_level=0
debug_verbosity=1
#enable_embedded_perl=1
enable_event_handlers=1
enable_flap_detection=1
enable_notifications=1
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
event_broker_options=-1
event_handler_timeout=30
execute_host_checks=1
execute_service_checks=1
#external_command_buffer_slots=4096
high_host_flap_threshold=50.0
high_service_flap_threshold=50.0
host_check_timeout=30
host_freshness_check_interval=60
host_inter_check_delay_method=s
illegal_macro_output_chars=`~$&|'"<>
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
interval_length=60
lock_file=/usr/local/nagios/var/nagios.lock
log_archive_path=/usr/local/nagios/var/archives
log_external_commands=0
log_file=/usr/local/nagios/var/nagios.log
log_host_retries=1
log_initial_states=0
log_notifications=1
log_passive_checks=0
log_rotation_method=d
log_service_retries=1
low_host_flap_threshold=25.0
low_service_flap_threshold=25.0
max_check_result_file_age=3600
max_check_result_reaper_time=30
max_concurrent_checks=4000
max_debug_file_size=1000000
#max_host_check_spread=30
max_host_check_spread=60
#max_service_check_spread=30
max_service_check_spread=60
nagios_group=nagios
nagios_user=nagios
notification_timeout=30
object_cache_file=/var/nagiosramdisk/objects.cache
status_file=/var/nagiosramdisk/status.dat
temp_path=/var/nagiosramdisk/tmp
obsess_over_hosts=0
obsess_over_services=0
ocsp_timeout=5
#p1_file=/usr/local/nagios/bin/p1.pl
passive_host_checks_are_soft=0
perfdata_timeout=5
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retain_state_information=1
retention_update_interval=60
service_check_timeout=480
service_check_timeout_state=u
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
#sleep_time=0.25
soft_state_dependencies=0
state_retention_file=/usr/local/nagios/var/retention.dat
status_update_interval=10
temp_file=/usr/local/nagios/var/nagios.tmp
use_aggressive_host_checking=0
####use_embedded_perl_implicitly=1
use_regexp_matching=0
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=1
use_true_regexp_matching=0
Re: Downtime but Notifications still sent
Now that I've had time to look further myself, these four lines in the log (below) are interesting to me:
The first one shows the service going into a HARD state. The next two are the notifications, and the final one is the flexible downtime starting. All four were logged in the same second, but it looks as though Nagios processed the notifications before starting the flexible downtime, rather than starting the downtime first (going by the order in which the lines were written).
Is that how it should process, or is that an ordering bug in the way it was processed?
Code: Select all
[Thu Feb 19 22:09:09 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;HARD;3;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: rbns_nagios_all;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: sdnagios;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STARTED; Service has entered a period of scheduled downtime
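With 100+ services affected, the same-second ordering shown in these lines could be checked across the whole log rather than by eye. Below is a rough Python sketch, assuming the log has already been converted to the `[Thu Feb 19 22:09:09 2015]` form used in this thread. The function name `notifications_racing_downtime` is hypothetical, and the field positions are inferred from the lines posted here, not from any Nagios API.

```python
import re

# "[timestamp] ENTRY TYPE: semicolon;separated;fields"
LINE = re.compile(r"^\[(?P<ts>[^\]]+)\] (?P<kind>[A-Z ]+): (?P<rest>.*)$")


def notifications_racing_downtime(lines):
    """Find notifications logged in the same second as, and before,
    the downtime start for the same host/service."""
    pending = {}  # (timestamp, host, service) -> contacts notified so far
    raced = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        ts, kind = m["ts"], m["kind"]
        fields = m["rest"].split(";")
        if kind == "SERVICE NOTIFICATION" and len(fields) >= 3:
            contact, host, service = fields[:3]
            pending.setdefault((ts, host, service), []).append(contact)
        elif (kind == "SERVICE DOWNTIME ALERT" and len(fields) >= 3
              and fields[2] == "STARTED"):
            host, service = fields[:2]
            contacts = pending.pop((ts, host, service), None)
            if contacts:
                raced.append((ts, host, service, contacts))
    return raced
```

The sketch only flags notifications written before a downtime start carrying the identical timestamp. It does not flag notifications sent while a downtime was already active, such as the recovery notifications at 00:40:27 in the earlier output.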
scottwilkerson (DevOps Engineer, Nagios Enterprises)
Re: Downtime but Notifications still sent
This smells like a Core bug
http://support.nagios.com/forum/viewtop ... 506#127515
The flexible downtime didn't trigger until after the notification
Likely the same as this old bug report.
http://tracker.nagios.org/view.php?id=568
I believe it only happens with Flexible
Re: Downtime but Notifications still sent
scottwilkerson wrote:This smells like a Core bug
http://support.nagios.com/forum/viewtop ... 506#127515
The flexible downtime didn't trigger until after the notification
Likely the same as this old bug report.
http://tracker.nagios.org/view.php?id=568
I believe it only happens with Flexible
Scott,
Your first link points back to this thread; was that intended? Your tracker link is definitely the same thing, though. I guess the workaround is to not use flexible downtime until the Core issue is resolved.
Thanks!
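For anyone scripting the workaround, here is a minimal Python sketch that builds a fixed (non-flexible) service downtime as a Nagios external command. The field order follows the documented `SCHEDULE_SVC_DOWNTIME` external command; the helper name `fixed_service_downtime` and its arguments are invented for illustration. Whether fixed downtime actually avoids the behavior seen here is the assumption made in this thread and has not been verified.

```python
import time


def fixed_service_downtime(host, service, start, end, author, comment, now=None):
    """Build a SCHEDULE_SVC_DOWNTIME line with fixed=1 (not flexible).

    start and end are epoch seconds. Fields are semicolon separated,
    so host, service and author must not contain ';' or newlines.
    """
    for value in (host, service, author):
        if ";" in value or "\n" in value:
            raise ValueError(f"invalid character in field: {value!r}")
    now = int(time.time()) if now is None else int(now)
    start, end = int(start), int(end)
    fixed = 1        # 1 = fixed window, 0 = flexible
    trigger_id = 0   # 0 = not triggered by another downtime
    duration = end - start  # only consulted for flexible downtime
    return (f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
            f"{start};{end};{fixed};{trigger_id};{duration};{author};{comment}\n")
```

The resulting line would be written to the `command_file` from the nagios.cfg posted earlier (`/usr/local/nagios/var/rw/nagios.cmd`), and `check_external_commands=1` is already set there.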