Downtime but Notifications still sent

Post by **BanditBBS** » Thu Feb 19, 2015 11:53 pm

So, check out these three images... a big downtime was scheduled, and every service marked as in downtime still had notifications sent.

Capture.JPG

Capture2.JPG

Capture3.JPG

Any idea why they were still sent? The first image was just a sample; 100+ services were affected.

Post by **lmiltchev** » Fri Feb 20, 2015 11:04 am

Can you show us the service definition of one of the "problem" services? Can you also make sure the system time/timezone is correct on the server?

Post by **BanditBBS** » Fri Feb 20, 2015 11:23 am

Code:

define service {
    host_name                       rbns-365-dap01
    service_description             DEV - Admin - Apps Listener
    use                             rbns_generic-service-5
    servicegroups                   rbns_dev
    check_command                   check_by_ssh_procs!1:1!1:1!-a APPS_DEV -C tnslsnr -u appldev!!!!!
    check_period                    xi_timeperiod_24x7
    notification_period             xi_timeperiod_24x7
    register                        1
    }

Code:

define service {
    name                                        rbns_generic-service-5
    service_description                         Generic Robins Service(5 Min)
    is_volatile                                 0
    max_check_attempts                          3
    check_interval                              5
    retry_interval                              2
    active_checks_enabled                       1
    passive_checks_enabled                      1
    check_period                                24x7
    parallelize_check                           1
    obsess_over_service                         1
    check_freshness                             0
    event_handler_enabled                       1
    flap_detection_enabled                      1
    process_perf_data                           1
    retain_status_information                   1
    retain_nonstatus_information                1
    notification_interval                       0
    notification_period                         24x7
    notification_options                        w,c,u,r,
    notifications_enabled                       1
    register                                    0
}

Code:

[root@iss-chi-nag05 ~]# cat /etc/php.ini|grep "timezone"
; Defines the default timezone used by the date functions
date.timezone = US/Central
[root@iss-chi-nag05 ~]# date
Fri Feb 20 10:22:45 CST 2015

Post by **lmiltchev** » Fri Feb 20, 2015 11:44 am

Can you also post the contact definition of the sdnagios contact (plus any relevant templates it uses), and the nagios.cfg file? Is your XI server in a distributed environment? Do you have notifications configured as event handlers?

Post by **BanditBBS** » Fri Feb 20, 2015 11:49 am

lmiltchev wrote: Can you also post the contact definition of the sdnagios contact (plus any relevant templates it uses), and the nagios.cfg file? Is your XI server in a distributed environment? Do you have notifications configured as event handlers?

Umm, everything is on one server (except that the DB and NDO are offloaded), no gearman, nothing else special. Could it have been some odd NDO timing issue where the service alert was communicated to the XI server before the downtime, and that allowed the notification to be sent... especially since it was a flex downtime?

Here is the contact and template detail:

Code:

define contact {
    contact_name                            sdnagios
    alias                                   ITC Service Desk (Nagios)
    host_notifications_enabled              1
    service_notifications_enabled           1
    host_notification_period                sdnagios_notification_times
    service_notification_period             sdnagios_notification_times
    host_notification_options               d,u,r,f
    service_notification_options            w,u,c,r,f
    email                                   [email protected]
    host_notifications_enabled              1
    service_notifications_enabled           1
    use                                     xi_contact_generic
    }

Code:

define contact {
    name                                    xi_contact_generic
    contactgroups                           xi_contactgroup_all
    host_notification_period                xi_timeperiod_24x7
    service_notification_period             xi_timeperiod_24x7
    host_notification_options               d,u,r,f,s
    service_notification_options            w,u,c,r,f,s
    host_notification_commands              xi_host_notification_handler
    service_notification_commands           xi_service_notification_handler
    register                                0
    }

Post by **lmiltchev** » Fri Feb 20, 2015 12:22 pm

BanditBBS wrote: Could it have been some odd NDO timing issue where the service alert was communicated to the XI server before the downtime, and that allowed the notification to be sent... especially since it was a flex downtime?

It is possible... I'm not sure.

BTW, you forgot to post the nagios.cfg. Let's take a look at it.

Also, run the following command and show us the output in code wraps:

Code:

grep "DEV - Admin - Apps Listener" /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e'
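
For anyone without Perl handy, roughly the same timestamp conversion can be sketched in Python. This is only an illustration, assuming the usual nagios.log layout where each line starts with `[epoch]`; the function name `humanize_line` is made up for this example and is not part of Nagios.

```python
import re
import time


def humanize_line(line: str) -> str:
    """Replace the first run of digits in a log line with local time.

    Mirrors the Perl one-liner: only the first match (the leading epoch
    timestamp) is converted, later numbers in the line are left alone.
    """
    return re.sub(r"\d+", lambda m: time.ctime(int(m.group())), line, count=1)
```

Feeding each line of the `grep` output through `humanize_line` should give the same bracketed local timestamps as the Perl version, subject to the server's timezone setting.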

Post by **BanditBBS** » Fri Feb 20, 2015 12:27 pm

I modified your command to limit it to just the host in question; otherwise it was three times the data.

Code:

[root@iss-chi-nag05 ~]# grep "DEV - Admin - Apps Listener" /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e'|grep "rbns-365-dap01"
[Thu Feb 19 00:00:00 2015] CURRENT SERVICE STATE: rbns-365-dap01;DEV - Admin - Apps Listener;OK;HARD;1;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:05:11 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;SOFT;1;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:07:10 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;SOFT;2;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;HARD;3;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: rbns_nagios_all;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: sdnagios;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STARTED; Service has entered a period of scheduled downtime
[Fri Feb 20 00:38:29 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STARTED; Service has entered a period of scheduled downtime
[Fri Feb 20 00:40:27 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;OK;HARD;3;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Fri Feb 20 00:40:27 2015] SERVICE NOTIFICATION: rbns_nagios_all;rbns-365-dap01;DEV - Admin - Apps Listener;OK;xi_service_notification_handler;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Fri Feb 20 00:40:27 2015] SERVICE NOTIFICATION: sdnagios;rbns-365-dap01;DEV - Admin - Apps Listener;OK;xi_service_notification_handler;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Fri Feb 20 01:09:08 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STOPPED; Service has exited from a period of scheduled downtime
[root@iss-chi-nag05 ~]#

Dangit, it always annoys me when you guys don't read entire posts, and now I missed something... here is my nagios.cfg. I'll be away for 5 minutes beating myself up!

Code:

# MODIFIED
admin_email=root@localhost
admin_pager=root@localhost
translate_passive_host_checks=1
log_event_handlers=0
use_large_installation_tweaks=1
enable_environment_macros=0


# NDOUtils module
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg


# PNP settings - bulk mode with NCPD
process_performance_data=1
# service performance data
service_perfdata_file=/var/nagiosramdisk/service-perfdata

service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file-bulk
# host performance data
host_perfdata_file=/var/nagiosramdisk/host-perfdata

host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata-file-bulk


# OBJECTS - UNMODIFIED
#cfg_file=/usr/local/nagios/etc/objects/commands.cfg
#cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
#cfg_file=/usr/local/nagios/etc/objects/templates.cfg
#cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg


# STATIC OBJECT DEFINITIONS (THESE DON'T GET EXPORTED/IMPORTED BY NAGIOSQL)
cfg_dir=/usr/local/nagios/etc/static

# OBJECTS EXPORTED FROM NAGIOSQL
cfg_file=/usr/local/nagios/etc/contacttemplates.cfg
cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_file=/usr/local/nagios/etc/servicegroups.cfg
cfg_file=/usr/local/nagios/etc/hosttemplates.cfg
cfg_file=/usr/local/nagios/etc/servicetemplates.cfg
cfg_file=/usr/local/nagios/etc/servicedependencies.cfg
cfg_file=/usr/local/nagios/etc/serviceescalations.cfg
cfg_file=/usr/local/nagios/etc/hostdependencies.cfg
cfg_file=/usr/local/nagios/etc/hostescalations.cfg
cfg_file=/usr/local/nagios/etc/hostextinfo.cfg
cfg_file=/usr/local/nagios/etc/serviceextinfo.cfg
cfg_dir=/usr/local/nagios/etc/hosts
cfg_dir=/usr/local/nagios/etc/services

# GLOBAL EVENT HANDLERS
global_host_event_handler=xi_host_event_handler
global_service_event_handler=xi_service_event_handler



# UNMODIFIED
accept_passive_host_checks=1
accept_passive_service_checks=1
additional_freshness_latency=15
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45
bare_update_check=0
cached_host_check_horizon=15
cached_service_check_horizon=15
check_external_commands=1
check_for_orphaned_hosts=1
check_for_orphaned_services=1
check_for_updates=1
check_host_freshness=0
check_result_path=/var/nagiosramdisk/spool/checkresults
check_result_reaper_frequency=10
check_service_freshness=1
check_workers=16
#command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
daemon_dumps_core=0
date_format=us
debug_file=/usr/local/nagios/var/nagios.debug
debug_level=0
debug_verbosity=1
#enable_embedded_perl=1
enable_event_handlers=1
enable_flap_detection=1
enable_notifications=1
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
event_broker_options=-1
event_handler_timeout=30
execute_host_checks=1
execute_service_checks=1
#external_command_buffer_slots=4096
high_host_flap_threshold=50.0
high_service_flap_threshold=50.0
host_check_timeout=30
host_freshness_check_interval=60
host_inter_check_delay_method=s
illegal_macro_output_chars=`~$&|'"<>
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
interval_length=60
lock_file=/usr/local/nagios/var/nagios.lock
log_archive_path=/usr/local/nagios/var/archives
log_external_commands=0
log_file=/usr/local/nagios/var/nagios.log
log_host_retries=1
log_initial_states=0
log_notifications=1
log_passive_checks=0
log_rotation_method=d
log_service_retries=1
low_host_flap_threshold=25.0
low_service_flap_threshold=25.0
max_check_result_file_age=3600
max_check_result_reaper_time=30
max_concurrent_checks=4000
max_debug_file_size=1000000
#max_host_check_spread=30
max_host_check_spread=60
#max_service_check_spread=30
max_service_check_spread=60
nagios_group=nagios
nagios_user=nagios
notification_timeout=30
object_cache_file=/var/nagiosramdisk/objects.cache
status_file=/var/nagiosramdisk/status.dat
temp_path=/var/nagiosramdisk/tmp
obsess_over_hosts=0
obsess_over_services=0
ocsp_timeout=5
#p1_file=/usr/local/nagios/bin/p1.pl
passive_host_checks_are_soft=0
perfdata_timeout=5
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retain_state_information=1
retention_update_interval=60
service_check_timeout=480
service_check_timeout_state=u
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
#sleep_time=0.25
soft_state_dependencies=0
state_retention_file=/usr/local/nagios/var/retention.dat
status_update_interval=10
temp_file=/usr/local/nagios/var/nagios.tmp
use_aggressive_host_checking=0
####use_embedded_perl_implicitly=1
use_regexp_matching=0
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=1
use_true_regexp_matching=0

Post by **BanditBBS** » Fri Feb 20, 2015 3:27 pm

Now that I've had time to look further myself, these four lines in the log are interesting to me:

Code:

[Thu Feb 19 22:09:09 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;HARD;3;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: rbns_nagios_all;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: sdnagios;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STARTED; Service has entered a period of scheduled downtime

The first one shows it going into a HARD state. The next two are the notifications, and the final one is the flexible downtime starting. These all happened at the exact same time, but it looks as though Nagios processed the notifications first and only then started the flexible downtime, instead of kicking on the downtime first (going by the order the lines were written).

Is that how it should process, or is that an ordering bug in the way it was processed?
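
With 100+ services affected, the same pattern can be searched for across the whole log instead of by eye. The following is a minimal sketch, assuming log lines already converted to the `[timestamp] TYPE: field;field;...` form shown above; `find_races` and `LINE_RE` are hypothetical names for this example, and "same timestamp, notification logged first" is only a heuristic for the ordering problem, not a confirmed diagnosis.

```python
import re

# [timestamp] ENTRY TYPE: semicolon-separated fields
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] (?P<kind>[A-Z ]+): (?P<rest>.*)$")


def find_races(lines):
    """Return (timestamp, host, service) tuples where a SERVICE NOTIFICATION
    was logged before a SERVICE DOWNTIME ALERT ... STARTED entry carrying
    the same timestamp for the same host and service."""
    notified = set()  # (timestamp, host, service) seen in a notification so far
    races = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts, kind = m.group("ts"), m.group("kind")
        fields = m.group("rest").split(";")
        if kind == "SERVICE NOTIFICATION" and len(fields) >= 3:
            # fields: contact;host;service;state;command;output
            notified.add((ts, fields[1], fields[2]))
        elif (kind == "SERVICE DOWNTIME ALERT" and len(fields) >= 3
              and fields[2] == "STARTED"):
            # fields: host;service;STARTED;message
            key = (ts, fields[0], fields[1])
            if key in notified and key not in races:
                races.append(key)
    return races
```

Run against the four lines above, this should flag rbns-365-dap01 / DEV - Admin - Apps Listener at Thu Feb 19 22:09:09 2015 once, even though two contacts were notified.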

Post by **scottwilkerson** » Fri Feb 20, 2015 3:41 pm

This smells like a Core bug
http://support.nagios.com/forum/viewtop ... 506#127515

The flexible downtime didn't trigger until after the notification

Likely the same as this old bug report.
http://tracker.nagios.org/view.php?id=568

I believe it only happens with Flexible

Post by **BanditBBS** » Fri Feb 20, 2015 3:53 pm

scottwilkerson wrote: This smells like a Core bug
http://support.nagios.com/forum/viewtop ... 506#127515

The flexible downtime didn't trigger until after the notification

Likely the same as this old bug report.
http://tracker.nagios.org/view.php?id=568

I believe it only happens with Flexible

Scott,

Your first link points to this thread again; was that intended? Your tracker link is definitely the same thing, though. I guess the workaround is to not use flexible downtime until the Core issue is resolved.

Thanks!
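
For reference, a fixed (non-flexible) service downtime can also be scheduled through the Nagios external command interface with `SCHEDULE_SVC_DOWNTIME` and its `fixed` field set to 1. The helper below is only a sketch: `fixed_svc_downtime_cmd` is a made-up name, and the author and comment values in the usage note are placeholders.

```python
import time


def fixed_svc_downtime_cmd(host, service, start, end, author, comment, now=None):
    """Build a SCHEDULE_SVC_DOWNTIME external command line for FIXED downtime.

    start/end are epoch seconds. Field order follows the Nagios external
    command: host;service;start;end;fixed;trigger_id;duration;author;comment
    """
    now = int(time.time()) if now is None else now
    fixed = 1               # 1 = fixed window, 0 = flexible
    trigger_id = 0          # 0 = not triggered by another downtime entry
    duration = end - start  # only meaningful for flexible downtime, but still a required field
    return (f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
            f"{start};{end};{fixed};{trigger_id};{duration};{author};{comment}")
```

The resulting line, followed by a newline, would be written to the command file, which is `/usr/local/nagios/var/rw/nagios.cmd` according to the `command_file` setting in the nagios.cfg posted earlier. For example, `fixed_svc_downtime_cmd("rbns-365-dap01", "DEV - Admin - Apps Listener", start, end, "someuser", "maintenance")`.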

Nagios Support Forum