Downtime but Notifications still sent

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Post by BanditBBS »

So, check out these three images....big downtime scheduled and every service marked as in downtime still had notifications sent....
Capture.JPG
Capture2.JPG
Capture3.JPG
Any idea why they were still sent? The first image was just a sample, 100+ services were affected :(
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Downtime but Notifications still sent

Post by lmiltchev »

Can you show us the service definition of one of the "problem" services? Can you also make sure the system time/timezone is correct on the server?
Be sure to check out our Knowledgebase for helpful articles and solutions!

Post by BanditBBS »

Code: Select all

define service {
    host_name                       rbns-365-dap01
    service_description             DEV - Admin - Apps Listener
    use                             rbns_generic-service-5
    servicegroups                   rbns_dev
    check_command                   check_by_ssh_procs!1:1!1:1!-a APPS_DEV -C tnslsnr -u appldev!!!!!
    check_period                    xi_timeperiod_24x7
    notification_period             xi_timeperiod_24x7
    register                        1
    }

Code: Select all

define service {
    name                                        rbns_generic-service-5
    service_description                         Generic Robins Service(5 Min)
    is_volatile                                 0
    max_check_attempts                          3
    check_interval                              5
    retry_interval                              2
    active_checks_enabled                       1
    passive_checks_enabled                      1
    check_period                                24x7
    parallelize_check                           1
    obsess_over_service                         1
    check_freshness                             0
    event_handler_enabled                       1
    flap_detection_enabled                      1
    process_perf_data                           1
    retain_status_information                   1
    retain_nonstatus_information                1
    notification_interval                       0
    notification_period                         24x7
    notification_options                        w,c,u,r,
    notifications_enabled                       1
    register                                    0
    }

Code: Select all

[root@iss-chi-nag05 ~]# cat /etc/php.ini|grep "timezone"
; Defines the default timezone used by the date functions
date.timezone = US/Central
[root@iss-chi-nag05 ~]# date
Fri Feb 20 10:22:45 CST 2015

Post by lmiltchev »

Can you also post the contact definition of sdnagios contact (+ any relevant templates that it is using), and the nagios.cfg file? Is your XI server in a distributed environment? Do you have notifications configured as event handlers?

Post by BanditBBS »

lmiltchev wrote: Can you also post the contact definition of sdnagios contact (+ any relevant templates that it is using), and the nagios.cfg file? Is your XI server in a distributed environment? Do you have notifications configured as event handlers?
Umm, everything is on one server (except that the DB and NDO are offloaded), no gearman, nothing else special. Could it have been some odd NDO timing issue, where the service alert was communicated to the XI server before the downtime, and that allowed the notification to be sent... especially since it was a flex downtime?

Here is the contact and template detail:

Code: Select all

define contact {
    contact_name                            sdnagios
    alias                                   ITC Service Desk (Nagios)
    host_notifications_enabled              1
    service_notifications_enabled           1
    host_notification_period                sdnagios_notification_times
    service_notification_period             sdnagios_notification_times
    host_notification_options               d,u,r,f
    service_notification_options            w,u,c,r,f
    email                                   [email protected]
    host_notifications_enabled              1
    service_notifications_enabled           1
    use                                     xi_contact_generic
    }

Code: Select all

define contact {
    name                                    xi_contact_generic
    contactgroups                           xi_contactgroup_all
    host_notification_period                xi_timeperiod_24x7
    service_notification_period             xi_timeperiod_24x7
    host_notification_options               d,u,r,f,s
    service_notification_options            w,u,c,r,f,s
    host_notification_commands              xi_host_notification_handler
    service_notification_commands           xi_service_notification_handler
    register                                0
    }

Post by lmiltchev »

Could it have been some odd NDO timing issue, where the service alert was communicated to the XI server before the downtime, and that allowed the notification to be sent... especially since it was a flex downtime?
It is possible... I'm not sure.

BTW, you forgot to post the nagios.cfg. Let's take a look at it.

Also, run the following command and show us the output in code wraps:

Code: Select all

grep "DEV - Admin - Apps Listener" /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e'
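For anyone without Perl handy, here is a rough Python sketch of what the perl part of that one-liner does: it replaces the first run of digits on each line (the epoch timestamp at the start of a nagios.log entry) with a human-readable time. The function name `humanize` and the optional `tz` argument are my own choices for illustration, not part of any Nagios tooling. Note that single-digit days come out zero-padded here, whereas Perl's localtime pads them with a space.

```python
import re
from datetime import datetime, timezone


def humanize(line, tz=None):
    """Replace the first run of digits in `line` with a readable timestamp.

    Mirrors perl -pe 's/(\d+)/localtime($1)/e'. With tz=None the server's
    local timezone is used; pass timezone.utc for a fixed zone.
    """
    def repl(match):
        stamp = datetime.fromtimestamp(int(match.group(1)), tz)
        return stamp.strftime("%a %b %d %H:%M:%S %Y")

    # count=1 so only the leading epoch value is converted, as in the perl version
    return re.sub(r"(\d+)", repl, line, count=1)
```

To use it, pipe the grep output through a small loop that calls `humanize(line)` on each line read from standard input.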

Post by BanditBBS »

I modified your command to limit to just the host in question....otherwise it was 3 times the data :)

Code: Select all

[root@iss-chi-nag05 ~]# grep "DEV - Admin - Apps Listener" /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e'|grep "rbns-365-dap01"
[Thu Feb 19 00:00:00 2015] CURRENT SERVICE STATE: rbns-365-dap01;DEV - Admin - Apps Listener;OK;HARD;1;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:05:11 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;SOFT;1;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:07:10 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;SOFT;2;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;HARD;3;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: rbns_nagios_all;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: sdnagios;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STARTED; Service has entered a period of scheduled downtime
[Fri Feb 20 00:38:29 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STARTED; Service has entered a period of scheduled downtime
[Fri Feb 20 00:40:27 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;OK;HARD;3;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Fri Feb 20 00:40:27 2015] SERVICE NOTIFICATION: rbns_nagios_all;rbns-365-dap01;DEV - Admin - Apps Listener;OK;xi_service_notification_handler;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Fri Feb 20 00:40:27 2015] SERVICE NOTIFICATION: sdnagios;rbns-365-dap01;DEV - Admin - Apps Listener;OK;xi_service_notification_handler;PROCS OK: 1 process with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Fri Feb 20 01:09:08 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STOPPED; Service has exited from a period of scheduled downtime
[root@iss-chi-nag05 ~]#
Dangit, it always annoys me when you guys don't read entire posts, and now I missed something... here is my nagios.cfg... I'll be away for 5 minutes beating myself up!

Code: Select all

# MODIFIED
admin_email=root@localhost
admin_pager=root@localhost
translate_passive_host_checks=1
log_event_handlers=0
use_large_installation_tweaks=1
enable_environment_macros=0


# NDOUtils module
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg


# PNP settings - bulk mode with NCPD
process_performance_data=1
# service performance data
service_perfdata_file=/var/nagiosramdisk/service-perfdata

service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file-bulk
# host performance data
host_perfdata_file=/var/nagiosramdisk/host-perfdata

host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata-file-bulk


# OBJECTS - UNMODIFIED
#cfg_file=/usr/local/nagios/etc/objects/commands.cfg
#cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
#cfg_file=/usr/local/nagios/etc/objects/templates.cfg
#cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg


# STATIC OBJECT DEFINITIONS (THESE DON'T GET EXPORTED/IMPORTED BY NAGIOSQL)
cfg_dir=/usr/local/nagios/etc/static

# OBJECTS EXPORTED FROM NAGIOSQL
cfg_file=/usr/local/nagios/etc/contacttemplates.cfg
cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_file=/usr/local/nagios/etc/servicegroups.cfg
cfg_file=/usr/local/nagios/etc/hosttemplates.cfg
cfg_file=/usr/local/nagios/etc/servicetemplates.cfg
cfg_file=/usr/local/nagios/etc/servicedependencies.cfg
cfg_file=/usr/local/nagios/etc/serviceescalations.cfg
cfg_file=/usr/local/nagios/etc/hostdependencies.cfg
cfg_file=/usr/local/nagios/etc/hostescalations.cfg
cfg_file=/usr/local/nagios/etc/hostextinfo.cfg
cfg_file=/usr/local/nagios/etc/serviceextinfo.cfg
cfg_dir=/usr/local/nagios/etc/hosts
cfg_dir=/usr/local/nagios/etc/services

# GLOBAL EVENT HANDLERS
global_host_event_handler=xi_host_event_handler
global_service_event_handler=xi_service_event_handler



# UNMODIFIED
accept_passive_host_checks=1
accept_passive_service_checks=1
additional_freshness_latency=15
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45
bare_update_check=0
cached_host_check_horizon=15
cached_service_check_horizon=15
check_external_commands=1
check_for_orphaned_hosts=1
check_for_orphaned_services=1
check_for_updates=1
check_host_freshness=0
check_result_path=/var/nagiosramdisk/spool/checkresults
check_result_reaper_frequency=10
check_service_freshness=1
check_workers=16
#command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
daemon_dumps_core=0
date_format=us
debug_file=/usr/local/nagios/var/nagios.debug
debug_level=0
debug_verbosity=1
#enable_embedded_perl=1
enable_event_handlers=1
enable_flap_detection=1
enable_notifications=1
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
event_broker_options=-1
event_handler_timeout=30
execute_host_checks=1
execute_service_checks=1
#external_command_buffer_slots=4096
high_host_flap_threshold=50.0
high_service_flap_threshold=50.0
host_check_timeout=30
host_freshness_check_interval=60
host_inter_check_delay_method=s
illegal_macro_output_chars=`~$&|'"<>
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
interval_length=60
lock_file=/usr/local/nagios/var/nagios.lock
log_archive_path=/usr/local/nagios/var/archives
log_external_commands=0
log_file=/usr/local/nagios/var/nagios.log
log_host_retries=1
log_initial_states=0
log_notifications=1
log_passive_checks=0
log_rotation_method=d
log_service_retries=1
low_host_flap_threshold=25.0
low_service_flap_threshold=25.0
max_check_result_file_age=3600
max_check_result_reaper_time=30
max_concurrent_checks=4000
max_debug_file_size=1000000
#max_host_check_spread=30
max_host_check_spread=60
#max_service_check_spread=30
max_service_check_spread=60
nagios_group=nagios
nagios_user=nagios
notification_timeout=30
object_cache_file=/var/nagiosramdisk/objects.cache
status_file=/var/nagiosramdisk/status.dat
temp_path=/var/nagiosramdisk/tmp
obsess_over_hosts=0
obsess_over_services=0
ocsp_timeout=5
#p1_file=/usr/local/nagios/bin/p1.pl
passive_host_checks_are_soft=0
perfdata_timeout=5
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retain_state_information=1
retention_update_interval=60
service_check_timeout=480
service_check_timeout_state=u
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
#sleep_time=0.25
soft_state_dependencies=0
state_retention_file=/usr/local/nagios/var/retention.dat
status_update_interval=10
temp_file=/usr/local/nagios/var/nagios.tmp
use_aggressive_host_checking=0
####use_embedded_perl_implicitly=1
use_regexp_matching=0
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=1
use_true_regexp_matching=0

Post by BanditBBS »

Now that I've had time to look further myself, these four lines in the log are interesting to me:

Code: Select all

[Thu Feb 19 22:09:09 2015] SERVICE ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;HARD;3;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: rbns_nagios_all;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE NOTIFICATION: sdnagios;rbns-365-dap01;DEV - Admin - Apps Listener;CRITICAL;xi_service_notification_handler;PROCS CRITICAL: 0 processes with args 'APPS_DEV', command name 'tnslsnr', UID = 401 (appldev)
[Thu Feb 19 22:09:09 2015] SERVICE DOWNTIME ALERT: rbns-365-dap01;DEV - Admin - Apps Listener;STARTED; Service has entered a period of scheduled downtime
The first one shows it going into a HARD state. The next two are the notifications, and the final one is the flexible downtime starting. These all happened in the exact same second, but going by the order in which the lines were written, it looks as though Nagios sent the notifications first and only then started the flexible downtime, instead of starting the downtime first.

Is that how it should process, or is that an ordering bug in the way it was processed?
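With 100+ services affected, it may help to check the whole log for this pattern rather than eyeballing one service. Below is a minimal Python sketch, assuming the log lines have already been run through the timestamp conversion shown earlier in the thread (so they start with something like `[Thu Feb 19 22:09:09 2015]`). It flags every SERVICE NOTIFICATION that was logged in the same second as, but before, a SERVICE DOWNTIME ALERT ... STARTED for the same host and service. The function names are made up for illustration.

```python
import re
from datetime import datetime

# "[<timestamp>] <ENTRY TYPE>: <semicolon-separated fields>"
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] (?P<kind>[A-Z ]+): (?P<body>.*)$")


def parse(line):
    """Split one converted nagios.log line into (time, entry type, fields)."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    ts = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
    return ts, m["kind"], m["body"].split(";")


def notifications_before_downtime_start(lines):
    """Return (time, host, service, contact) for each notification logged in
    the same second as, and ahead of, a downtime start for that service."""
    seen = []   # notifications encountered so far, in log order
    hits = []
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        ts, kind, fields = rec
        if kind == "SERVICE NOTIFICATION" and len(fields) >= 3:
            # fields: contact;host;service;state;command;output
            seen.append((ts, fields[1], fields[2], fields[0]))
        elif (kind == "SERVICE DOWNTIME ALERT" and len(fields) >= 3
              and fields[2] == "STARTED"):
            # fields: host;service;STARTED;message
            hits.extend(n for n in seen
                        if n[0] == ts and n[1] == fields[0] and n[2] == fields[1])
    return hits
```

Run against the four lines above, this should report the notifications to rbns_nagios_all and sdnagios at 22:09:09. It only describes what the log shows; it does not tell us whether Core intended that ordering.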
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Post by scottwilkerson »

This smells like a Core bug:
http://support.nagios.com/forum/viewtop ... 506#127515

The flexible downtime didn't trigger until after the notification.

It is likely the same as this old bug report:
http://tracker.nagios.org/view.php?id=568

I believe it only happens with flexible downtime.
Former Nagios employee

Post by BanditBBS »

scottwilkerson wrote:This smells like a Core bug
http://support.nagios.com/forum/viewtop ... 506#127515

The flexible downtime didn't trigger until after the notification

Likely the same as this old bug report.
http://tracker.nagios.org/view.php?id=568

I believe it only happens with Flexible
Scott,

Your first link points back to this thread; was that intended? Your tracker link is definitely the same thing, though. I guess the workaround is to not use flexible downtime until the Core issue is resolved.

Thanks!
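For reference, a fixed downtime can also be scheduled through the external command file with SCHEDULE_SVC_DOWNTIME, setting the `fixed` field to 1. The following is a minimal Python sketch, not a tested fix for the bug: the helper names and the author value `nagiosadmin` are hypothetical, and the command file path is the `command_file` value from the nagios.cfg posted above.

```python
import time

# command_file from the nagios.cfg posted earlier in this thread
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def fixed_svc_downtime_command(host, service, start, end, author, comment, now=None):
    """Build a SCHEDULE_SVC_DOWNTIME external command for a FIXED downtime.

    Field order: host;service;start;end;fixed;trigger_id;duration;author;comment
    fixed=1 makes the downtime start at `start` regardless of service state;
    trigger_id=0 means it is not triggered by another downtime.
    """
    now = int(time.time()) if now is None else int(now)
    duration = int(end) - int(start)  # only meaningful for flexible downtime
    return (f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
            f"{int(start)};{int(end)};1;0;{duration};{author};{comment}")


def submit(command, command_file=COMMAND_FILE):
    """Write one command line to the Nagios external command pipe."""
    with open(command_file, "w") as pipe:
        pipe.write(command + "\n")
```

Usage would be along the lines of `submit(fixed_svc_downtime_command("rbns-365-dap01", "DEV - Admin - Apps Listener", start, end, "nagiosadmin", "Maintenance"))`, with `start` and `end` as epoch seconds. Whether a fixed downtime fully avoids the ordering problem is only what the replies above suggest, so it is worth verifying in a test environment first.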
Locked