Nagios Core 4.4.3 notification bug
Posted: Thu Aug 15, 2019 2:00 am
Hi,
I believe I've found a reproducible bug. In a nutshell: a contact is not notified even though the event passes all the filters (https://assets.nagios.com/downloads/nag ... tions.html).
To reproduce it, you need a critical service event and at least two contacts with different notification periods, like this:
define host {
host_name ups.acme.local
hostgroups dev-apc
contact_groups +ng_technician_ups,ng_monitoring
notification_period weekday-7-16
}
define service{
hostgroup_name dev-apc
service_description APC
check_command check_snmp_apcups
contact_groups +ng_technician_ups,ng_monitoring_test
first_notification_delay 0 ; For testing purposes
}
#### UPS technicians want to be notified almost immediately when a problem arises
define contactgroup {
contactgroup_name ng_technician_ups
members nc_technician_ups
}
define contact{
contact_name nc_technician_ups
alias nc_technician_ups
service_notification_period 24x7
host_notification_period 24x7
service_notification_options u,w,c,r
host_notification_options u,d,r
host_notification_commands notify-acmeups-host
service_notification_commands notify-acmeups-service
email somebody@somewhere.local
}
#### This contact group should only be notified on weekdays between 08:10 and 16:00
define contactgroup{
contactgroup_name ng_monitoring_test
members nc_monitoring_test
}
define contact{
contact_name nc_monitoring_test
alias nc_monitoring_test
service_notification_period test
host_notification_period test
host_notification_options d,r
service_notification_options u,c,w,r
host_notification_commands notify-monitoring-host
service_notification_commands notify-monitoring-service
email me@somewhere.local
}
define timeperiod{
timeperiod_name test
alias test
monday 08:10-16:00
tuesday 08:10-16:00
wednesday 08:10-16:00
thursday 08:10-16:00
friday 08:10-16:00
}
##### Here is what happens
I simulate a critical event (active checks disabled, passive check results submitted) between 07:57:46 and 07:59:14.
The service goes into a HARD state and only nc_technician_ups is notified, at 07:59:14. At that very moment this is fine, because the notification period of nc_monitoring_test (the member of "ng_monitoring_test") does not start until 08:10.
2019.08.15. 07:57:46 EXTERNAL COMMAND: DISABLE_SVC_CHECK;ups.acme.local;APC
2019.08.15. 07:58:19 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ups.acme.local;APC;2;test|
2019.08.15. 07:58:19 SERVICE ALERT: ups.acme.local;APC;CRITICAL;SOFT;1;test
2019.08.15. 07:58:50 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ups.acme.local;APC;2;test|
2019.08.15. 07:58:50 PASSIVE SERVICE CHECK: ups.acme.local;APC;2;test
2019.08.15. 07:58:50 SERVICE ALERT: ups.acme.local;APC;CRITICAL;SOFT;2;test
2019.08.15. 07:59:02 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ups.acme.local;APC;2;test|
2019.08.15. 07:59:02 PASSIVE SERVICE CHECK: ups.acme.local;APC;2;test
2019.08.15. 07:59:02 SERVICE ALERT: ups.acme.local;APC;CRITICAL;SOFT;3;test
2019.08.15. 07:59:14 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ups.acme.local;APC;2;test|
2019.08.15. 07:59:14 PASSIVE SERVICE CHECK: ups.acme.local;APC;2;test
2019.08.15. 07:59:14 SERVICE NOTIFICATION: nc_technician_ups;ups.acme.local;APC;CRITICAL;notify-acmeups-servicee;test
2019.08.15. 07:59:14 SERVICE ALERT: ups.acme.local;APC;CRITICAL;HARD;4;test
As time passes (08:10, 08:15, 08:20, ...), no notification is ever sent to "ng_monitoring_test".
###### Aftermath
When I submit an OK result, both contacts are notified.
2019.08.15. 08:31:37 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ups.acme.local;APC;0;test|
2019.08.15. 08:31:37 PASSIVE SERVICE CHECK: ups.acme.local;APC;2;test
2019.08.15. 08:31:37 SERVICE NOTIFICATION: nc_technician_ups;ups.acme.local;APC;OK;notify-acmeups-service;test
2019.08.15. 08:31:37 SERVICE NOTIFICATION: nc_monitoring_test;ups.acme.local;APC;OK;notify-monitoring-service;test
2019.08.15. 08:31:37 SERVICE ALERT: ups.acme.local;APC;OK;HARD;4;test
I think this is a notification bug, because it violates the notification rules:
1) At 08:10:00 the service is still in the CRITICAL state, and both the service and the "nc_monitoring_test" contact pass all the filters, but the problem notification is not sent out.
2) The Nagios docs state that "It doesn't make sense to get a recovery notification for something you never knew was a problem." In this scenario, nc_monitoring_test is notified about the recovery even though it was never notified about the problem.
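To make point 1 concrete, here is a minimal Python sketch of the contact time period filter only. It is not Nagios code and says nothing about how Nagios schedules re-notifications. The names `TEST_PERIOD` and `in_timeperiod` are made up for illustration, and treating the end of each range as exclusive is an assumption. The timestamps are the ones from the logs above.

```python
from datetime import datetime, time

# Hypothetical model of the "test" timeperiod from the post:
# weekday index (Monday=0 .. Friday=4) -> (start, end) of the allowed range.
TEST_PERIOD = {day: (time(8, 10), time(16, 0)) for day in range(5)}


def in_timeperiod(period, moment):
    """Return True if `moment` falls inside the range defined for its weekday."""
    window = period.get(moment.weekday())
    if window is None:
        return False  # no range defined for this day (Saturday, Sunday)
    start, end = window
    # Assumption: the end of the range is treated as exclusive.
    return start <= moment.time() < end


hard_state = datetime(2019, 8, 15, 7, 59, 14)   # service reaches HARD CRITICAL
window_open = datetime(2019, 8, 15, 8, 10, 0)   # "test" period begins
recovery = datetime(2019, 8, 15, 8, 31, 37)     # OK result is submitted

for label, moment in [
    ("HARD state", hard_state),
    ("window opens", window_open),
    ("recovery", recovery),
]:
    print(label, in_timeperiod(TEST_PERIOD, moment))
```

Under these assumptions the check is False at 07:59:14 and True at 08:10:00 and 08:31:37, so from 08:10 on the time period of nc_monitoring_test is no longer a reason to hold back the problem notification.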
The goal is to alert a minimal set of people when a problem happens at any time (7x24), and to alert an additional set of people during business hours (5x8).
In this example, the event started outside the 5x8 window, the "7x24 people" were notified, and as we slid into the 5x8 time window, the "5x8 people" weren't notified.
Thanks for any thoughts!
Have a nice day.
Szabolcs