We run Nagios 4.3.1 and have found that it does not send a recovery notification for a service once scheduled downtime has ended. With debug output enabled for notifications, Nagios logs "We shouldn't notify about this recovery" when it performs the notification viability test at the time of recovery.
We believe this has been occurring ever since we enabled Scheduled Downtime notifications many months ago, but we did not notice it until it began to show up in our reports. It happens for every service that enters Scheduled Downtime in an OK state, goes to WARN/CRIT during the downtime, and is still in that WARN/CRIT state when the downtime ends. The service's return to OK is never notified, which breaks our automated resolution process.
Looking through the source, I found two places that emit this debug message. I assume this happens because the svc->notified_on flag is either not set or is set to 0. Is that correct?
https://github.com/NagiosEnterprises/na ... #L540-L543
https://github.com/NagiosEnterprises/na ... #L703-L707
Any idea what we might have misconfigured?
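To make the suspicion about svc->notified_on concrete, here is a rough toy model in Python. It is not the Nagios source: the `Service` class, the `notify` function, and the idea that `notified_on` is a bitmask updated only by NORMAL problem notifications are all assumptions made for illustration. Under those assumptions it reproduces the sequence we see.

```python
# Toy model of the suspected behaviour; NOT the Nagios source code.
from dataclasses import dataclass

STATE_OK, STATE_WARNING, STATE_CRITICAL = 0, 1, 2


@dataclass
class Service:
    current_state: int = STATE_OK
    in_downtime: bool = False
    notified_on: int = 0  # assumed: bitmask of problem states already notified


def notify(svc: Service, ntype: str) -> bool:
    """Return True if a notification of this type would be sent in this model."""
    if ntype in ("DOWNTIMESTART", "DOWNTIMEEND"):
        # Assumption: downtime notices go out but never update notified_on.
        return True
    if svc.in_downtime:
        return False  # NORMAL notifications are suppressed during downtime
    if svc.current_state == STATE_OK:
        if svc.notified_on == 0:
            return False  # "We shouldn't notify about this recovery."
        svc.notified_on = 0
        return True
    svc.notified_on |= 1 << svc.current_state  # remember the problem state
    return True


def replay_timeline() -> list:
    """Replay the four events from the example timeline."""
    svc = Service()
    results = []
    svc.in_downtime = True
    results.append(("DOWNTIMESTART", notify(svc, "DOWNTIMESTART")))
    svc.current_state = STATE_CRITICAL
    results.append(("NORMAL/CRITICAL", notify(svc, "NORMAL")))
    svc.in_downtime = False
    results.append(("DOWNTIMEEND", notify(svc, "DOWNTIMEEND")))
    svc.current_state = STATE_OK
    results.append(("NORMAL/OK", notify(svc, "NORMAL")))
    return results


for event, sent in replay_timeline():
    print(event, "sent" if sent else "not sent")
```

In this model the CRITICAL state is never the subject of a NORMAL notification, so `notified_on` stays 0 and the recovery is dropped. Whether Nagios itself behaves this way is exactly the question.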
---------------------------------------------------------------------------------------------
Example timeline:
Wednesday, April 1, 2020 6:59:59 PM GMT-04:00 DST
- Service is in OK state. Service enters Scheduled Downtime. A DOWNTIMESTART notification is sent.
Wednesday, April 1, 2020 7:35:07 PM GMT-04:00 DST
- Check goes into CRITICAL state while in Scheduled Downtime. Notification is not sent, as expected.
Thursday, April 2, 2020 7:05:00 AM GMT-04:00 DST
- Scheduled Downtime ends. A DOWNTIMEEND notification is sent, reporting the check in a CRITICAL state.
Thursday, April 2, 2020 7:05:08 AM GMT-04:00 DST
- Check returns to OK state, but no notification is sent. The debug log shows "We shouldn't notify about this recovery."
Nagios log:
[1585781999] SERVICE DOWNTIME ALERT: host150;Check_Process_Nodename_1__TP2_host150;STARTED; Service has entered a period of scheduled downtime
[1585781999] SERVICE NOTIFICATION: nagiosadmin;host150;Check_Process_Nodename_1__TP2_host150;DOWNTIMESTART (OK);notify-service-by-webhook;PROCS OK: 1 process with args '...'
[1585784107] SERVICE ALERT: host150;Check_Process_Nodename_1__TP2_host150;CRITICAL;HARD;1;PROCS CRITICAL: 0 processes with args '...'
[1585825499] SERVICE DOWNTIME ALERT: host150;Check_Process_Nodename_1__TP2_host150;STOPPED; Service has exited from a period of scheduled downtime
[1585825500] SERVICE NOTIFICATION: nagiosadmin;host150;Check_Process_Nodename_1__TP2_host150;DOWNTIMEEND (CRITICAL);notify-service-by-webhook;PROCS CRITICAL: 0 processes with args '...'
[1585825508] SERVICE ALERT: host150;Check_Process_Nodename_1__TP2_host150;OK;HARD;1;PROCS OK: 1 process with args '...'
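For anyone matching the log entries to the timeline, the epoch timestamps can be converted with a few lines of Python. This is only a convenience sketch: the fixed GMT-04:00 offset is taken from the timeline above, and `epoch_to_local` is a helper name made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset copied from the timeline (GMT-04:00); no tz database lookup.
EDT = timezone(timedelta(hours=-4))


def epoch_to_local(ts: int) -> str:
    """Format a Nagios log epoch timestamp as local wall-clock time."""
    return datetime.fromtimestamp(ts, tz=EDT).strftime("%Y-%m-%d %H:%M:%S")


# Timestamps from the log excerpt above.
for ts in (1585781999, 1585784107, 1585825499, 1585825508):
    print(ts, epoch_to_local(ts))
```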
Debug log (notifications):
[1585825499.999996] [032.0] [pid=62128] ** Service Notification Attempt ** Host: 'host150', Service: 'Check_Process_Nodename_1__TP2_host150', Type: DOWNTIMEEND, Options: 0, Current State: 2, Last Notification: Wed Dec 31 19:00:00 1969
[1585825500.000060] [032.0] [pid=62128] Notification viability test passed.
[1585825500.000062] [032.1] [pid=62128] Current notification number: 0 (unchanged)
[1585825500.000065] [032.2] [pid=62128] Creating list of contacts to be notified.
[1585825500.000067] [032.1] [pid=62128] Service notification will NOT be escalated.
[1585825500.000070] [032.1] [pid=62128] Adding normal contacts for service to notification list.
[1585825500.000082] [032.2] [pid=62128] Adding members of contact group 'nagiosadmin' for service to notification list.
[1585825500.000085] [032.2] [pid=62128] ** Checking service notification viability for contact 'nagiosadmin'...
[1585825500.000091] [032.2] [pid=62128] Adding contact 'nagiosadmin' to notification list.
[1585825500.000105] [032.2] [pid=62128] ** Notifying contact 'nagiosadmin'
[1585825500.000110] [032.2] [pid=62128] Raw notification command: /usr/bin/notification ...
[1585825500.000163] [032.0] [pid=62128] 1 contacts were notified.
[1585825508.282782] [032.0] [pid=62128] ** Service Notification Attempt ** Host: 'host150', Service: 'Check_Process_Nodename_1__TP2_host150', Type: NORMAL, Options: 0, Current State: 0, Last Notification: Wed Dec 31 19:00:00 1969
[1585825508.282883] [032.1] [pid=62128] We shouldn't notify about this recovery.
[1585825508.282887] [032.0] [pid=62128] Notification viability test failed. No notification will be sent out.
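To check other services for the same pattern, the debug log can be summarized per notification attempt. The sketch below assumes the line format shown in the excerpt above and nothing more; `summarize` and the `ATTEMPT` pattern are names invented for this example, and other Nagios versions may format these lines differently.

```python
import re

# Matches the "Service Notification Attempt" lines as they appear in the excerpt.
ATTEMPT = re.compile(
    r"\*\* Service Notification Attempt \*\* .*?Type: (\w+), .*?Current State: (\d+)"
)


def summarize(lines):
    """Pair each notification attempt with the viability result that follows it."""
    results = []
    current = None
    for line in lines:
        m = ATTEMPT.search(line)
        if m:
            current = [m.group(1), int(m.group(2)), None]
            results.append(current)
        elif current is not None and "Notification viability test" in line:
            current[2] = "passed" if "passed" in line else "failed"
    return [tuple(r) for r in results]


# Four lines copied from the debug log above.
sample = [
    "[1585825499.999996] [032.0] [pid=62128] ** Service Notification Attempt ** Host: 'host150', Service: 'Check_Process_Nodename_1__TP2_host150', Type: DOWNTIMEEND, Options: 0, Current State: 2, Last Notification: Wed Dec 31 19:00:00 1969",
    "[1585825500.000060] [032.0] [pid=62128] Notification viability test passed.",
    "[1585825508.282782] [032.0] [pid=62128] ** Service Notification Attempt ** Host: 'host150', Service: 'Check_Process_Nodename_1__TP2_host150', Type: NORMAL, Options: 0, Current State: 0, Last Notification: Wed Dec 31 19:00:00 1969",
    "[1585825508.282887] [032.0] [pid=62128] Notification viability test failed. No notification will be sent out.",
]

for ntype, state, outcome in summarize(sample):
    print(ntype, state, outcome)
```

A NORMAL attempt with Current State 0 and a failed viability test is the suppressed recovery we are asking about.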
Contact definition:
define contact {
    contact_name                    nagiosadmin
    alias                           nagiosadmin
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    c,r,w,u,f,s
    host_notification_options       d,r,u,f,s
    service_notification_commands   notify-service-by-webhook
    host_notification_commands      notify-host-by-webhook
    host_notifications_enabled      1
    service_notifications_enabled   1
}
Contact group definition:
define contactgroup {
    contactgroup_name   admin
    alias               Administrators
    members             nagiosadmin,linuxadmins
}
Service definition:
define service {
    use                     PRD Service Template
    contacts                appadmin
    notification_period     07:05-19:00 MTWTFxx
    check_command           check_nrpe!5659!check_procs -a '...'
    contact_groups          admin
    host_name               host150
    service_description     Check_Process_Nodename_1__TP2_host150
}