Nagios Core 4.4.3 notification bug

horvathszabolcs
Posts: 3
Joined: Thu Aug 15, 2019 1:05 am

Nagios Core 4.4.3 notification bug

Post by horvathszabolcs »

Hi,

I believe I've found a reproducible bug. In a nutshell: a contact is not notified even though the event passes all the filters (https://assets.nagios.com/downloads/nag ... tions.html).

To reproduce it, you need a critical service event and at least two contacts with different notification periods, like this:

define host {
host_name ups.acme.local
hostgroups dev-apc
contact_groups +ng_technician_ups,ng_monitoring
notification_period weekday-7-16
}

define service{
hostgroup_name dev-apc
service_description APC
check_command check_snmp_apcups
contact_groups +ng_technician_ups,ng_monitoring_test
first_notification_delay 0 # For testing purposes
}

#### UPS technicians wanted to get notifications almost immediately if a problem arises

define contactgroup {
contactgroup_name ng_technician_ups
members nc_technician_ups
}

define contact{
contact_name nc_technician_ups
alias nc_technician_ups
service_notification_period 24x7
host_notification_period 24x7
service_notification_options u,w,c,r
host_notification_options u,d,r
host_notification_commands notify-acmeups-host
service_notification_commands notify-acmeups-service
email somebody@somewhere.local
}

#### This contact group should only be notified on weekdays between 08:10 and 16:00

define contactgroup{
contactgroup_name ng_monitoring_test
members nc_monitoring_test
}

define contact{
contact_name nc_monitoring_test
alias nc_monitoring_test
service_notification_period test
host_notification_period test
host_notification_options d,r
service_notification_options u,c,w,r
host_notification_commands notify-monitoring-host
service_notification_commands notify-monitoring-service
email me@somewhere.local
}

define timeperiod{
timeperiod_name test
alias test
monday 08:10-16:00
tuesday 08:10-16:00
wednesday 08:10-16:00
thursday 08:10-16:00
friday 08:10-16:00
}

#### Here is what happens

A critical event is simulated (active checks disabled, passive check results submitted) between 07:57:46 and 07:59:14.
The service goes to a HARD state and nc_technician_ups is notified immediately at 07:59:14. At that moment this is correct, because the "ng_monitoring_test" notification period doesn't start until 08:10.

2019.08.15. 07:57:46 EXTERNAL COMMAND: DISABLE_SVC_CHECK;ups.acme.local;APC
2019.08.15. 07:58:19 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ups.acme.local;APC;2;test|
2019.08.15. 07:58:19 SERVICE ALERT: ups.acme.local;APC;CRITICAL;SOFT;1;test
2019.08.15. 07:58:50 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ups.acme.local;APC;2;test|
2019.08.15. 07:58:50 PASSIVE SERVICE CHECK: ups.acme.local;APC;2;test
2019.08.15. 07:58:50 SERVICE ALERT: ups.acme.local;APC;CRITICAL;SOFT;2;test
2019.08.15. 07:59:02 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ups.acme.local;APC;2;test|
2019.08.15. 07:59:02 PASSIVE SERVICE CHECK: ups.acme.local;APC;2;test
2019.08.15. 07:59:02 SERVICE ALERT: ups.acme.local;APC;CRITICAL;SOFT;3;test
2019.08.15. 07:59:14 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ups.acme.local;APC;2;test|
2019.08.15. 07:59:14 PASSIVE SERVICE CHECK: ups.acme.local;APC;2;test
2019.08.15. 07:59:14 SERVICE NOTIFICATION: nc_technician_ups;ups.acme.local;APC;CRITICAL;notify-acmeups-service;test
2019.08.15. 07:59:14 SERVICE ALERT: ups.acme.local;APC;CRITICAL;HARD;4;test

As time passes (08:10, 08:15, 08:20, ...), no notification is ever sent to "ng_monitoring_test".

#### Aftermath

When I send an OK state in, both contacts are notified.

2019.08.15. 08:31:37 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;ups.acme.local;APC;0;test|
2019.08.15. 08:31:37 PASSIVE SERVICE CHECK: ups.acme.local;APC;2;test
2019.08.15. 08:31:37 SERVICE NOTIFICATION: nc_technician_ups;ups.acme.local;APC;OK;notify-acmeups-service;test
2019.08.15. 08:31:37 SERVICE NOTIFICATION: nc_monitoring_test;ups.acme.local;APC;OK;notify-monitoring-service;test
2019.08.15. 08:31:37 SERVICE ALERT: ups.acme.local;APC;OK;HARD;4;test


I believe this is a notification bug, because it violates the notification rules:
1) At 08:10:00 the service is still in a CRITICAL state, and both the service and the "nc_monitoring_test" contact pass all the filters, yet no problem notification is sent out.
2) The Nagios docs state that "It doesn't make sense to get a recovery notification for something you never knew was a problem." In this scenario, nc_monitoring_test is notified about the recovery, but was never notified about the problem.

The goal is to alert a minimal set of people around the clock (7x24) and alert an additional set of people only during business hours (5x8).
In the example above, the event started outside the 5x8 window, the "7x24 people" were notified, but as we slid into the 5x8 time window, the "5x8 people" were never notified.

Thanks for any thoughts!

Have a nice day.

Szabolcs
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Nagios Core 4.4.3 notification bug

Post by scottwilkerson »

The service hadn't reached its notification_interval yet, and ANY notifications (other than recovery) will only be triggered for hosts/services once they reach their notification_interval.

You do not have one specified, so it would be 60 minutes, or on the next check after 2019.08.15. 08:59:14 in your example.


notification_interval: This directive is used to define the number of "time units" to wait before re-notifying a contact that this host is still down or unreachable. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. If you set this value to 0, Nagios will not re-notify contacts about problems for this host - only one problem notification will be sent out.
https://assets.nagios.com/downloads/nag ... tions.html

Also see "What Filters Must Be Passed In Order For Notifications To Be Sent?"
https://assets.nagios.com/downloads/nag ... tions.html
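As a sketch of what setting the interval explicitly might look like on the service from the original post (the value 15 is purely illustrative, not taken from this thread):

```
define service{
hostgroup_name dev-apc
service_description APC
check_command check_snmp_apcups
contact_groups +ng_technician_ups,ng_monitoring_test
first_notification_delay 0
notification_interval 15 ; illustrative: re-notify every 15 "time units" while the problem persists
}
```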
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
horvathszabolcs

Re: Nagios Core 4.4.3 notification bug

Post by horvathszabolcs »

Hello,

I don't think notification_interval is the problem, because it is set to zero (we're using service templates and notification_interval is inherited from there; I just checked via the Livestatus interface, and it really is set to zero).

So we have two contacts: the first covers 7x24, the second covers, say, 5x8. Both contacts are assigned to the service, and 7x24 and 5x8 are the service_notification_periods assigned to the respective contacts.

1) A critical event happens during 7x24 hours but before the 5x8 window starts.
2) The first contact (7x24) is notified immediately. That's okay. The 5x8 contact is not notified, because we're outside its service_notification_period.
3) Time goes by and eventually we end up in the 5x8 window. The 5x8 contact should be notified as soon as the 5x8 window begins, if the problem still persists. Is that right?
That is not happening.

4) The service recovers during "5x8" hours. Both contacts are notified of the RECOVERY, even "5x8", who never got a PROBLEM notification before. That's against the notification rules: "Note: Notifications about host or service recoveries are only sent out if a notification was sent out for the original problem. It doesn't make sense to get a recovery notification for something you never knew was a problem."

It's reproducible on 4.4.3 (that's what CentOS ships), and I haven't found any related fixes in 4.4.4.

How can I troubleshoot further?

I've been using Nagios for 15 years and have been looking at this particular issue for a week; it's readily reproducible on a freshly installed machine, so I'm really stuck.

Thanks for any advice.

Regards
Szabolcs


scottwilkerson wrote:The service hadn't reached its notification_interval yet, and ANY notifications (other than recovery) will only be triggered for hosts/services once they reach their notification_interval.

You do not have one specified, so it would be 60 minutes, or on the next check after 2019.08.15. 08:59:14 in your example.


notification_interval: This directive is used to define the number of "time units" to wait before re-notifying a contact that this host is still down or unreachable. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. If you set this value to 0, Nagios will not re-notify contacts about problems for this host - only one problem notification will be sent out.
https://assets.nagios.com/downloads/nag ... tions.html

Also see "What Filters Must Be Passed In Order For Notifications To Be Sent?"
https://assets.nagios.com/downloads/nag ... tions.html
scottwilkerson

Re: Nagios Core 4.4.3 notification bug

Post by scottwilkerson »

If you set this value to 0, Nagios will not re-notify contacts about problems for this host - only one problem notification will be sent out.
So after the first notification (which is tracked per service, not per contact), no others will be sent.
horvathszabolcs

Re: Nagios Core 4.4.3 notification bug

Post by horvathszabolcs »

Hello,

Thanks for your time, Sir!
I now understand that notifications are tracked per host/service, not per host/service-contact pair. Thanks for the clarification.

So, to accomplish what I'm after (every contact gets notified within its own time period), I have to re-enable notification_interval (so that it triggers subsequent notifications while the problem persists), and for contacts where re-notification would cause problems (integrations), I have to filter out the subsequent notifications based on $HOSTPROBLEMID$/$SERVICEPROBLEMID$.
$SERVICEPROBLEMID$ A globally unique number associated with the service's current problem state. Every time a service (or host) transitions from an OK or UP state to a problem state, a global problem ID number is incremented by one (1). This macro will be non-zero if the service is currently a non-OK state. State transitions between non-OK states (e.g. WARNING to CRITICAL) do not cause this problem id to increase. If the service is currently in an OK state, this macro will be set to zero (0). Combined with event handlers, this macro could be used to automatically open trouble tickets when services first enter a problem state.
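The filtering idea above could be sketched as a small shell wrapper around the real notification command. This is only a sketch under assumptions: the function name notify_once, the state directory, and passing $SERVICEPROBLEMID$ as the first argument are all mine, not anything stock in Nagios:

```shell
#!/bin/sh
# Sketch: run the given notification command only the first time a given
# problem ID is seen; repeat calls for the same ID are suppressed.
# Usage: notify_once <problem_id> <command> [args...]
STATE_DIR="${STATE_DIR:-/tmp/nagios_notified}"

notify_once() {
    problem_id=$1; shift          # remaining args: the real notification command
    mkdir -p "$STATE_DIR"
    marker="$STATE_DIR/$problem_id"
    if [ -e "$marker" ]; then
        return 0                  # this problem ID was already notified; suppress
    fi
    : > "$marker"                 # remember this problem ID
    "$@"                          # first notification for this problem: run it
}
```

A hypothetical command definition could then invoke it as notify_once "$SERVICEPROBLEMID$" /path/to/real-notify-script. Note that, per the macro description above, the problem ID is 0 while the service is OK, so recovery notifications would need to bypass this filter (or clean up the marker), which this sketch doesn't handle.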
Thanks again!

Szabolcs
scottwilkerson

Re: Nagios Core 4.4.3 notification bug

Post by scottwilkerson »

If you want to stop notifications at any point, you can also acknowledge the problem.