Recovery notifications sent without problem notification

jrouilla · Post by **jrouilla** » Fri Sep 14, 2018 2:49 pm

Environment: Nagios XI 5.5.1, CentOS 7.5.1804

Problem: A host goes into an unreachable state or a service goes into an unknown state. The contacts for the service/host are not notified because the service/host notification options do not include these states (unknown/unreachable). When the host or service goes back into an OK state, the contact is sent the recovery notification. This is incorrect: recovery notifications should not be sent if the problem notification was not sent.

Using this config:

Code:

define contact {
        contact_name                            jr2
        alias                                   J
        host_notifications_enabled              1
        service_notifications_enabled           1
        host_notification_period                24x7
        service_notification_period             24x7
        host_notification_options               d,r
        service_notification_options            c,r
        email                                   nobody@example.com
        use                                     generic-contact
        }

define hostgroup {
        hostgroup_name                          NotificationTest
        alias                                   Testing notification bug
        }

define service {
        service_description             NotificationTest
        use                             active-passive-service
        active_checks_enabled           1
        hostgroup_name                  NotificationTest
        check_command                   check_dummy!3!notification test!
        register                        1
        }


define host {
        host_name                       notification-test
        use                             active-passive-host
        check_command                   check_dummy!2!notification test!
        active_checks_enabled           1
        parents                         notification-test-upstream
        alias                           notification-test
        address                         192.168.10.10
        hostgroups                      NotificationTest
        contacts                        jr2
        register                        1
        }

define host {
        host_name                       notification-test-upstream
        use                             active-passive-host
        check_command                   check_dummy!2!notification test!
        active_checks_enabled           1
        alias                           notification-test-upstream
        address                         192.168.10.10
        hostgroups                      NotificationTest
        contacts                        jr2
        register                        1
        }

jr2 will not get a notification for the hard unknown state of the service NotificationTest. If you submit a manual OK/clear for the service, the recovery notification is sent. Similarly, let the host notification-test go into a hard unreachable state, then manually clear it. A recovery notification will be sent to jr2.
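
For readers following the configuration above, the effect of jr2's `service_notification_options c,r` can be illustrated with a small C sketch. It is not Nagios Core code: the `OPT_*` flags and both function names are hypothetical, and only the service option letters w, u, c, and r are modeled. It shows why a hard unknown state does not pass these options while a recovery does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag names for this sketch; they are not Nagios Core identifiers. */
enum {
    OPT_WARNING  = 1 << 0,  /* w */
    OPT_UNKNOWN  = 1 << 1,  /* u */
    OPT_CRITICAL = 1 << 2,  /* c */
    OPT_RECOVERY = 1 << 3   /* r */
};

/* Convert a comma-separated service option string such as "c,r" into a bitmask. */
unsigned parse_service_options(const char *opts)
{
    unsigned mask = 0;
    assert(opts != NULL);
    for (const char *p = opts; *p != '\0'; ++p) {
        switch (*p) {
        case 'w': mask |= OPT_WARNING;  break;
        case 'u': mask |= OPT_UNKNOWN;  break;
        case 'c': mask |= OPT_CRITICAL; break;
        case 'r': mask |= OPT_RECOVERY; break;
        default:  break;  /* commas, spaces, and letters this sketch does not model */
        }
    }
    return mask;
}

/* True when the given option flag is enabled in the mask. */
bool option_enabled(unsigned mask, unsigned flag)
{
    return (mask & flag) != 0;
}
```

With the mask parsed from "c,r", `option_enabled(mask, OPT_UNKNOWN)` is false and `option_enabled(mask, OPT_RECOVERY)` is true. The options alone therefore filter the unknown problem but allow the recovery, which is why a separate "was a problem notification sent?" check is needed.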

I cannot replicate this on our older Nagios instance running Nagios XI 2014R2.0 on CentOS 6.5.

In that case submitting a clear/ok for the unreachable state of notification-test or the unknown state of NotificationTest does not send a recovery notification to jr2. This is the desired and documented result per: https://assets.nagios.com/downloads/nag ... tions.html

"The third host or service filter that must be passed is the host- or service-specific notification options. Each service definition contains options that determine whether or not notifications can be sent out for warning states, critical states, and recoveries. Similarly, each host definition contains options that determine whether or not notifications can be sent out when the host goes down, becomes unreachable, or recovers. If the host or service notification does not pass these options, no one gets notified. If it does pass these options, the notification gets passed to the next filter. Note: Notifications about host or service recoveries are only sent out if a notification was sent out for the original problem. It doesn't make sense to get a recovery notification for something you never knew was a problem."

I have verified that the expanded objects as written to objects.cache are the same on both the 2014R2.0 and 5.5.1 instances.
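
The documented rule quoted above (recoveries are only sent if a notification went out for the original problem) can be sketched as a tiny state tracker in C. This is a model of the documented behavior, not the Nagios Core implementation; the `svc_state` enum, the `notify_tracker` struct, and the `on_hard_state` function are all hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical service states for this sketch. */
typedef enum { SVC_OK, SVC_WARNING, SVC_CRITICAL, SVC_UNKNOWN } svc_state;

/* Minimal tracker: which states may notify, and whether a problem notification went out. */
typedef struct {
    bool notify_on[4];      /* indexed by svc_state; notify_on[SVC_OK] means recoveries */
    bool problem_notified;  /* set once a problem notification has actually been sent */
} notify_tracker;

/* Process a hard state change; return true when a notification would be sent. */
bool on_hard_state(notify_tracker *t, svc_state s)
{
    if (s != SVC_OK) {
        if (!t->notify_on[s])
            return false;          /* filtered: nobody hears about this problem */
        t->problem_notified = true;
        return true;
    }

    /* Recovery: only send when enabled and a problem notification preceded it. */
    bool send = t->notify_on[SVC_OK] && t->problem_notified;
    t->problem_notified = false;   /* the problem episode is over either way */
    return send;
}
```

With options equivalent to c,r, feeding UNKNOWN and then OK produces no notifications at all, while feeding CRITICAL and then OK produces both a problem and a recovery notification. The first sequence is the one that behaves differently on 5.5.1 according to the report.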

Any idea what is happening here? This is a showstopper for us deploying the new Nagios instances.

Thanks.

-- rouilj
John Rouillard

ssax · Post by **ssax** » Fri Sep 14, 2018 4:09 pm

I was able to replicate this. A bug report was created here:

https://github.com/NagiosEnterprises/na ... issues/580

Thank you for reporting this!

jrouilla · Post by **jrouilla** » Tue Sep 18, 2018 12:52 pm

Thanks for putting in the bug report on it. When can I expect this to be triaged and worked on?

I was expecting that it would be assigned yesterday.

My boss wants some sort of ETA on the fix for this issue, as it is blocking our rollout of the newer version of Nagios.

Thanks.

-- rouilj

ssax · Post by **ssax** » Tue Sep 18, 2018 4:14 pm

When the developers work on it is up to them and depends on their workload and priorities. Unfortunately, I'm unable to give you an ETA at this time.

You'll need to check the changelog for each release to see whether it has been fixed:

https://www.nagios.com/downloads/nagios-xi/change-log/

If you don't see any movement on the bug report then it likely hasn't been worked on yet:

https://github.com/NagiosEnterprises/na ... issues/580

Thank you

jrouilla · Post by **jrouilla** » Mon Oct 01, 2018 9:25 am

It looks like the recovery is sent even if the issue and the recovery occur during a scheduled downtime.

Also the change in:

https://github.com/NagiosEnterprises/na ... a2a1740509

referenced in the GitHub ticket would seem to apply only to host checks, since it is in the function check_host_notification_viability and only checks a host state without checking any service definition. A similar patch may be needed for:

check_service_notification_viability

to fix https://github.com/NagiosEnterprises/na ... issues/580, if 5c9845 is supposed to fix a similar issue for just hosts.

Also, how am I supposed to find out what issue caused the 5c9845 checkin to occur? Searching nagioscore on GitHub for 5c9845 doesn't turn up any references to the change (even though that string turns up in issue #580).

Thanks.

-- rouilj

ssax · Post by **ssax** » Mon Oct 01, 2018 4:49 pm

Here is the other one for services:

https://github.com/NagiosEnterprises/na ... 664e2b7eed

If you upgrade to the latest version (XI 5.5.4) it should have both of those patches applied. Have you tried upgrading to the latest to see if it resolves your issue?

jrouilla · Post by **jrouilla** » Thu Oct 04, 2018 8:10 am

We built a new test system from scratch with Nagios Core 4.4.2 and Nagios XI 5.5.4.

However, before I spend time setting up the rest of the test, I should say that I don't think the patches are in this Core version.

Nagios core 4.4.2 was released on Aug 16 per: https://www.nagios.org/projects/nagios-core/history/4x/.

The fixes include issue 552, which doesn't reference either of the checkins below. Issue 557 is also fixed but has no checkin references. The two checkins below don't reference any tickets.

The patches you reference were committed on:

August 21 (for services https://github.com/NagiosEnterprises/na ... 664e2b7eed)

and

August 23 (for hosts https://github.com/NagiosEnterprises/na ... a2a1740509)

Why do you expect those would be in the most recent Nagios Core/Nagios XI release? Are you guys using a time machine that we don't know about?

-- rouilj

ssax · Post by **ssax** » Thu Oct 04, 2018 4:27 pm

The patches are applied only when upgrading or installing with the XI 5.5.4 installer. They are not included with Core; they are applied as a patch to the Core sources before compiling.

So for example:

Code:

cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.5.4.tar.gz
tar zxf xi-5.5.4.tar.gz
cd /tmp/nagiosxi/subcomponents/nagioscore
cat patches/fix_service_recovery_email_host_down.patch

Code:

[root@xid nagioscore]# cat patches/fix_service_recovery_email_host_down.patch
diff --git a/base/notifications.c b/base/notifications.c
index d4574c41..74dfc22d 100644
--- a/base/notifications.c
+++ b/base/notifications.c
@@ -591,10 +591,6 @@ int check_service_notification_viability(service *svc, int type, int options) {
                return ERROR;
                }

-       /***** RECOVERY NOTIFICATIONS ARE GOOD TO GO AT THIS POINT *****/
-       if(svc->current_state == STATE_OK)
-               return OK;
-
        /* don't notify contacts about this service problem again if the notification interval is set to 0 */
        if(svc->no_more_notifications == TRUE) {
                log_debug_info(DEBUGL_NOTIFICATIONS, 1, "We shouldn't re-notify contacts about this service problem.\n");
@@ -1501,10 +1497,6 @@ int check_host_notification_viability(host *hst, int type, int options) {
                return ERROR;
                }

-       /***** RECOVERY NOTIFICATIONS ARE GOOD TO GO AT THIS POINT *****/
-       if(hst->current_state == HOST_UP)
-               return OK;
-
        /* check if we shouldn't renotify contacts about the host problem */
        if(hst->no_more_notifications == TRUE) {
                log_debug_info(DEBUGL_NOTIFICATIONS, 1, "We shouldn't re-notify contacts about this host problem.\n");

So the install/upgrade calls /tmp/nagiosxi/subcomponents/nagioscore/apply-patches, which applies the patches:

Code:

[root@xid nagioscore]# cat apply-patches
#!/bin/bash -e

pkgname="$1"

# Custom CGIs
cp patches/cgi/*.c "$pkgname/cgi"

# Makefile mods for Custom CGIs
patch "$pkgname/cgi/Makefile.in" < patches/cgi-makefile.patch

# [PATCH] * Fixed services sending recovery emails when they recover if host in down state (#572) (Scott Wilkerson)
patch "$pkgname/base/notifications.c" < patches/fix_service_recovery_email_host_down.patch
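
The control-flow effect of the patch shown earlier can be modeled in a few lines of C. This is a sketch under assumptions, not the Core source: `mini_service`, `check_viability`, and the `early_return` switch are hypothetical, and it assumes the `no_more_notifications` check rejects the notification, as its log message suggests. It only illustrates that removing the early return lets the later checks apply to recoveries as well; it does not claim to identify which later check suppresses the unwanted recovery.

```c
#include <stdbool.h>

#define VIABLE      0   /* stands in for OK in the patch */
#define NOT_VIABLE -1   /* stands in for ERROR in the patch */

/* Hypothetical reduced service record; only the two fields the diff context mentions. */
typedef struct {
    bool is_ok;                  /* models svc->current_state == STATE_OK */
    bool no_more_notifications;  /* models svc->no_more_notifications == TRUE */
} mini_service;

/* early_return = true models the unpatched flow, false models the patched flow. */
int check_viability(const mini_service *svc, bool early_return)
{
    if (early_return && svc->is_ok)
        return VIABLE;           /* the removed "good to go" shortcut */

    /* Assumption: this check rejects the notification, as its log message suggests. */
    if (svc->no_more_notifications)
        return NOT_VIABLE;

    return VIABLE;
}
```

For a recovered service with `no_more_notifications` set, the unpatched flow returns `VIABLE` without ever reaching the later check, while the patched flow returns `NOT_VIABLE`.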
