First Notification Delay not working on some services

Posted: Fri Sep 07, 2018 5:03 am
by edwardvanhaute
Hi

Lately we have noticed that the "first_notification_delay" setting no longer works correctly for some services.
We have several services set up to send out notifications, all of which inherit their settings from the same service template.
On that template, first_notification_delay is set to 5 minutes to avoid false alerts.
However, the delay is not applied for some (not all) of these services.

Here is the config of a service for which no notification delay happens:

define service {
        host_name                       host_A
        service_description             service_A
        use                             template_customer_A
        register                        1
}
Template used:

define service {
    name                            template_customer_A
    service_description             Customer A CRITITCAL Service through Satellite Nagios
    use                             ALG_generic-service-satnag
    notification_period             SHR_24x7
    contacts                        Jira_24-7
    contact_groups                  customerA-oncall
    register                        0
}
This then uses:

define service {
    name                            ALG_generic-service-satnag
    service_description             Hosting base template for services via a satelllite Nagios
    use                             ALG_generic-service-active
    check_command                   check_dummy!0!"No data received yet."
    initial_state                   u
    max_check_attempts              1
    active_checks_enabled           0
    passive_checks_enabled          1
    register                        0
}
Which in turn uses this one:

define service {
    name                            ALG_generic-service-active
    service_description             Hosting base template for active service
    is_volatile                     0
    initial_state                   u
    max_check_attempts              3
    check_interval                  5
    retry_interval                  1
    active_checks_enabled           1
    passive_checks_enabled          1
    check_period                    SHR_24x7
    obsess_over_service             1
    check_freshness                 0
    event_handler_enabled           1
    flap_detection_enabled          1
    process_perf_data               1
    retain_status_information       1
    retain_nonstatus_information    1
    notification_interval           120
    first_notification_delay        5
    notification_period             SHR_workhours_extend
    notification_options            w,c,r,
    notifications_enabled           1
    contacts                        slack
    register                        0
}
As you can see, "first_notification_delay" is set on the base template and never overridden further down the chain.
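To double-check which values a service ends up with, the `use` chain can be walked by hand. The following is only a rough sketch of that idea, not how Nagios itself resolves templates: it assumes a single parent per template and ignores additive (`+`) inheritance and `null` values. The `TEMPLATES` dict and the `resolve` helper are made up for illustration, and only a few directives from the templates above are reproduced.

```python
# Simplified model of template inheritance: a directive set locally wins
# over the same directive inherited through "use".
TEMPLATES = {
    "ALG_generic-service-active": {
        "max_check_attempts": "3",
        "first_notification_delay": "5",
        "notification_period": "SHR_workhours_extend",
    },
    "ALG_generic-service-satnag": {
        "use": "ALG_generic-service-active",
        "max_check_attempts": "1",
    },
    "template_customer_A": {
        "use": "ALG_generic-service-satnag",
        "notification_period": "SHR_24x7",
    },
}


def resolve(name, templates):
    """Return the effective directives for a template, parents first."""
    definition = templates[name]
    parent = definition.get("use")
    effective = resolve(parent, templates) if parent else {}
    # Local directives override whatever was inherited.
    effective.update({k: v for k, v in definition.items() if k != "use"})
    return effective


effective = resolve("template_customer_A", TEMPLATES)
print(effective["first_notification_delay"])  # 5
print(effective["max_check_attempts"])        # 1
print(effective["notification_period"])       # SHR_24x7
```

Under these assumptions the delay of 5 survives all the way down, while max_check_attempts and notification_period are replaced by the more specific templates.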


Contact definitions:

define contactgroup {
    contactgroup_name    customerA-oncall
    alias                Customer A OnCall
    members              Customer A oncall,support
}
define contact {
    contact_name                     Customer A oncall
    alias                            Customer A oncall
    host_notifications_enabled       1
    service_notifications_enabled    1
    host_notification_period         SHR_24x7
    service_notification_period      SHR_24x7
    host_notification_options        d,r,
    service_notification_options     w,c,r,
    host_notification_commands       notify-host-by-email,notify-host-by-slack,notify-host-by-text
    service_notification_commands    notify-service-by-email,notify-service-by-slack,notify-service-by-text
    can_submit_commands              0
    email                            [email protected]
    address1                         xxxxxxxxxxxx
}
For the service above, the first notification delay seems to be ignored.
For another service with an almost identical config, the first notification delay seems to be respected:

define service {
        host_name                       host_B
        service_description             service B
        use                             template_customer_B
        register                        1
}
Uses template:

define service {
    name                            template_customer_B
    service_description             Customer B CRITITCAL Service through Satellite Nagios
    use                             ALG_generic-service-satnag
    max_check_attempts              1
    flap_detection_enabled          0
    notification_period             SHR_24x7
    contacts                        Jira_24-7
    contact_groups                  CustomerB-oncall
    register                        0
}
All other templates are the same as above.

The contact definitions are also very similar:

define contactgroup {
    contactgroup_name    CustomerB-oncall
    alias                Customer B OnCall
    members              Customer B OnCall,support
}

define contact {
    contact_name                     Customer B OnCall
    alias                            Customer B OnCall
    host_notifications_enabled       1
    service_notifications_enabled    1
    host_notification_period         SHR_24x7
    service_notification_period      SHR_24x7
    host_notification_options        d,r,
    service_notification_options     w,c,r,
    host_notification_commands       notify-host-by-email,notify-host-by-slack,notify-host-by-text
    service_notification_commands    notify-service-by-email,notify-service-by-slack,notify-service-by-text
    can_submit_commands              0
    email                            [email protected]
    address1                         xxxxxxxxxx
}
Nagios XI version: 5.5.3


Any ideas on troubleshooting this issue?
Thanks

Re: First Notification Delay not working on some services

Posted: Fri Sep 07, 2018 7:57 am
by scottwilkerson
The "first notification delay" is timed from the last known OK state, not from the first failure.

So if you normally check at 5-minute intervals, the service would reach a HARD state at

5 + 1 + 1 + 1 = 8
Since you have your "first notification delay" set to 5, it would be sent immediately.
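The reasoning in this reply can be written out as a small check. This is only a sketch of the rule as described here (delay counted from the last OK state), not Nagios code; the function name `notification_held_back` is made up for illustration.

```python
def notification_held_back(minutes_since_last_ok, first_notification_delay):
    """True while the first notification would still be delayed.

    Assumes, per the explanation above, that the delay is counted from the
    last known OK state rather than from the first failed check.
    """
    return minutes_since_last_ok < first_notification_delay


# Figures from the post above: one 5-minute check interval plus three
# 1-minute steps before the HARD state.
minutes_to_hard_state = 5 + 1 + 1 + 1

print(minutes_to_hard_state)                             # 8
print(notification_held_back(minutes_to_hard_state, 5))  # False
print(notification_held_back(3, 5))                      # True
```

With 8 minutes already elapsed since the last OK result, a 5-minute delay has expired by the time the HARD state is reached, so nothing is held back.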

Re: First Notification Delay not working on some services

Posted: Fri Sep 07, 2018 8:55 am
by edwardvanhaute
Hi Scott,

These services are received from a satellite Nagios. We have it set up so that the first SOFT state on the satellite Nagios is immediately HARD in our central Nagios XI. They should all have a check_interval of 5 minutes (normally).
Following your logic, this should indeed never give a delay. However, this has worked for us successfully in the past.
Was the behaviour changed in recent Nagios versions?

Also, the weird thing is what we now see for service B on a satellite Nagios:

[Sat Sep  1 03:09:22 2018] SERVICE ALERT: host_B;service_B;WARNING;SOFT;1;Warning output B
[Sat Sep  1 03:10:21 2018] SERVICE ALERT: host_B;service_B;WARNING;SOFT;2;Warning output B
[Sat Sep  1 03:11:22 2018] SERVICE ALERT: host_B;service_B;WARNING;HARD;3;Warning output B
We received the Warning state in central Nagios XI at 03:09:23 (HARD warning). This service is set to check every 5 minutes, so this should mean an immediate notification sent out.

However the first notification was only sent out at 03:16:22.



For our service A (on another satellite Nagios):

[Fri Sep  7 11:45:32 2018] SERVICE ALERT: host_A;service_A;CRITICAL;SOFT;1;Critical output A
[Fri Sep  7 11:46:30 2018] SERVICE ALERT: host_A;service_A;CRITICAL;SOFT;2;Critical output A
[Fri Sep  7 11:47:32 2018] SERVICE ALERT: host_A;service_A;CRITICAL;HARD;3;Critical output A
The Critical state was received in central Nagios XI at 11:45:33 (HARD critical). Here a notification was sent out without delay, which is what your explanation predicts.
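For comparing timings like these across services, the alert lines can be parsed by script. The sketch below assumes the line layout shown in the excerpts above (bracketed human-readable timestamp, then `SERVICE ALERT:` and semicolon-separated fields) and an English locale for the day and month names; `parse_alert` is a made-up helper name.

```python
import re
from datetime import datetime

ALERT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]*);(?P<service>[^;]*);(?P<state>[^;]*);"
    r"(?P<state_type>[^;]*);(?P<attempt>\d+);(?P<output>.*)$"
)


def parse_alert(line):
    """Parse one SERVICE ALERT line into a dict, or return None."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    alert = match.groupdict()
    # Timestamps in the excerpts look like "Sat Sep  1 03:09:22 2018".
    alert["ts"] = datetime.strptime(alert["ts"], "%a %b %d %H:%M:%S %Y")
    alert["attempt"] = int(alert["attempt"])
    return alert


satellite_log = """\
[Sat Sep  1 03:09:22 2018] SERVICE ALERT: host_B;service_B;WARNING;SOFT;1;Warning output B
[Sat Sep  1 03:10:21 2018] SERVICE ALERT: host_B;service_B;WARNING;SOFT;2;Warning output B
[Sat Sep  1 03:11:22 2018] SERVICE ALERT: host_B;service_B;WARNING;HARD;3;Warning output B
"""

alerts = [parse_alert(line) for line in satellite_log.splitlines()]
soft_to_hard = alerts[-1]["ts"] - alerts[0]["ts"]
print(soft_to_hard)  # 0:02:00

# Central server, service B: HARD state received at 03:09:23,
# first notification sent at 03:16:22.
received = datetime(2018, 9, 1, 3, 9, 23)
notified = datetime(2018, 9, 1, 3, 16, 22)
print(notified - received)  # 0:06:59
```

For service B the notification went out 6 minutes 59 seconds after the central server received the HARD state, whereas for service A it went out immediately.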


Any thoughts on what could be the difference between these two services?
As far as I can make out there is no difference?

Re: First Notification Delay not working on some services

Posted: Fri Sep 07, 2018 9:49 am
by scottwilkerson
It is hard to tell whether there is a difference without seeing the actual configurations from objects.cache.

That said, were either of the hosts or any dependencies down during this time?

Re: First Notification Delay not working on some services

Posted: Fri Sep 07, 2018 10:06 am
by edwardvanhaute
Actual config from objects.cache:

define service {
        host_name       host_B
        service_description     service B
        check_period    SHR_24x7
        check_command   check_dummy!0!"No data received yet."
        contacts        Jira_24-7
        contact_groups  customerB-oncall
        notification_period     SHR_24x7
        initial_state   o
        importance      0
        check_interval  5.000000
        retry_interval  1.000000
        max_check_attempts      1
        is_volatile     0
        parallelize_check       1
        active_checks_enabled   0
        passive_checks_enabled  1
        obsess  1
        event_handler_enabled   1
        low_flap_threshold      0.000000
        high_flap_threshold     0.000000
        flap_detection_enabled  0
        flap_detection_options  a
        freshness_threshold     0
        check_freshness 0
        notification_options    r,w,u,c
        notifications_enabled   1
        notification_interval   120.000000
        first_notification_delay        5.000000
        stalking_options        n
        process_perf_data       1
        retain_status_information       1
        retain_nonstatus_information    1
        }


define service {
        host_name       host_A
        service_description     Service A
        check_period    SHR_24x7
        check_command   check_dummy!0!"No data received yet."
        contacts        Jira_24-7
        contact_groups  customerA-oncall
        notification_period     SHR_24x7
        initial_state   o
        importance      0
        check_interval  5.000000
        retry_interval  1.000000
        max_check_attempts      1
        is_volatile     0
        parallelize_check       1
        active_checks_enabled   0
        passive_checks_enabled  1
        obsess  1
        event_handler_enabled   1
        low_flap_threshold      0.000000
        high_flap_threshold     0.000000
        flap_detection_enabled  1
        flap_detection_options  a
        freshness_threshold     0
        check_freshness 0
        notification_options    r,w,u,c
        notifications_enabled   1
        notification_interval   120.000000
        first_notification_delay        5.000000
        stalking_options        n
        process_perf_data       1
        retain_status_information       1
        retain_nonstatus_information    1
        }
None of the hosts were down, and no dependencies are defined.

Only difference I can make out is flap detection is disabled for one of the services.
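To make that comparison by script rather than by eye, the two blocks can be parsed and diffed. This is a minimal sketch assuming the objects.cache layout shown above (one directive per line, directive and value separated by whitespace); `parse_definition` and `diff_definitions` are made-up helper names, and the sample blocks are shortened to a few directives.

```python
def parse_definition(text):
    """Turn one 'define service { ... }' block into a dict of directives."""
    directives = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("define") or line == "}":
            continue
        # Directive name first, then the value (which may contain spaces).
        parts = line.split(None, 1)
        directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return directives


def diff_definitions(left, right):
    """Return {directive: (left_value, right_value)} where the two differ."""
    keys = sorted(set(left) | set(right))
    return {
        key: (left.get(key), right.get(key))
        for key in keys
        if left.get(key) != right.get(key)
    }


service_b = parse_definition("""
define service {
        host_name       host_B
        max_check_attempts      1
        flap_detection_enabled  0
        first_notification_delay        5.000000
        }
""")
service_a = parse_definition("""
define service {
        host_name       host_A
        max_check_attempts      1
        flap_detection_enabled  1
        first_notification_delay        5.000000
        }
""")

differences = diff_definitions(service_b, service_a)
for directive, (b_value, a_value) in differences.items():
    print(f"{directive}: {b_value} != {a_value}")
```

Run against the full blocks, this would also list the expected differences in service_description and contact_groups.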

Re: First Notification Delay not working on some services

Posted: Fri Sep 07, 2018 11:21 am
by scottwilkerson
edwardvanhaute wrote: Only difference I can make out is flap detection is disabled for one of the services.
That's all I see too. I guess it potentially could have an impact, but you should see the flapping entries in the nagios.log if flapping is being detected.
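One way to look for those entries is to filter the log for flapping alerts on the affected service. The sketch below assumes flapping is logged as `SERVICE FLAPPING ALERT: host;service;...`; the helper name `flapping_entries` is made up, and the sample lines are invented for illustration, not taken from the actual log.

```python
def flapping_entries(lines, host, service):
    """Yield log lines that record flapping start/stop for one service."""
    needle = f"SERVICE FLAPPING ALERT: {host};{service};"
    for line in lines:
        if needle in line:
            yield line.rstrip("\n")


# Invented sample lines, for illustration only.
sample_log = [
    "[Fri Sep  7 11:45:32 2018] SERVICE ALERT: host_A;service_A;CRITICAL;SOFT;1;Critical output A\n",
    "[Fri Sep  7 11:45:33 2018] SERVICE FLAPPING ALERT: host_A;service_A;STARTED; sample flapping text\n",
    "[Fri Sep  7 11:45:40 2018] SERVICE FLAPPING ALERT: host_B;service_B;STARTED; sample flapping text\n",
]

matches = list(flapping_entries(sample_log, "host_A", "service_A"))
for entry in matches:
    print(entry)
```

Against a real installation, the same function can be given an open file handle for nagios.log instead of the sample list; the location of that file depends on the installation.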