Page 1 of 1

Notification for certain services sporadically not working

Posted: Thu Jan 22, 2015 5:49 am
by MacDschie
Hello,

we use Nagios Core 4.0.8rc1 on Debian Linux. Everything is working fine except for one strange thing: Sometimes, we are missing notification e-mails of a service, mostly when several notifications in a short time period arrive. The error happens very sporadically, so it's hard to grasp.

Today, we had another incident:
- At 06:52, there was a new error in an error log which is supervised via Nagios. The Notification was sent as expected and at 6:57, the service returned in the OK state, also as expected and we received a notification.
- At 07:42, another error occurred and this shows up in the Nagios Alert history of the service, but not in the Notifications log.

We also have enabled the Nagios debug log:

...
[Thu Jan 22 06:57:02 2015.232885] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 0, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 06:57:02 2015.232914] [032.0] [pid=14821] Notification viability test passed.
...
[Thu Jan 22 07:42:01 2015.526826] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 2, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 07:42:01 2015.526849] [032.1] [pid=14821] Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet
[Thu Jan 22 07:42:01 2015.526857] [032.0] [pid=14821] Notification viability test failed. No notification will be sent out.
...

There are two things which are puzzling me:

1. The log message "Not enough time has elapsed since..." points to the parameter first_notification_delay, which would be set to a high enough value so that we don't receive a notification for the second error. But this parameter is set to 0 in the Nagios configs and also shows up as 0.0000 in the objects cache. So there's no reason not to send this notification.
2. The last notification time stamp is Thu Jan 1 01:00:00 1970. Our server is running on GMT+1, so that would explain the hour 01. But there were sent out notifications since the last program start (the last one was at 06:57 with state change 2 -> 0, the one before was at 06:52 with state change 0 -> 2), so why is this value still set to 1970?

Here are the corresponding host and service configurations from the objects cache:

define host {
host_name cs-9998
alias adnymics - Test-Server (Scanner)
address 9998
parents Internet
check_period 24x7
check_command check-ssh-tunnel
contact_groups admins,helpdesk
notification_period non-workhours
initial_state o
importance 0
check_interval 5.000000
retry_interval 1.000000
max_check_attempts 10
active_checks_enabled 1
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 0
check_freshness 0
notification_options r,d,u
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
icon_image server.png
icon_image_alt CustomerServer
statusmap_image server.png
action_url /nagiosgraph/cgi-bin/showhost.cgi?host=$HOSTNAME$
retain_status_information 1
retain_nonstatus_information 1
}

define service {
host_name cs-9998
service_description ErrorLog
check_period 24x7
check_command check_dummy!3!"Service is stale"
contact_groups admins,helpdesk
notification_period non-workhours
initial_state o
importance 0
check_interval 10.000000
retry_interval 2.000000
max_check_attempts 1
is_volatile 0
parallelize_check 1
active_checks_enabled 0
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 1050
check_freshness 1
notification_options r,w,u,c
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
action_url /nagiosgraph/cgi-bin/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$
retain_status_information 1
retain_nonstatus_information 1
}

As I said, normally all is working fine, so I suppose this is not a configuration problem. Is there anyone who can help?

Re: Notification for certain services sporadically not worki

Posted: Fri Jan 23, 2015 12:07 pm
by scottwilkerson
By chance is the host down when you aren't receiving notifications for the service?

This would be expected behavior.

Or do you have any dependencies or parents setup for the host these services are on?

If your host has parents and the parent is down, you would not receive notifications.

Re: Notification for certain services sporadically not worki

Posted: Tue Jan 27, 2015 12:14 pm
by MacDschie
Thanks for your answer.

Host and parents are up all the time when the problem happened and there are no service dependencies.

Re: Notification for certain services sporadically not worki

Posted: Tue Jan 27, 2015 6:02 pm
by slansing
Is it that you are sporadically not seeing those service notifications what so ever? Even in the notifications list? Or in the logs? Or are you just not reviving them on your end?

Re: Notification for certain services sporadically not worki

Posted: Thu Jan 29, 2015 12:03 pm
by MacDschie
If the notification is missing, we don't see it anywhere. We see the alert in the alert history but no notification in the notifications list and we get no e-mail. In the debug log, we see the entry I posted above:
[Thu Jan 22 07:42:01 2015.526826] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 2, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 07:42:01 2015.526849] [032.1] [pid=14821] Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet
[Thu Jan 22 07:42:01 2015.526857] [032.0] [pid=14821] Notification viability test failed. No notification will be sent out.
This is strange, because we have set the first_notification_delay to 0 and this debug message only appears when this parameter is set to some value > 0. We also had a look in the source code and verified this. There's only one place where the message "Not enough time has elapsed since the service..." is logged and this is after the check for the last notification of this service and of first_notification_delay.

Re: Notification for certain services sporadically not worki

Posted: Thu Jan 29, 2015 5:39 pm
by scottwilkerson
MacDschie wrote:If the notification is missing, we don't see it anywhere. We see the alert in the alert history but no notification in the notifications list and we get no e-mail. In the debug log, we see the entry I posted above:
[Thu Jan 22 07:42:01 2015.526826] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 2, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 07:42:01 2015.526849] [032.1] [pid=14821] Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet
[Thu Jan 22 07:42:01 2015.526857] [032.0] [pid=14821] Notification viability test failed. No notification will be sent out.
This is strange, because we have set the first_notification_delay to 0 and this debug message only appears when this parameter is set to some value > 0. We also had a look in the source code and verified this. There's only one place where the message "Not enough time has elapsed since the service..." is logged and this is after the check for the last notification of this service and of first_notification_delay.
Can we set the first_notification_delay to 1 or at least just 0 not 0.0000, I'm not positive without digging into the code, but it is possible that this isn't a float value.

Re: Notification for certain services sporadically not worki

Posted: Fri Jan 30, 2015 10:59 am
by MacDschie
In the config, the parameter is set to 0 (not to 0.0000):

Code: Select all

define service {
                name                                  generic-service
                register                              0
                action_url                            /nagiosgraph/cgi-bin/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$
                max_check_attempts                    1
                check_interval                        10
                retry_interval                        2
        -->     first_notification_delay              0   <---
                notification_interval                 0
                notification_options                  w,u,c,r
                active_checks_enabled                 1
                passive_checks_enabled                1
                notifications_enabled                 1
                check_freshness                       0
                check_period                          24x7
                contact_groups                        helpdesk
}

define service {
                name                                  passive-service
                register                              0
                active_checks_enabled                 0
                passive_checks_enabled                1
                check_freshness                       1
                freshness_threshold                   1050
                use                                   generic-service
}

define service {
                service_description                   ErrorLog
                check_command                         check_dummy!3!"Service is stale"
                hostgroup_name                        customer-servers
                servicegroups                         customer-server-services
                use                                   passive-service
                contact_groups                        +admins
}

The text I posted above is taken out of the objects cache (/usr/local/nagios/var/objects.cache). This file displays several other parameters and intervals as float, too. I have no clue if this has any impact on the functionality. If I set the parameter to 1 in the config, it is displayed as 1.0000 in the objects.cache.

Re: Notification for certain services sporadically not worki

Posted: Fri Jan 30, 2015 3:19 pm
by scottwilkerson
Got, i, I thought you were sharing your config earlier.

Can we set it to 1 for now, it will not actually delay the message as the time is calculated from the 1st non-ok state.

Honestly I've never come across another person experiencing this before

Re: Notification for certain services sporadically not worki

Posted: Fri Feb 13, 2015 1:14 pm
by MacDschie
Hi scottwilkerson,

we did set first_notification_delay to 1, but this worsened the problem. We had some errors which occurred repeatedly in an interval of roughly 10 minutes and we got no mail notifications at all for days. The debug log always showed the same messages:

Code: Select all

[Fri Feb 13 16:01:14 2015.723796] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'WC7855-10002', Service: 'PrinterAlerts', Type: NORMAL, Options: 0, Current State: 2, Last Notification: Thu Jan  1 01:00:00 1970
[Fri Feb 13 16:01:14 2015.723835] [032.1] [pid=14821] Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet
[Fri Feb 13 16:01:14 2015.723843] [032.0] [pid=14821] Notification viability test failed.  No notification will be sent out.
In every single incident, the Last Notification time was set to Thu Jan 1 01:00:00 1970. But this service has produced notifications before. So there should be some timestamp except for the epoch, shouldn't it?

Re: Notification for certain services sporadically not worki

Posted: Sun Feb 15, 2015 1:31 pm
by scottwilkerson
does your nagios.cfg have?

Code: Select all

retain_state_information=1