Notification for certain services sporadically not working
Notification for certain services sporadically not working
Hello,
we use Nagios Core 4.0.8rc1 on Debian Linux. Everything is working fine except for one strange thing: Sometimes, we are missing notification e-mails of a service, mostly when several notifications in a short time period arrive. The error happens very sporadically, so it's hard to grasp.
Today, we had another incident:
- At 06:52, there was a new error in an error log which is supervised via Nagios. The Notification was sent as expected and at 6:57, the service returned in the OK state, also as expected and we received a notification.
- At 07:42, another error occurred and this shows up in the Nagios Alert history of the service, but not in the Notifications log.
We also have enabled the Nagios debug log:
...
[Thu Jan 22 06:57:02 2015.232885] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 0, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 06:57:02 2015.232914] [032.0] [pid=14821] Notification viability test passed.
...
[Thu Jan 22 07:42:01 2015.526826] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 2, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 07:42:01 2015.526849] [032.1] [pid=14821] Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet
[Thu Jan 22 07:42:01 2015.526857] [032.0] [pid=14821] Notification viability test failed. No notification will be sent out.
...
There are two things which are puzzling me:
1. The log message "Not enough time has elapsed since..." points to the parameter first_notification_delay, which would be set to a high enough value so that we don't receive a notification for the second error. But this parameter is set to 0 in the Nagios configs and also shows up as 0.0000 in the objects cache. So there's no reason not to send this notification.
2. The last notification time stamp is Thu Jan 1 01:00:00 1970. Our server is running on GMT+1, so that would explain the hour 01. But there were sent out notifications since the last program start (the last one was at 06:57 with state change 2 -> 0, the one before was at 06:52 with state change 0 -> 2), so why is this value still set to 1970?
Here are the corresponding host and service configurations from the objects cache:
define host {
host_name cs-9998
alias adnymics - Test-Server (Scanner)
address 9998
parents Internet
check_period 24x7
check_command check-ssh-tunnel
contact_groups admins,helpdesk
notification_period non-workhours
initial_state o
importance 0
check_interval 5.000000
retry_interval 1.000000
max_check_attempts 10
active_checks_enabled 1
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 0
check_freshness 0
notification_options r,d,u
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
icon_image server.png
icon_image_alt CustomerServer
statusmap_image server.png
action_url /nagiosgraph/cgi-bin/showhost.cgi?host=$HOSTNAME$
retain_status_information 1
retain_nonstatus_information 1
}
define service {
host_name cs-9998
service_description ErrorLog
check_period 24x7
check_command check_dummy!3!"Service is stale"
contact_groups admins,helpdesk
notification_period non-workhours
initial_state o
importance 0
check_interval 10.000000
retry_interval 2.000000
max_check_attempts 1
is_volatile 0
parallelize_check 1
active_checks_enabled 0
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 1050
check_freshness 1
notification_options r,w,u,c
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
action_url /nagiosgraph/cgi-bin/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$
retain_status_information 1
retain_nonstatus_information 1
}
As I said, normally all is working fine, so I suppose this is not a configuration problem. Is there anyone who can help?
we use Nagios Core 4.0.8rc1 on Debian Linux. Everything is working fine except for one strange thing: Sometimes, we are missing notification e-mails of a service, mostly when several notifications in a short time period arrive. The error happens very sporadically, so it's hard to grasp.
Today, we had another incident:
- At 06:52, there was a new error in an error log which is supervised via Nagios. The Notification was sent as expected and at 6:57, the service returned in the OK state, also as expected and we received a notification.
- At 07:42, another error occurred and this shows up in the Nagios Alert history of the service, but not in the Notifications log.
We also have enabled the Nagios debug log:
...
[Thu Jan 22 06:57:02 2015.232885] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 0, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 06:57:02 2015.232914] [032.0] [pid=14821] Notification viability test passed.
...
[Thu Jan 22 07:42:01 2015.526826] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 2, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 07:42:01 2015.526849] [032.1] [pid=14821] Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet
[Thu Jan 22 07:42:01 2015.526857] [032.0] [pid=14821] Notification viability test failed. No notification will be sent out.
...
There are two things which are puzzling me:
1. The log message "Not enough time has elapsed since..." points to the parameter first_notification_delay, which would be set to a high enough value so that we don't receive a notification for the second error. But this parameter is set to 0 in the Nagios configs and also shows up as 0.0000 in the objects cache. So there's no reason not to send this notification.
2. The last notification time stamp is Thu Jan 1 01:00:00 1970. Our server is running on GMT+1, so that would explain the hour 01. But there were sent out notifications since the last program start (the last one was at 06:57 with state change 2 -> 0, the one before was at 06:52 with state change 0 -> 2), so why is this value still set to 1970?
Here are the corresponding host and service configurations from the objects cache:
define host {
host_name cs-9998
alias adnymics - Test-Server (Scanner)
address 9998
parents Internet
check_period 24x7
check_command check-ssh-tunnel
contact_groups admins,helpdesk
notification_period non-workhours
initial_state o
importance 0
check_interval 5.000000
retry_interval 1.000000
max_check_attempts 10
active_checks_enabled 1
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 0
check_freshness 0
notification_options r,d,u
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
icon_image server.png
icon_image_alt CustomerServer
statusmap_image server.png
action_url /nagiosgraph/cgi-bin/showhost.cgi?host=$HOSTNAME$
retain_status_information 1
retain_nonstatus_information 1
}
define service {
host_name cs-9998
service_description ErrorLog
check_period 24x7
check_command check_dummy!3!"Service is stale"
contact_groups admins,helpdesk
notification_period non-workhours
initial_state o
importance 0
check_interval 10.000000
retry_interval 2.000000
max_check_attempts 1
is_volatile 0
parallelize_check 1
active_checks_enabled 0
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 1050
check_freshness 1
notification_options r,w,u,c
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
action_url /nagiosgraph/cgi-bin/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$
retain_status_information 1
retain_nonstatus_information 1
}
As I said, normally all is working fine, so I suppose this is not a configuration problem. Is there anyone who can help?
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Notification for certain services sporadically not worki
By chance is the host down when you aren't receiving notifications for the service?
This would be expected behavior.
Or do you have any dependencies or parents setup for the host these services are on?
If your host has parents and the parent is down, you would not receive notifications.
This would be expected behavior.
Or do you have any dependencies or parents setup for the host these services are on?
If your host has parents and the parent is down, you would not receive notifications.
Re: Notification for certain services sporadically not worki
Thanks for your answer.
Host and parents are up all the time when the problem happened and there are no service dependencies.
Host and parents are up all the time when the problem happened and there are no service dependencies.
-
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Notification for certain services sporadically not worki
Is it that you are sporadically not seeing those service notifications what so ever? Even in the notifications list? Or in the logs? Or are you just not reviving them on your end?
Re: Notification for certain services sporadically not worki
If the notification is missing, we don't see it anywhere. We see the alert in the alert history but no notification in the notifications list and we get no e-mail. In the debug log, we see the entry I posted above:
This is strange, because we have set the first_notification_delay to 0 and this debug message only appears when this parameter is set to some value > 0. We also had a look in the source code and verified this. There's only one place where the message "Not enough time has elapsed since the service..." is logged and this is after the check for the last notification of this service and of first_notification_delay.[Thu Jan 22 07:42:01 2015.526826] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 2, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 07:42:01 2015.526849] [032.1] [pid=14821] Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet
[Thu Jan 22 07:42:01 2015.526857] [032.0] [pid=14821] Notification viability test failed. No notification will be sent out.
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Notification for certain services sporadically not worki
Can we set the first_notification_delay to 1 or at least just 0 not 0.0000, I'm not positive without digging into the code, but it is possible that this isn't a float value.MacDschie wrote:If the notification is missing, we don't see it anywhere. We see the alert in the alert history but no notification in the notifications list and we get no e-mail. In the debug log, we see the entry I posted above:
This is strange, because we have set the first_notification_delay to 0 and this debug message only appears when this parameter is set to some value > 0. We also had a look in the source code and verified this. There's only one place where the message "Not enough time has elapsed since the service..." is logged and this is after the check for the last notification of this service and of first_notification_delay.[Thu Jan 22 07:42:01 2015.526826] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 2, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 07:42:01 2015.526849] [032.1] [pid=14821] Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet
[Thu Jan 22 07:42:01 2015.526857] [032.0] [pid=14821] Notification viability test failed. No notification will be sent out.
Re: Notification for certain services sporadically not worki
In the config, the parameter is set to 0 (not to 0.0000):
The text I posted above is taken out of the objects cache (/usr/local/nagios/var/objects.cache). This file displays several other parameters and intervals as float, too. I have no clue if this has any impact on the functionality. If I set the parameter to 1 in the config, it is displayed as 1.0000 in the objects.cache.
Code: Select all
define service {
name generic-service
register 0
action_url /nagiosgraph/cgi-bin/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$
max_check_attempts 1
check_interval 10
retry_interval 2
--> first_notification_delay 0 <---
notification_interval 0
notification_options w,u,c,r
active_checks_enabled 1
passive_checks_enabled 1
notifications_enabled 1
check_freshness 0
check_period 24x7
contact_groups helpdesk
}
define service {
name passive-service
register 0
active_checks_enabled 0
passive_checks_enabled 1
check_freshness 1
freshness_threshold 1050
use generic-service
}
define service {
service_description ErrorLog
check_command check_dummy!3!"Service is stale"
hostgroup_name customer-servers
servicegroups customer-server-services
use passive-service
contact_groups +admins
}
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Notification for certain services sporadically not worki
Got, i, I thought you were sharing your config earlier.
Can we set it to 1 for now, it will not actually delay the message as the time is calculated from the 1st non-ok state.
Honestly I've never come across another person experiencing this before
Can we set it to 1 for now, it will not actually delay the message as the time is calculated from the 1st non-ok state.
Honestly I've never come across another person experiencing this before
Re: Notification for certain services sporadically not worki
Hi scottwilkerson,
we did set first_notification_delay to 1, but this worsened the problem. We had some errors which occurred repeatedly in an interval of roughly 10 minutes and we got no mail notifications at all for days. The debug log always showed the same messages:
In every single incident, the Last Notification time was set to Thu Jan 1 01:00:00 1970. But this service has produced notifications before. So there should be some timestamp except for the epoch, shouldn't it?
we did set first_notification_delay to 1, but this worsened the problem. We had some errors which occurred repeatedly in an interval of roughly 10 minutes and we got no mail notifications at all for days. The debug log always showed the same messages:
Code: Select all
[Fri Feb 13 16:01:14 2015.723796] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'WC7855-10002', Service: 'PrinterAlerts', Type: NORMAL, Options: 0, Current State: 2, Last Notification: Thu Jan 1 01:00:00 1970
[Fri Feb 13 16:01:14 2015.723835] [032.1] [pid=14821] Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet
[Fri Feb 13 16:01:14 2015.723843] [032.0] [pid=14821] Notification viability test failed. No notification will be sent out.
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Notification for certain services sporadically not worki
does your nagios.cfg have?
Code: Select all
retain_state_information=1