Notification for certain services sporadically not working
Posted: Thu Jan 22, 2015 5:49 am
Hello,
We use Nagios Core 4.0.8rc1 on Debian Linux. Everything works fine except for one strange thing: sometimes notification e-mails for a service are missing, mostly when several notifications are triggered within a short time. The problem occurs only sporadically, so it is hard to pin down.
Today, we had another incident:
- At 06:52, a new error appeared in an error log that is monitored via Nagios. The notification was sent as expected. At 06:57, the service returned to the OK state, also as expected, and we received a notification for that as well.
- At 07:42, another error occurred. It shows up in the Nagios alert history of the service, but not in the notifications log.
We have also enabled the Nagios debug log:
...
[Thu Jan 22 06:57:02 2015.232885] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 0, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 06:57:02 2015.232914] [032.0] [pid=14821] Notification viability test passed.
...
[Thu Jan 22 07:42:01 2015.526826] [032.0] [pid=14821] ** Service Notification Attempt ** Host: 'cs-9998', Service: 'ErrorLog', Type: NORMAL, Options: 0, Current State: 2, Last Notification: Thu Jan 1 01:00:00 1970
[Thu Jan 22 07:42:01 2015.526849] [032.1] [pid=14821] Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet
[Thu Jan 22 07:42:01 2015.526857] [032.0] [pid=14821] Notification viability test failed. No notification will be sent out.
...
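To go through a longer debug log for attempts like these, a small script can help. The following is only a sketch in Python, and it assumes the debug lines look exactly like the excerpt above; the function name summarize_attempts and the regular expressions are made up for this post and are not part of Nagios. It also assumes the lines of one attempt are not interleaved with lines from other attempts.

```python
import re

# Matches the "Service Notification Attempt" lines as they appear in the excerpt.
ATTEMPT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[\d+\.\d+\] \[pid=\d+\] "
    r"\*\* Service Notification Attempt \*\* "
    r"Host: '(?P<host>[^']*)', Service: '(?P<service>[^']*)', "
    r"Type: (?P<type>\w+), Options: \d+, "
    r"Current State: (?P<state>\d+), "
    r"Last Notification: (?P<last>.+)$"
)
# Matches the final verdict of the viability test.
RESULT_RE = re.compile(r"Notification viability test (passed|failed)")
# Extracts the message text of any other debug line.
MESSAGE_RE = re.compile(r"^\[[^\]]+\] \[\d+\.\d+\] \[pid=\d+\] (.*)$")


def summarize_attempts(lines):
    """Return one dict per notification attempt, including the verdict and
    the debug messages logged between the attempt and the verdict."""
    attempts = []
    current = None
    for line in lines:
        line = line.rstrip("\n")
        match = ATTEMPT_RE.match(line)
        if match:
            current = dict(match.groupdict(), result=None, reasons=[])
            attempts.append(current)
            continue
        # Ignore lines that do not belong to an open attempt.
        if current is None or current["result"] is not None:
            continue
        verdict = RESULT_RE.search(line)
        if verdict:
            current["result"] = verdict.group(1)
            continue
        message = MESSAGE_RE.match(line)
        if message:
            current["reasons"].append(message.group(1))
    return attempts
```

Called with the lines of the debug log, it returns a list in which every failed attempt carries the messages that preceded the "failed" verdict, which makes it easier to see which check rejected the notification.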
Two things puzzle me:
1. The log message "Not enough time has elapsed since..." points to the parameter first_notification_delay, which would have to be set to a value high enough to suppress the notification for the second error. But this parameter is set to 0 in the Nagios configuration and also shows up as 0.000000 in the objects cache. So there is no reason not to send this notification.
2. The last notification timestamp is Thu Jan 1 01:00:00 1970. Our server runs on GMT+1, which would explain the hour 01. But notifications have been sent since the last program start (the last one at 06:57 for the state change 2 -> 0, the one before at 06:52 for the state change 0 -> 2), so why is this value still set to 1970?
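To make the two points above concrete, here is a rough sketch in Python. It is a simplified model of how I understand first_notification_delay, not the actual Nagios code: the function names, the parameters, and the choice of reference time (problem_start) are assumptions made for illustration. The second helper only shows why a timestamp of 0 is printed with the hour 01 on a GMT+1 server.

```python
from datetime import datetime, timedelta, timezone

# Seconds per time unit; 60 is the default value of interval_length.
INTERVAL_LENGTH = 60


def delay_blocks_notification(now, problem_start, first_notification_delay,
                              interval_length=INTERVAL_LENGTH):
    """Simplified model (an assumption, not Nagios source): the first
    notification is held back while 'now' is earlier than
    problem_start + first_notification_delay time units."""
    earliest = problem_start + first_notification_delay * interval_length
    return now < earliest


def epoch_in_offset(seconds, utc_offset_hours):
    """Return a Unix timestamp as a datetime in a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(seconds, tz)


if __name__ == "__main__":
    now = 1421908921  # arbitrary example timestamp
    # Delay 0 with a reference time that is not in the future: not blocked.
    print(delay_blocks_notification(now, now, 0.0))
    # Delay 0 with a reference time 5 seconds in the future: blocked.
    print(delay_blocks_notification(now, now + 5, 0.0))
    # Timestamp 0 rendered in GMT+1.
    print(epoch_in_offset(0, 1).isoformat())
```

In this model a delay of 0 can only hold back a notification if the reference time lies after the current time. Whether something like that actually happens inside Nagios in our case is exactly what I cannot tell from the debug log.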
Here are the corresponding host and service configurations from the objects cache:
define host {
host_name cs-9998
alias adnymics - Test-Server (Scanner)
address 9998
parents Internet
check_period 24x7
check_command check-ssh-tunnel
contact_groups admins,helpdesk
notification_period non-workhours
initial_state o
importance 0
check_interval 5.000000
retry_interval 1.000000
max_check_attempts 10
active_checks_enabled 1
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 0
check_freshness 0
notification_options r,d,u
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
icon_image server.png
icon_image_alt CustomerServer
statusmap_image server.png
action_url /nagiosgraph/cgi-bin/showhost.cgi?host=$HOSTNAME$
retain_status_information 1
retain_nonstatus_information 1
}
define service {
host_name cs-9998
service_description ErrorLog
check_period 24x7
check_command check_dummy!3!"Service is stale"
contact_groups admins,helpdesk
notification_period non-workhours
initial_state o
importance 0
check_interval 10.000000
retry_interval 2.000000
max_check_attempts 1
is_volatile 0
parallelize_check 1
active_checks_enabled 0
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 1050
check_freshness 1
notification_options r,w,u,c
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
action_url /nagiosgraph/cgi-bin/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$
retain_status_information 1
retain_nonstatus_information 1
}
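To double-check values such as first_notification_delay without reading the whole objects cache by eye, something like the following could be used. It is a minimal sketch in Python that assumes the cache consists only of "define <type> {" blocks with one directive per line, as in the excerpt above; parse_object_cache and find_service are hypothetical helper names, not Nagios tools.

```python
def parse_object_cache(text):
    """Parse 'define <type> { ... }' blocks into (type, directives) pairs,
    where directives maps each directive name to its raw string value."""
    objects = []
    current_type, current = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("define ") and line.endswith("{"):
            current_type = line[len("define "):-1].strip()
            current = {}
        elif line == "}" and current is not None:
            objects.append((current_type, current))
            current_type, current = None, None
        elif current is not None:
            # The directive name ends at the first run of whitespace.
            parts = line.split(None, 1)
            current[parts[0]] = parts[1] if len(parts) > 1 else ""
    return objects


def find_service(objects, host_name, service_description):
    """Return the directives of the matching service, or None."""
    for obj_type, directives in objects:
        if (obj_type == "service"
                and directives.get("host_name") == host_name
                and directives.get("service_description") == service_description):
            return directives
    return None
```

With the cache contents read into a string, find_service(parse_object_cache(text), "cs-9998", "ErrorLog") returns the directives of the service above, so float(service["first_notification_delay"]) can be compared against 0 directly.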
As I said, everything normally works fine, so I suppose this is not a configuration problem. Can anyone help?