First notification delay

Phil7269 · Post by **Phil7269** » Thu May 14, 2015 12:02 pm

Until recently I would receive my first notification from Nagios within moments of a monitored service going down. Today I am finding that my first notification is sent 5 minutes after the service goes down. Nothing has changed in the Nagios configuration that would account for the delay increasing to 5 minutes.

What do I need to check to determine why the delay jumped from 1 minute to 5 minutes? I've included a copy of the define host and the define service I am using:

define host{
        name                    windows-server          ; The name of this host template
        use                     generic-host            ; Inherit default values from the generic-host template
        check_period            24x7                    ; By default, Windows servers are monitored round the clock
        check_interval          5                       ; Actively check the server every 5 minutes
        retry_interval          1                       ; Schedule host check retries at 1 minute intervals
        max_check_attempts      10                      ; Check each server 10 times (max)
        check_command           check-host-alive        ; Default command to check if servers are "alive"
        notification_period     24x7                    ; Send notification out at any time - day or night
        notification_interval   60                      ; Resend notifications every 60 minutes
        notification_options    d,r                     ; Only send notifications for specific host states
        contact_groups          admins                  ; Notifications get sent to the admins by default
        hostgroups              windows-servers         ; Host groups that Windows servers should be a member of
        register                0                       ; DONT REGISTER THIS - ITS JUST A TEMPLATE
        }

define service{
        name                            generic-service         ; The 'name' of this service template
        active_checks_enabled           1                       ; Active service checks are enabled
        passive_checks_enabled          1                       ; Passive service checks are enabled/accepted
        parallelize_check               1                       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1                       ; We should obsess over this service (if necessary)
        check_freshness                 0                       ; Default is to NOT check service 'freshness'
        notifications_enabled           1                       ; Service notifications are enabled
        event_handler_enabled           1                       ; Service event handler is enabled
        flap_detection_enabled          1                       ; Flap detection is enabled
        #failure_prediction_enabled     0                       ; Failure prediction is enabled
        process_perf_data               1                       ; Process performance data
        retain_status_information       1                       ; Retain status information across program restarts
        retain_nonstatus_information    1                       ; Retain non-status information across program restarts
        is_volatile                     0                       ; The service is not volatile
        check_period                    24x7                    ; The service can be checked at any time of the day
        max_check_attempts              3                       ; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           1                       ; Check the service every minute under normal conditions
        retry_check_interval            2                       ; Re-check the service every two minutes until a hard state can be determined
        contact_groups                  admins                  ; Notifications get sent out to everyone in the 'admins' group
        notification_options            w,u,c,r                 ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           60                      ; Re-notify about service problems every hour
        notification_period             24x7                    ; Notifications can be sent out at any time
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

jdalrymple · Post by **jdalrymple** » Thu May 14, 2015 12:19 pm

Phil7269 wrote: Until recently I would receive my first notification from Nagios within moments of a monitored service going down. Today I am finding that my first notification is sent 5 minutes after the service goes down. Nothing has changed in the Nagios configuration that would account for the delay increasing to 5 minutes.

What do I need to check to determine why the delay jumped from 1 minute to 5 minutes? I've included a copy of the define host and the define service I am using:
define service{
...
        max_check_attempts              3			; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           1			; Check the service every minute under normal conditions
        retry_check_interval            2			; Re-check the service every two minutes until a hard state can be determined
...
        }

By all rights it should be 4 minutes, not 5. If you simulate a failure, it may be up to 5 minutes (or thereabouts) from the time the service actually fails, because of the additional (up to) 1 minute that can pass before the first failed check puts the service into a SOFT state.

The service enters a SOFT state upon the first failure, then is retried 2 minutes later, then 2 minutes later again. If it hasn't recovered by then, it should notify.
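The arithmetic above can be sketched in a few lines of Python. This is only a simplified model of the timing described in this post, not how the Nagios scheduler is actually implemented: the function name `first_notification_delay` and its parameters are made up for illustration, every check is assumed to fail once the service is down, and intervals are assumed to be in minutes.

```python
def first_notification_delay(normal_check_interval, retry_check_interval, max_check_attempts):
    """Return (best_case, worst_case) minutes from actual failure to the HARD state.

    Simplified model: the service fails at some point between two regular
    checks, the next regular check is attempt 1 (SOFT state), and each
    remaining attempt follows one retry interval later.
    """
    # Time spent retrying after the first failed check.
    retry_time = (max_check_attempts - 1) * retry_check_interval
    # Best case: the service fails just before a regular check runs.
    best_case = retry_time
    # Worst case: the service fails just after a regular check passed.
    worst_case = normal_check_interval + retry_time
    return best_case, worst_case


# Values from the generic-service template in this thread.
best, worst = first_notification_delay(
    normal_check_interval=1, retry_check_interval=2, max_check_attempts=3
)
print(f"first notification after {best} to {worst} minutes")
```

Under the same assumptions, setting retry_check_interval to 1 instead of 2 would bring the range down to 2 to 3 minutes.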

FYI, it's atypical to have a normal_check_interval be smaller than a retry_check_interval.

Phil7269 · Post by **Phil7269** » Thu May 14, 2015 12:27 pm

Thanks for your input. Then I guess Nagios is performing correctly. It just seems that in the past, meaning 2 days ago, the first notification would arrive a lot faster.

jdalrymple · Post by **jdalrymple** » Thu May 14, 2015 1:14 pm

There are strange circumstances that can cause the retries to come in much quicker, but you typically don't see them except in very very busy environments.

To fully understand, read the core documentation on host checks and service checks, specifically the parts about on-demand checks. On-demand checks DO increment the check counter when in a SOFT state.
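As a rough illustration of why on-demand checks can shorten the delay, the sketch below counts failed check results in time order and reports when the attempt counter would reach max_check_attempts. It is a toy model, not Nagios code: `time_of_hard_state` is a hypothetical helper, the on-demand check time is invented, and scheduled retries are assumed to keep their original times.

```python
def time_of_hard_state(scheduled_checks, on_demand_checks, max_check_attempts):
    """Return the minute at which the attempt counter reaches max_check_attempts.

    Both lists hold the times (in minutes after the first failed check) of
    checks that return a failure. Returns None if there are too few checks.
    """
    # Every failed result counts as an attempt, whatever triggered the check.
    all_checks = sorted(scheduled_checks + on_demand_checks)
    if len(all_checks) < max_check_attempts:
        return None
    return all_checks[max_check_attempts - 1]


# Scheduled attempts at 0, 2 and 4 minutes (retry_check_interval of 2).
scheduled = [0, 2, 4]

print(time_of_hard_state(scheduled, [], 3))     # no on-demand checks
print(time_of_hard_state(scheduled, [0.5], 3))  # one hypothetical on-demand check
```

In this model a single extra on-demand failure moves the third attempt from minute 4 to minute 2, which is the kind of shortcut described above.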

In addition, the scheduler doesn't run in a totally rigid fashion. Checks can be performed a bit early or a bit late depending on how busy the system is at the time, though the difference is usually within seconds.

Nagios Support Forum
