I'm having trouble figuring out why my notification delay isn't working the way I expect.
Currently I have HTTP content monitors as services under a host.
Under alert settings, the hosts have "notifications enabled" set to "on" and the appropriate contact groups assigned.
Also, first notification delay is set to 13 minutes.
Each service has "notifications enabled" set to "skip".
The content monitors will re-check every 3 minutes on fail for 3 attempts, and upon entering a hard state fire off a service restart event handler.
However, I'm still getting the alert that it's entering a hard "critical" state, and it's not waiting out the 13 minutes to give the event handler time to run.
Should I have the alert settings set up differently to avoid receiving these premature alerts?
Thanks!
Alert/notification delay issues
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Alert/notification delay issues
The "first notification delay" is timed from the last known OK state, not from the first failure.
So if you normally check on 5 minute intervals, the service would reach a HARD state at
Code: Select all
5 + 3 + 3 + 3 = 14
Since you have your "first notification delay" set to 13, the notification would be sent immediately.
If you want the notification to go out 13 minutes after it goes into a hard state, set it to 27.
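For anyone who wants to plug in their own intervals, here is a minimal Python sketch of the arithmetic above. It only mirrors the 5 + 3 + 3 + 3 reasoning from this reply and is not taken from Nagios itself; the function and parameter names are made up for illustration.

```python
def minutes_until_hard_state(check_interval, retry_interval, retries):
    """Minutes from the last known OK check until the HARD state,
    using the arithmetic from this thread: one normal check interval
    up to the first failure, then one retry interval per retry."""
    return check_interval + retry_interval * retries


def delay_for_grace_period(check_interval, retry_interval, retries, grace):
    """First notification delay (in minutes) that leaves `grace` minutes
    between the HARD state and the first notification."""
    return minutes_until_hard_state(check_interval, retry_interval, retries) + grace


# Values from this thread: 5 min checks, 3 min retries, 3 retries.
print(minutes_until_hard_state(5, 3, 3))          # 14
print(delay_for_grace_period(5, 3, 3, grace=13))  # 27
```

With a delay of 13 the HARD state at minute 14 is already past the delay, which matches the notification going out immediately.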
Re: Alert/notification delay issues
Revisiting this, as I'm having issues again.
Currently set up like this:
Service Check Settings:
Check every 5 min
Retry interval 3 min
Retries 3 times
So that's 14 min
Upon entering a hard state it kicks off an event handler to restart the service with a 2 min delay between service stop and start.
So a total of 16 minutes should pass between last known good state and when the service comes back up.
As far as ALERT settings, I have these set on each HOST,
First Notification delay set to 20 minutes just to be sure.
And the SERVICE alert settings are just set to "skip" so that they follow the settings of the host they reside on.
Now I'm pretty confident that this WAS working correctly for a while. However, I've started to get alerts that there is a problem with services, then a few minutes later that they've recovered. So it's obvious that the event handler IS working, but the alerts are not abiding by my first notification delay setting.
Any input into why this may be?
Thanks!
Re: Alert/notification delay issues
Actually, it would be 21 minutes, because the check is performed every 5 minutes, so the last known UP time is 5 minutes BEFORE the first failure.
Basically, you need to add 5 minutes to your calculation...
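As a rough sanity check, the numbers in this thread can be compared in a few lines of Python. This is only a sketch of the accounting described here (the 16 minute estimate plus one more check interval), not a model of the actual Nagios scheduler, and all names are hypothetical.

```python
def estimated_recovery_minute(check_interval, retry_interval, retries, handler_delay):
    """Minutes from the last known UP time until the service is back,
    per this thread: the original 16 minute estimate, plus the extra
    check interval mentioned in the reply."""
    original_estimate = check_interval + retry_interval * retries + handler_delay
    return original_estimate + check_interval


def notifies_before_recovery(first_notification_delay, recovery_minute):
    """True if the delay expires before the estimated recovery time."""
    return first_notification_delay < recovery_minute


# Values from this thread: 5 min checks, 3 min retries x 3, 2 min restart delay.
recovery = estimated_recovery_minute(5, 3, 3, handler_delay=2)
print(recovery)                                # 21
print(notifies_before_recovery(20, recovery))  # True
print(notifies_before_recovery(25, recovery))  # False
```

Under these assumptions a 20 minute delay runs out one minute before the service is back, so a somewhat larger value (25 here is just an example) would be needed to suppress the alert.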