First, let me apologize: this may have been asked several times, but I am still confused about how to set up this monitoring in just the right way. I have looked in other forums, but they either say "not authorized" or do not quite answer what I need clarified. I have also asked this in the past, but it's still confusing.
Okay, so the question:
I need to quiet monitoring by only notifying when an alert has been in a state other than "OK" for more than x minutes. Example: some cleaning processes that run on a server reset a service, and some servers even have nightly reboots. We get an alert for the service being down almost immediately after it resets, or an alert that the server is rebooting. I don't want any of these alerts. What I need is to be alerted ONLY if the state is OFFLINE for 10 minutes or more.
The forums I have read say you need to use a combination of "max_check_attempts", "first_notification_delay", "check_interval", and "retry_interval", but it's still confusing how to achieve this. In forums I have seen: "Setting "first_notification_delay" to 10 means that it will wait 10 minutes before sending out a notification." But does that mean that when the alert triggered immediately, Nagios just waited 10 minutes to send it? In that time the server may have been healthy for 8 minutes, yet the initial notification still goes out after the 10-minute wait? Because I don't need that.
Here is the best forum I found (https://support.nagios.com/forum/viewto ... =7&t=27110), but I still need the initial question answered. Does the first notification delay just wait to send me the notification that there WAS an issue? I need it to not send a notification if, at the end of the first notification delay period, the state is OK (basically, to check the state again to see if it still needs to send a notification after those 10 min).
I hope that makes sense.
Still totally confused: First Notification Delay
Re: Still totally confused: First Notification Delay
How it is now (if the server is back online within 10 min):
First check: the service is in a HARD STATE "OK"
Server goes offline
5 min later, second check shows SOFT STATE "Critical" -> Critical notification sent (elapsed time from first STATE CHANGE: 0 minutes)
1 min later, third check (first retry attempt of three): SOFT STATE "Critical" still (elapsed time from first STATE CHANGE: 1 minute)
1 min later, fourth check (second retry attempt of three): SOFT STATE "Critical" (elapsed time from first STATE CHANGE: 2 minutes)
Server is back online
1 min later, fifth check (third retry attempt): HARD STATE "OK" -> Recovery notification sent (elapsed time from first STATE CHANGE: 3 minutes)
__________________________________________________________________________________
How I need it (if the server is back online within 10 min):
First check: the service is in a HARD STATE "OK"
Server goes offline
5 min later, second check shows SOFT STATE "Critical" -> NO notification (elapsed time from first STATE CHANGE: 0 minutes)
1 min later, third check (first retry attempt of three): SOFT STATE "Critical" still (elapsed time from first STATE CHANGE: 1 minute)
1 min later, fourth check (second retry attempt of three): SOFT STATE "Critical" (elapsed time from first STATE CHANGE: 2 minutes)
Server is back online
1 min later, fifth check (third retry attempt): HARD STATE "OK" -> NO notification (elapsed time from first STATE CHANGE: 3 minutes)
How I need it (if the server is NOT back online within 10 min):
First check: the service is in a HARD STATE "OK"
Server goes offline
5 min later, second check shows SOFT STATE "Critical" -> NO notification (elapsed time from first STATE CHANGE: 0 minutes)
1 min later, third check (first retry attempt of three): SOFT STATE "Critical" still (elapsed time from first STATE CHANGE: 1 minute)
1 min later, fourth check (second retry attempt of three): SOFT STATE "Critical" (elapsed time from first STATE CHANGE: 2 minutes)
1 min later, fifth check (third retry attempt): HARD STATE "Critical" -> NO notification (elapsed time from first STATE CHANGE: 3 minutes)
5 min later, sixth check (normal check interval): HARD STATE "Critical" -> NO notification (elapsed time from first STATE CHANGE: 8 minutes)
5 min later, seventh check (normal check interval): HARD STATE "Critical" -> Critical notification sent (elapsed time from first STATE CHANGE: 13 minutes)
Though looking at this now (I am only just seeing it a bit better): if I change max_check_attempts from 3 to 10, the final retry would land at about the 10-minute mark. But how do I configure it to notify ONLY if, at the final check (the change from SOFT STATE Critical to HARD STATE Critical), it is still offline? This is now the goal!
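For anyone following the same arithmetic, here is a minimal sketch of how max_check_attempts and retry_interval translate into the time before a problem turns HARD. It assumes the first failed check counts as attempt 1, that intervals are in minutes (the default interval_length of 60 seconds), and it ignores scheduling latency. The function names are made up for illustration and are not part of Nagios.

```python
import math


def minutes_until_hard_state(max_check_attempts: int, retry_interval: float) -> float:
    """Minutes between the first failed check and the check that makes the problem HARD.

    Assumption: the first failed check is attempt 1, and every further
    attempt happens one retry_interval later.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return (max_check_attempts - 1) * retry_interval


def attempts_for_delay(min_minutes: float, retry_interval: float) -> int:
    """Smallest max_check_attempts whose HARD state arrives no sooner than min_minutes."""
    if retry_interval <= 0:
        raise ValueError("retry_interval must be positive")
    return math.ceil(min_minutes / retry_interval) + 1


if __name__ == "__main__":
    # Settings from the timelines above: 3 attempts, 1-minute retries.
    print(minutes_until_hard_state(3, 1))
    # Proposed change: 10 attempts, 1-minute retries.
    print(minutes_until_hard_state(10, 1))
    # Attempts needed so the HARD state comes at least 10 minutes after the first failure.
    print(attempts_for_delay(10, 1))
```

Under these assumptions, 10 attempts with 1-minute retries reach the HARD state 9 minutes after the first failed check, so 11 attempts would be needed for a full 10 minutes. The outage itself may have started up to one check_interval before that first failed check.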
Re: Still totally confused: First Notification Delay
LOL! I may have just solved for x
http://sites.box293.com/nagios/guides/c ... oft-states
I thought Nagios notified on ANY state change. According to the site above, the notification is sent out on a HARD non-OK state only.
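To make the soft/hard distinction concrete, here is a small simulation of the rule described on that site. It is a simplified sketch and not Nagios code: it treats every non-OK result the same, ignores first_notification_delay, notification_interval, and downtime, and the function name simulate_checks is invented for this example.

```python
from typing import List, Optional, Tuple


def simulate_checks(
    results: List[str], max_check_attempts: int
) -> List[Tuple[str, str, Optional[str]]]:
    """Return (result, state_type, notification) for each check result in order.

    Simplified model: a problem stays SOFT until max_check_attempts
    consecutive non-OK results have been seen, then becomes HARD and
    triggers one PROBLEM notification. A RECOVERY notification is sent
    only when an OK result follows a HARD problem.
    """
    timeline = []
    attempt = 0               # consecutive non-OK results seen so far
    in_hard_problem = False   # True once the problem has gone HARD

    for result in results:
        notification = None
        if result == "OK":
            if in_hard_problem:
                state_type = "HARD"
                notification = "RECOVERY"
            elif attempt > 0:
                state_type = "SOFT"   # soft recovery: nobody was notified
            else:
                state_type = "HARD"   # steady OK
            attempt = 0
            in_hard_problem = False
        elif in_hard_problem:
            state_type = "HARD"       # already notified, still down
        else:
            attempt += 1
            if attempt >= max_check_attempts:
                state_type = "HARD"
                notification = "PROBLEM"
                in_hard_problem = True
            else:
                state_type = "SOFT"
        timeline.append((result, state_type, notification))
    return timeline


if __name__ == "__main__":
    # Short outage: recovers before the third attempt, so nothing is sent.
    for row in simulate_checks(["OK", "CRITICAL", "CRITICAL", "OK"], 3):
        print(row)
    # Long outage: the third consecutive failure goes HARD and notifies.
    for row in simulate_checks(["OK", "CRITICAL", "CRITICAL", "CRITICAL", "OK"], 3):
        print(row)
```

In this model a service that bounces during a nightly reset never leaves the SOFT state, so neither a problem nor a recovery notification goes out, which matches the "How I need it" timelines.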
Re: Still totally confused: First Notification Delay
HA! I finally got it! I'm going to try to delete my post. If I cannot, please close this as resolved.