define host{
    use                     generic-host    ; Name of host template to use
    host_name               name
    alias                   name
    address                 8.4.4.4
    check_command           check-host-alive
    max_check_attempts      10
    notification_interval   60
    notification_period     24x7
    notification_options    d,u,r
}
define host{
    name                            generic-host    ; The name of this host template
    notifications_enabled           1               ; Host notifications are enabled
    event_handler_enabled           1               ; Host event handler is enabled
    flap_detection_enabled          1               ; Flap detection is enabled
    failure_prediction_enabled      1               ; Failure prediction is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information across program restarts
    retain_nonstatus_information    1               ; Retain non-status information across program restarts
    notification_period             24x7            ; Send host notifications at any time
    register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
I have configured my system to send notifications, and right now (if I understand this correctly) it sends a notification once the service has been alerting for 10 minutes and keeps notifying until the alert is gone. I would like to tweak the notification timing so that the first notification is sent after 3 minutes and the next one 10 minutes later (and so on) until the alert is gone. Can someone advise how I can accomplish this?
Those notification intervals sound waaaaay too verbose, and you are going to annoy people pretty darn quickly with those settings. Here is how you can accomplish it, even though I would advise otherwise:
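The settings that went with this reply are not shown here, so the following is only a sketch of one way to get close to that timing, not necessarily what was originally suggested. It assumes the default interval_length of 60 seconds (so the numbers are minutes) and a 1-minute retry_interval, neither of which appears in the config posted above. With max_check_attempts at 4, the host reaches a hard DOWN state after the first failed check plus three retries, about 3 minutes later, and notification_interval 10 repeats the notification every 10 minutes until recovery.

```
define host{
    use                     generic-host
    host_name               name
    alias                   name
    address                 8.4.4.4
    check_command           check-host-alive
    retry_interval          1       ; Assumed: recheck every minute while the problem is unconfirmed
    max_check_attempts      4       ; First failed check plus 3 retries, about 3 minutes to a hard state
    notification_interval   10      ; Re-notify every 10 minutes until the host recovers
    notification_period     24x7
    notification_options    d,u,r
}
```

The same three directives (retry_interval, max_check_attempts, notification_interval) can be set in a define service block if the alert in question is a service rather than a host. There is also a first_notification_delay directive that holds back the first notification for a number of time units; check the documentation for your Nagios version before relying on it.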
I believe he was suggesting that you increase the amount of time between notifications. 60 seconds is quite fast, unless of course you have a team that works VERY fast.
Think about it like this: you get your first notification, and you now have 10 minutes to fix the problem before the next one. Are you likely to forget about the problem in the space of 10 minutes? Are you always going to be able to respond to a problem within 10 minutes, let alone solve it? What about complex outages where you might have 10+ services in a critical state that aren't part of a dependency structure, each alerting every 10 minutes?
You may also wish to consider whether 2 minutes is long enough to filter out false positives. This will likely be dictated by company policy, but the noisier your monitoring is, the more likely people are to start ignoring it (like the story of the boy who cried wolf).
We use the following metrics:
Regular check interval: 10 minutes
Retry check interval: 2 minutes
Total failed attempts before notification: 3
Notification interval: 1 per hour
Notify only on critical
This may or may not be suitable for you, but we know about a problem within 4 to 14 minutes of it occurring, and depending on its urgency we may or may not get to it within an hour. Warnings are dealt with ad hoc from a NOC screen.
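For reference, those settings map roughly onto a service definition like the sketch below. It assumes the default interval_length of 60 seconds, so the numbers are minutes; the template name, host name, service description and check command are placeholders for illustration, not a real config. The 4 to 14 minute window falls out of the check timing: a problem can sit unnoticed for up to 10 minutes until the next regular check, and then two retries 2 minutes apart take another 4 minutes before the third failed attempt makes the state hard and the notification goes out.

```
define service{
    use                     generic-service         ; Placeholder template name
    host_name               name                    ; Placeholder host
    service_description     Example Service         ; Placeholder description
    check_command           check_example_service   ; Hypothetical command name
    check_interval          10      ; Regular check every 10 minutes
    retry_interval          2       ; Recheck every 2 minutes after a failure
    max_check_attempts      3       ; Third failed attempt triggers the notification
    notification_interval   60      ; Repeat at most once per hour
    notification_options    c       ; Critical only; add r if recovery notices are wanted
    notification_period     24x7
}
```

On older Nagios versions the first two directives are spelled normal_check_interval and retry_check_interval.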