OK, after reviewing your configs, I think it would be worthwhile to spend some time with the following docs. There seems to be some confusion about what some of the config directives do, and about active vs. passive checks.
http://nagios.sourceforge.net/docs/3_0/ ... .html#host
http://nagios.sourceforge.net/docs/3_0/ ... ml#service
name xiwizard_ITS_Camera_host
alias
check_command check_xi_host_ping!3000.0!80%!5000.0!100%
use xiwizard_generic_host
max_check_attempts 1000
check_interval 10
retry_interval 5
active_checks_enabled 1
check_period 24x7
check_freshness 1
freshness_threshold 1800
I'm seeing settings like this on several of the templates, and I'm honestly not sure what effect they will have on the monitoring engine, other than to say your results will be... unpredictable.
The only time you want to utilize freshness checking is if you're using purely passive checks. If you've got a passive check that Nagios is simply waiting for results for, the freshness check can be used to trigger an alert if the results are stale. You should never use freshness checking with active checks.
Max check attempts is how many times Nagios will retry a check if it detects a problem.
X = max_check_attempts
Y = retry_interval
If Nagios detects a problem, it will retry the check every Y minutes, up to X times, to determine whether the problem persists. If the host or service is in a problem state for X consecutive checks, an alert will be sent. The way things are set up on the system right now creates an enormous number of retries, and it seems like the setting you might actually want for some of these is simply:
notifications_enabled 0
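To put rough numbers on the X and Y relationship above, here is a small sketch of the arithmetic. It assumes the default interval_length of 60 seconds (so intervals are in minutes) and that the first failing check counts as attempt 1; the function name is just something I made up for illustration, not anything from Nagios itself.

```python
def minutes_until_alert(max_check_attempts: int, retry_interval: float) -> float:
    """Approximate minutes between the first failed check and the alert.

    Assumes the first failed check is attempt 1, so there are
    (max_check_attempts - 1) retries spaced retry_interval minutes apart.
    """
    retries = max(max_check_attempts - 1, 0)
    return retries * retry_interval


# Values taken from the template quoted above: X = 1000, Y = 5.
current = minutes_until_alert(1000, 5)
print(f"current template: about {current:.0f} minutes of retries "
      f"({current / 60 / 24:.1f} days)")

# A hypothetical, more conventional setting for comparison.
example = minutes_until_alert(3, 2)
print(f"3 attempts, 2 minute retries: about {example:.0f} minutes")
```

With the values from your template, that works out to several days of retrying before anything would be sent, which is why I suspect the intent was really to suppress notifications.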
So, as for where to go from here, I would:
- Stop your monitoring engine.
- Delete /usr/local/nagios/var/retention.dat
- Remove ALL freshness checking from all templates and objects
- Revise max_check_attempts on your templates and objects. I would recommend against setting max_check_attempts higher than 10 if you can help it; otherwise you're just wasting resources on the monitoring engine.
- Although there are exceptions, most of the time retries should happen inside the regular check interval. If you're checking the host/service every 10 minutes, it doesn't make a lot of sense to have 60 minutes' worth of retries at 5-minute intervals. If you want a 60-minute buffer before notifications go out, use the first_notification_delay directive instead.
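If there are a lot of templates to go through, something like the following could help find the ones that need attention. This is only a rough sketch: it assumes the definitions are in the usual `define host { ... }` / `define service { ... }` form, it does not follow `use` inheritance, and the function names and the limit of 10 are my own choices for illustration. The sample text is the template quoted above wrapped in a `define host` block.

```python
import re

# Matches "define <type> { ... }" blocks; Nagios definitions do not nest.
DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def parse_objects(text):
    """Return a list of (object_type, {directive: value}) tuples."""
    objects = []
    for obj_type, body in DEFINE_RE.findall(text):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop trailing ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        objects.append((obj_type, directives))
    return objects


def audit(text, max_attempts_limit=10):
    """Flag freshness checking on active checks and very high retry counts."""
    findings = []
    for obj_type, d in parse_objects(text):
        label = (d.get("name") or d.get("host_name")
                 or d.get("service_description") or "(unnamed)")
        if d.get("check_freshness") == "1" and d.get("active_checks_enabled") == "1":
            findings.append((obj_type, label,
                             "check_freshness is enabled alongside active checks"))
        attempts = d.get("max_check_attempts", "")
        if attempts.isdigit() and int(attempts) > max_attempts_limit:
            findings.append((obj_type, label,
                             f"max_check_attempts {attempts} is above {max_attempts_limit}"))
    return findings


SAMPLE = """
define host {
    name                    xiwizard_ITS_Camera_host
    use                     xiwizard_generic_host
    max_check_attempts      1000
    check_interval          10
    retry_interval          5
    active_checks_enabled   1
    check_period            24x7
    check_freshness         1
    freshness_threshold     1800
}
"""

for obj_type, label, message in audit(SAMPLE):
    print(f"{obj_type} {label}: {message}")
```

To run it against real files, you would read each config file's contents and pass the text to `audit()`. Because inheritance is not resolved, an object that only picks up these settings through `use` will not be flagged, so treat the output as a starting point rather than a complete list.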