Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more.
define host {
    name                           generic-host
    notifications_enabled          1
    event_handler_enabled          1
    flap_detection_enabled         1
    failure_prediction_enabled     1
    process_perf_data              1
    retain_status_information      1
    retain_nonstatus_information   1
    notification_period            24x7
    register                       0
    notification_interval          0
    notification_options           d,u,r
    check_period                   24x7
    check_interval                 5    ; Actively check the host every 5 minutes
    retry_interval                 5    ; Schedule host check retries at 5 minute intervals
    max_check_attempts             12   ; Check each host 12 times before going to HARD state
    passive_checks_enabled         1    ; Passive host checks are enabled/accepted
}
I then define some hosts using this template. The hosts also have a few simple services defined.
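For reference, a minimal sketch of such a host and service definition might look like the following (the host name, address, check command, and contact are placeholders, not taken from the original post):

```cfg
define host {
    use         generic-host        ; inherit from the template above
    host_name   web01               ; hypothetical host name
    address     192.0.2.10          ; RFC 5737 documentation address
}

define service {
    host_name             web01
    service_description   PING
    check_command         check_ping!100.0,20%!500.0,60%
    check_interval        1
    retry_interval        1
    max_check_attempts    4
    check_period          24x7
    notification_interval 0
    notification_period   24x7
    contacts              nagiosadmin
}
```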
In this case, whenever a host goes down, I would expect it to reach the HARD DOWN state after roughly 5 × 12 = 60 minutes. However, since Nagios 4.0.0 (tested on 4.0.7 and 4.1.1), it appears that when the host is unreachable, the host check is re-executed at every service check, causing the host HARD state to be reached much more quickly.
I can confirm that this worked fine on Nagios 3.5.1. Is this a regression since 4.0.0? Or can I restore the old behaviour with a configuration option?
On-demand checks are performed whenever Nagios sees a need to obtain the latest status information about a particular host or service. For example, when Nagios is determining the reachability of a host, it will often perform on-demand checks of parent and child hosts to accurately determine the status of a particular network segment. On-demand checks also occur in the predictive dependency check logic in order to ensure Nagios has the most accurate status information.
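The parent-probing reachability logic described above can be sketched conceptually as follows. This is a toy Python model, not the actual Nagios Core C code; the `Host` class and `check` callback are illustrative stand-ins:

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    state: str                      # simulated "real" state of the host
    parents: list = field(default_factory=list)

def check(host):
    return host.state               # stand-in for an active host check

def determine_host_state(host, check):
    """When a host check fails, probe parents to tell DOWN from UNREACHABLE."""
    if check(host) == "UP":
        return "UP"
    # Host check failed: run on-demand checks of each parent host to
    # decide whether this host itself is DOWN or merely UNREACHABLE
    # behind a failed network segment.
    for parent in host.parents:
        if check(parent) == "UP":
            # At least one parent is reachable, so the problem is local.
            return "DOWN"
    return "UNREACHABLE" if host.parents else "DOWN"

# Gateway up, web host down -> the web host itself is DOWN.
gw = Host("gateway", "UP")
web = Host("web01", "DOWN", parents=[gw])
print(determine_host_state(web, check))   # DOWN
```

If the gateway were also down, no parent would answer and the sketch would return `UNREACHABLE` instead, which mirrors the documented DOWN-versus-UNREACHABLE distinction.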
However, this behavior was also present in the 3.x versions, so I'm not sure why it would have changed. Does this sound like the behavior you are seeing?
I had indeed read about the on-demand checks. They also happen on 3.5.1, but there they don't seem to increment the counter that counts towards the HARD state. I can confirm that for certain tomorrow.
It does sound like a bug to me. This behaviour basically means that the check interval and retry interval timers become meaningless as soon as you assign a service to a host, because any service check forces an increment of the host's check attempt counter, regardless of when the next host check is actually scheduled.
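The effect on the attempt counter can be illustrated with a small toy model. This is not Nagios code; the service count and check frequency are assumed numbers chosen to show the difference in scale:

```python
# Toy model: minutes until a DOWN host reaches the HARD state, with and
# without service-triggered on-demand host checks bumping the attempt
# counter. All parameters are illustrative.

def minutes_to_hard(max_check_attempts, retry_interval,
                    ondemand_checks_per_minute=0):
    attempts = 1            # the first failed check puts the host in SOFT DOWN
    minutes = 0.0
    while attempts < max_check_attempts:
        if ondemand_checks_per_minute:
            # The next counter bump comes from whichever fires first:
            # the scheduled retry or an on-demand check.
            step = min(retry_interval, 1.0 / ondemand_checks_per_minute)
        else:
            step = retry_interval
        minutes += step
        attempts += 1
    return minutes

# Expected (3.x) behaviour: only scheduled retries count.
print(minutes_to_hard(12, 5))                                 # 55.0 minutes
# Reported 4.x behaviour: four services checked once a minute each
# trigger an on-demand host check that increments the counter.
print(minutes_to_hard(12, 5, ondemand_checks_per_minute=4))   # 2.75 minutes
```

With only scheduled retries, the 12-attempt threshold takes close to the expected hour; with a handful of busy services on the host, the HARD state arrives in a few minutes, matching what the thread describes.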
I don't think this is the same issue. Your issue is about the host state not going HARD immediately when max_check_attempts is set to 1. The issue I'm having is about the attempt counter increasing when on-demand checks are executed after a service state change.
There's not a ton I can do from a support perspective to change this behavior; it's pretty far into the dev realm to decide whether this is a bug (most likely they will agree it is) and then to fix it.
Great to hear it! I didn't have time yesterday to look too deeply into the code, but it looks like it would be a pretty minor change. It's still up to the devs to decide if this is going to make it in, but they can always make it a configurable option as a compromise.
I'll be closing this thread now, but feel free to open another if you need anything in the future!