We've found that when a host goes into a hard state, its current_attempt changes to 1 on the check just after the transition. Can you confirm that this is a bug? Is there a setting that triggers this behavior? It doesn't happen for services. This happens in both nagios3 and nagios4.
The example below shows the host's attempts increasing. Once it hits the hard state, current_attempt goes back to 1 after the next check.
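A hypothetical sequence illustrating the pattern (values are made up, assuming max_check_attempts is 5):

```
check 1: DOWN  SOFT  1/5
check 2: DOWN  SOFT  2/5
check 3: DOWN  SOFT  3/5
check 4: DOWN  SOFT  4/5
check 5: DOWN  HARD  5/5   <- hard state reached
check 6: DOWN  HARD  1/5   <- current_attempt resets
```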
Is the above behavior happening when the host a service depends on is in a down state? If so, this may offer insight:
As always, there are exceptions to the rules. When a service check results in a non-OK state, Nagios will check the host that the service is associated with to determine whether or not it is up (see the note below for info on how this is done). If the host is not up (i.e. it is either down or unreachable), Nagios will immediately put the service into a hard non-OK state and it will reset the current attempt number to 1. Since the service is in a hard non-OK state, the service check will be rescheduled at the normal frequency specified by the check_interval option instead of the retry_interval option.
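A minimal sketch of that documented rule, written as Python-style pseudocode rather than Nagios's actual C internals; the class, function, and field names here are all illustrative:

```python
from dataclasses import dataclass

HOST_UP = 0      # host state codes: 0 = UP, 1 = DOWN, 2 = UNREACHABLE
PLUGIN_OK = 0    # plugin return codes: 0 = OK, 1 = WARNING, 2 = CRITICAL

@dataclass
class Service:
    current_attempt: int = 1
    max_attempts: int = 5
    state_type: str = "SOFT"
    check_interval: int = 300  # seconds between scheduled checks
    retry_interval: int = 60   # seconds between retries while SOFT

def handle_service_result(svc: Service, result: int, host_state: int) -> int:
    """Apply the documented rule; return the delay until the next check."""
    if result != PLUGIN_OK and host_state != HOST_UP:
        # Host is down/unreachable: immediately force a hard non-OK state
        # and reset the attempt counter to 1.
        svc.state_type = "HARD"
        svc.current_attempt = 1
        return svc.check_interval          # normal frequency, not retry
    if result != PLUGIN_OK:
        if svc.current_attempt < svc.max_attempts:
            svc.state_type = "SOFT"
            svc.current_attempt += 1       # keep retrying
            return svc.retry_interval
        svc.state_type = "HARD"            # retries exhausted
        return svc.check_interval
    svc.state_type = "HARD"                # OK result: back to a hard OK
    svc.current_attempt = 1
    return svc.check_interval
```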
Is this happening across the board? What happens if you submit a passive UP state to one of the hosts showing this behavior on its services, and then disable active checking on that host to keep it locked in that state?
I turned off active checks on the host and submitted a passive UP to it, and I still get the same behavior on the services.
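For reference, a sketch of how those two external commands can be submitted from a script; the command-file path is the default for a source install, and the host name is hypothetical:

```python
import time

CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"  # adjust for your install
HOST = "somehost"                                  # hypothetical host name
now = int(time.time())

with open(CMD_FILE, "w") as f:
    # Status code 0 = UP for PROCESS_HOST_CHECK_RESULT.
    f.write(f"[{now}] PROCESS_HOST_CHECK_RESULT;{HOST};0;forced UP for testing\n")
    # Keep the host locked in that state by disabling active checks.
    f.write(f"[{now}] DISABLE_HOST_CHECK;{HOST}\n")
```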
Do you see this on your side? I see this on multiple instances of Nagios -- there isn't a system where I haven't seen this behavior. I prefer the behavior of services, where the current attempt stays at the max attempts when they go into a hard state. I want to know the reasoning for the host's current attempt going to 1. It's not consistent with services, which is why we and our customers have noticed it.
Do you know whereabouts in the code this is happening? I'm looking for a starting point/hint. I can debug it and try to get some more information.
Cool. I originally saw it in the UI/XI, but was using livestatus for the report.
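For anyone pulling the same numbers, a minimal Livestatus query over the UNIX socket shows the raw counters per host; the socket path varies by install, so the one below is just an example:

```python
import socket

SOCKET_PATH = "/usr/local/nagios/var/rw/live"  # adjust for your install

query = (
    "GET hosts\n"
    "Columns: name state state_type current_attempt max_check_attempts\n"
    "\n"
)

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect(SOCKET_PATH)
s.sendall(query.encode())
s.shutdown(socket.SHUT_WR)  # signal end of query so Livestatus answers
response = b""
while chunk := s.recv(4096):
    response += chunk
s.close()
print(response.decode())    # one semicolon-separated line per host
```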
We discovered this by looking at the Host Details pages and wondering why the hosts were critical with an attempt of 1/5. We thought something was up until we saw that the state history was correct. I know that the current attempt doesn't always correlate to a hard state (e.g. a service's host is down, or dependencies are involved), but this one seemed off.
Thanks for submitting the bug and looking into this.