rajasegar wrote: Thanks. What is the logic behind this?
Does Nagios check the host status before every service check, or only once before firing all the service checks?
It gets a little complicated, but here are the basics on a standard Nagios server:
Every time a service check is due to be executed, Nagios looks at the host object and determines whether it is in a HARD down state.
If the host is HARD down, the service check is still executed and re-scheduled at the next check_interval or retry_interval. HOWEVER, no service notifications are sent.
While the service is progressing through its max_check_attempts, the retry_interval is used.
Once max_check_attempts has been reached, the service is in a HARD problem state and the check_interval is used.
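To make the interval switching concrete, here is a rough Python sketch of the standard behaviour described above. It is only a simplified model, not Nagios code: the Service class, apply_result() and next_check_delay() are names I made up for illustration, every problem is treated as CRITICAL, and intervals are assumed to be in minutes.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical, simplified model of a service object."""
    check_interval: float      # used in OK and HARD problem states
    retry_interval: float      # used while in a SOFT problem state
    max_check_attempts: int
    current_attempt: int = 1
    state: str = "OK"
    state_type: str = "HARD"


def next_check_delay(svc: Service) -> float:
    """Return the delay until the next check for the current state."""
    if svc.state != "OK" and svc.state_type == "SOFT":
        return svc.retry_interval
    return svc.check_interval


def apply_result(svc: Service, result_ok: bool) -> float:
    """Record one check result and return the delay to the next check."""
    if result_ok:
        svc.state, svc.state_type, svc.current_attempt = "OK", "HARD", 1
    else:
        if svc.state == "OK":
            svc.current_attempt = 1          # first failed attempt
        elif svc.state_type == "SOFT":
            svc.current_attempt += 1         # still retrying
        svc.state = "CRITICAL"
        reached_max = svc.current_attempt >= svc.max_check_attempts
        svc.state_type = "HARD" if reached_max else "SOFT"
    return next_check_delay(svc)


if __name__ == "__main__":
    svc = Service(check_interval=5, retry_interval=1, max_check_attempts=3)
    for ok in (False, False, False, True):
        delay = apply_result(svc, ok)
        print(svc.state, svc.state_type, svc.current_attempt, delay)
```

With these made-up values the first two failures are re-checked at the retry_interval, and the third failure reaches max_check_attempts, so the service goes HARD and falls back to the check_interval.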
When host_down_disable_service_checks=1 is implemented, some of this is a little different.
Assume that the host object has gone through the max_check_attempts in its definition, so it is now in a HARD down state.
Assume that while this has been happening, the service has a larger check_interval, so it has not had a check since the host was last UP.
So when the service check determines the host is down HARD, it is re-scheduled at the next check_interval of the service object, because the last state of the service object was OK.
The service check is NOT executed (the purpose of host_down_disable_service_checks=1) and hence the service stays in an OK state.
Realistically, while the host object is going through the max_check_attempts in its definition to determine whether it is down HARD, a service with a small check_interval will detect an issue before the host reaches a HARD down state and will start re-scheduling its next check using the retry_interval directive.
The service then gets checked more often until its own max_check_attempts is reached. While this is happening, the service object is in a SOFT state.
Once the host object goes into a HARD down state, the service check is NOT executed (the purpose of host_down_disable_service_checks=1) and will continue to be re-scheduled at the retry_interval, as the service is currently in a SOFT state.
However, because it is in a SOFT state, it will remain in a Critical/Warning/Unknown state until the next successful execution of the service check, which happens when the host returns to a HARD UP state.
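The two scenarios above can be sketched as one small decision function. Again, this is only my simplified model of the behaviour described in this post, not the actual Nagios scheduler: plan_service_check() and its parameters are hypothetical names, and the interval values in the usage lines are invented.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    execute: bool                 # is the service check actually run?
    delay: float                  # delay until the next scheduled check
    suppress_notifications: bool  # are service notifications held back?


def plan_service_check(host_hard_down: bool,
                       disable_when_host_down: bool,
                       svc_state: str,
                       svc_state_type: str,
                       check_interval: float,
                       retry_interval: float) -> Plan:
    """Model what happens when a service check comes due.

    disable_when_host_down stands in for host_down_disable_service_checks=1.
    """
    # The check is skipped only when the host is HARD down AND the option is on.
    execute = not (host_hard_down and disable_when_host_down)
    # A skipped check leaves the service's last known state untouched,
    # so re-scheduling is based on that last state.
    in_soft_problem = svc_state != "OK" and svc_state_type == "SOFT"
    delay = retry_interval if in_soft_problem else check_interval
    return Plan(execute, delay, suppress_notifications=host_hard_down)


if __name__ == "__main__":
    # Service was last OK: skipped, re-scheduled at check_interval, stays OK.
    print(plan_service_check(True, True, "OK", "HARD", 10, 1))
    # Service already SOFT CRITICAL: skipped, re-scheduled at retry_interval.
    print(plan_service_check(True, True, "CRITICAL", "SOFT", 10, 1))
```

The second case is the one that leaves services sitting in a SOFT state in the Tactical Overview until the host comes back UP.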
Because it can sometimes take a while for host objects to go down hard, you will get some services that appear in a SOFT state which will appear in the Tactical Overview. It is very hard to avoid this.
Does this make sense?
Here's a detailed example that explains host and service check intervals:
http://sites.box293.com/nagios/guides/c ... -intervals
Here's a detailed example that explains hard and soft states:
http://sites.box293.com/nagios/guides/c ... oft-states
One key topic in all of this is to make sure your host objects go HARD down BEFORE the services get a chance to. Using the same intervals on your host and service objects will cause unnecessary notifications. I covered some of this in my talk at the Nagios World Conference on Nagios XI Best Practices:
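As a rough way to sanity-check your own settings, you can estimate how long each object takes to go HARD after its first failed check. The sketch below is a back-of-the-envelope calculation with made-up numbers. It assumes intervals are in minutes, and it ignores how long the first failure takes to be noticed as well as any on-demand host checks, so treat the result as an estimate only. minutes_to_hard() is a helper name I invented for this example.

```python
def minutes_to_hard(max_check_attempts: int, retry_interval: float) -> float:
    """Estimate the time from the first failed check to the HARD state.

    The first failure counts as attempt 1, so only the remaining
    (max_check_attempts - 1) attempts are spaced at the retry_interval.
    """
    return (max_check_attempts - 1) * retry_interval


if __name__ == "__main__":
    # Hypothetical values, not recommendations.
    host = minutes_to_hard(max_check_attempts=3, retry_interval=1)
    service = minutes_to_hard(max_check_attempts=4, retry_interval=2)
    print(f"host goes HARD after ~{host} min, service after ~{service} min")
    if host < service:
        print("Host should go HARD down before the service does.")
    else:
        print("Service may go HARD first: expect extra notifications.")
```

If you give the host and the service identical max_check_attempts and retry_interval values, the two estimates come out the same, which is the situation I am warning about above.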
https://www.youtube.com/watch?v=6WlZrG-_sAI