HARD state behavior
Posted: Thu Dec 12, 2019 10:04 am
Every now and then I see some odd state behavior. We want to be sure there is always an OK HARD state after a CRITICAL HARD state has recovered.
Below is an example where I've seen a CRITICAL HARD state followed by an OK SOFT state on recovery. Between the two service checks, the remote host was reported as down. In our core config we have host_down_disable_service_checks=1, so service checks do not run while a host is down, and that seems to be interfering with the OK recovery state. Have you seen this before, and is there anything we could do to resolve it?
Date / Time Host Service State State Type Attempt
2019-12-12 07:29:42 <remote host> Service status for: <service> OK SOFT 1 of 5
2019-12-12 07:29:41 <remote host> Swap Usage OK SOFT 1 of 5
2019-12-12 07:29:31 <remote host> UP HARD 1 of 5
2019-12-12 07:24:29 <remote host> DOWN HARD 5 of 5
2019-12-12 07:23:21 <remote host> DOWN SOFT 4 of 5
2019-12-12 07:22:12 <remote host> DOWN SOFT 3 of 5
2019-12-12 07:21:04 <remote host> DOWN SOFT 2 of 5
2019-12-12 07:20:43 <remote host> Service status for: <service> CRITICAL HARD 1 of 5
2019-12-12 07:20:42 <remote host> Swap Usage CRITICAL HARD 1 of 5
2019-12-12 07:19:55 <remote host> DOWN SOFT 1 of 5
2019-12-12 06:01:04 <remote host> CPU Usage OK SOFT 2 of 5
2019-12-12 06:00:03 <remote host> CRITICAL SOFT 1 of 5
2019-12-11 16:57:13 <remote host> OK SOFT 2 of 5
2019-12-11 16:56:12 <remote host> CPU Usage CRITICAL SOFT 1 of 5
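In case it helps with reproducing this, here is a rough Python sketch of how I would flag these transitions in an exported state history. It is not a Nagios API: the tuple layout and the function name find_soft_recoveries are made up for illustration, and the rows are copied by hand from the table above (only rows that show a service name are included).

```python
from datetime import datetime

# (timestamp, service, state, state_type) rows copied from the state history
# above. Host rows and rows without a service name are left out.
EVENTS = [
    ("2019-12-12 07:29:42", "Service status for: <service>", "OK", "SOFT"),
    ("2019-12-12 07:29:41", "Swap Usage", "OK", "SOFT"),
    ("2019-12-12 07:20:43", "Service status for: <service>", "CRITICAL", "HARD"),
    ("2019-12-12 07:20:42", "Swap Usage", "CRITICAL", "HARD"),
    ("2019-12-12 06:01:04", "CPU Usage", "OK", "SOFT"),
    ("2019-12-11 16:56:12", "CPU Usage", "CRITICAL", "SOFT"),
]


def find_soft_recoveries(events):
    """Return (timestamp, service) pairs where an OK SOFT entry directly
    follows a CRITICAL HARD entry for the same service."""
    def when(event):
        return datetime.strptime(event[0], "%Y-%m-%d %H:%M:%S")

    last_seen = {}  # service name -> (state, state_type) of its previous entry
    flagged = []
    for ts, service, state, state_type in sorted(events, key=when):
        previous = last_seen.get(service)
        if previous == ("CRITICAL", "HARD") and (state, state_type) == ("OK", "SOFT"):
            flagged.append((ts, service))
        last_seen[service] = (state, state_type)
    return flagged


if __name__ == "__main__":
    for ts, service in find_soft_recoveries(EVENTS):
        print(f"{ts}  {service}: OK SOFT directly after CRITICAL HARD")
```

With the rows above this flags the Swap Usage and "Service status for: <service>" recoveries but not CPU Usage, since that one only ever reached CRITICAL SOFT.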