Wrong SOFT/HARD state logic
Posted: Tue Apr 28, 2020 3:23 pm
Hi,
First, the required information for a fast resolution
1. Linux Distribution: Red Hat Enterprise Linux Server release 7.4 (Maipo)
2. 32 or 64bit? 64 bit
3. Manual Install of XI
4. Special configurations: livestatus neb module
5. Nagios XI version: 5.6.10
6. Nagios Core version: 4.4.5
Description:
We have been observing several problems with the state type transitions of services whose host enters a SOFT DOWN state.
The following occurs:
- HOST enters SOFT DOWN (attempt 1 of 5)
- SERVICE immediately enters CRITICAL HARD on its first attempt (despite having max_check_attempts = 5) (BUG)
- HOST check returns UP, it is a SOFT recovery
- SERVICE check returns OK, it is logged as a SOFT recovery (this would be a BUG if the HARD state were correct, which it is not)
Consequences:
- the availability report, which only looks at HARD states, sees a CRITICAL HARD state with no matching recovery, so a large amount of unavailable time is reported
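To illustrate this consequence, here is a minimal sketch of how a report that only honours HARD states would read such a sequence. The function `hard_downtime`, the event tuples and the timestamps are hypothetical and made up for illustration; this is not how the Nagios XI availability report is actually implemented.

```python
def hard_downtime(events, end_time):
    """Sum the seconds spent in a non-OK state, looking only at HARD events.

    events: list of (timestamp, state, state_type) tuples, sorted by timestamp.
    SOFT events are skipped, mimicking a report that only trusts HARD states.
    """
    down_since = None
    total = 0
    for ts, state, state_type in events:
        if state_type != "HARD":
            continue  # a SOFT recovery is ignored, so the outage stays open
        if state != "OK" and down_since is None:
            down_since = ts
        elif state == "OK" and down_since is not None:
            total += ts - down_since
            down_since = None
    if down_since is not None:
        total += end_time - down_since  # outage never closed by a HARD OK
    return total


# Hypothetical timeline in seconds: HARD problem at 100, SOFT recovery at 340.
events = [
    (0, "OK", "HARD"),
    (100, "CRITICAL", "HARD"),
    (340, "OK", "SOFT"),
]
print(hard_downtime(events, end_time=3600))  # 3500
```

The service was really unavailable for 240 seconds, but because the recovery is logged as SOFT the outage is counted until the end of the reporting window.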
There are other topics on this forum showing the same behaviour, but they focus on the wrong aspects (although with the good intention of getting consistent behaviour):
1. Unexpected HARD/SOFT state changes - no recovery alert. https://support.nagios.com/forum/viewto ... 16&t=57147
- wrong aspect: the focus is on the missing recovery alert, when for a soft failure no notification should have been sent in the first place
Note: I don't have host_down_disable_service_checks set and I don't want to enable it.
2. Service recovery logged as soft instead of hard https://support.nagios.com/forum/viewto ... 16&t=56488
- wrong aspect: the focus is on the recovery not being HARD, when the failure itself should have been soft in the first place.
- which refers to the issue https://github.com/NagiosEnterprises/na ... issues/651
In my view, the main issue is that the service must not enter a HARD state immediately, regardless of its current attempt count, when the host is only SOFT DOWN. If this were a so-called "feature", it would defeat the purpose of the main notification masking mechanism, which applies when a HOST is HARD DOWN and exists to avoid irrelevant service notifications.
Proper behaviour would be:
- HOST enters SOFT DOWN (attempt 1 of 5)
- SERVICE enters SOFT CRITICAL (attempt 1 of 5)
- HOST check returns UP, it is a SOFT recovery
- SERVICE check returns OK, it is SOFT recovery
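The sequence I expect can be written down as a small sketch. This only illustrates the behaviour I am asking for; it is not taken from the Nagios Core source, and `next_state_type` is a hypothetical helper.

```python
def next_state_type(state, attempt, max_check_attempts, prev_state, prev_type):
    """Return the state type expected for a new service check result."""
    if state != "OK":
        # A problem only becomes HARD once the retries are exhausted.
        return "HARD" if attempt >= max_check_attempts else "SOFT"
    # Recovery: SOFT if the problem was still SOFT, otherwise HARD.
    if prev_state != "OK" and prev_type == "SOFT":
        return "SOFT"
    return "HARD"


# First failed attempt out of 5: expected SOFT, not HARD.
print(next_state_type("CRITICAL", 1, 5, "OK", "HARD"))
# Recovery from that SOFT problem: expected SOFT recovery.
print(next_state_type("OK", 1, 5, "CRITICAL", "SOFT"))
```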
For privacy, I have masked the host and service names.
According to the changelog https://github.com/NagiosEnterprises/na ... /Changelog this was fixed in 4.4.4, but I am still seeing it in 4.4.5.
This bug is a big problem for us because we have to correct the log files in order to get proper availability figures.
1. Linux Distribution
Code: Select all
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
Code: Select all
# uname -a
Linux arba-of16l 3.10.0-693.2.2.el7.x86_64 #1 SMP Sat Sep 9 03:55:24 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
5. Nagios XI version: 5.6.10
Code: Select all
[root@arba-of16l ~]# cat /usr/local/nagiosxi/var/xiversion
###################################
# DO NOT DELETE THIS FILE!
# Nagios XI version information
###################################
full=5.6.10
major=5
minor=6.10
releasedate=2020-01-16
release=5610
Code: Select all
[root@arba-of16l ~]# /usr/local/nagios/bin/nagios -v
Nagios Core 4.4.5
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2019-08-20
License: GPL
Code: Select all
[root@arba-of16l archives]# grep "BPI_Link_Factory" nagios-03-06-2020-00.log
[1583377200] CURRENT HOST STATE: BPI_Link_Factory;UP;HARD;1;OK - Group health is 100% with 0 problem(s).
[1583377200] CURRENT SERVICE STATE: BPI_Link_Factory;BPI Process:Link_Factory;OK;HARD;1;OK - Group health is 100% with 0 problem(s).
[1583454663] wproc: host=BPI_Link_Factory; service=(null);
[1583454663] Warning: Check of host 'BPI_Link_Factory' timed out after 30.01 seconds
[1583454663] HOST ALERT: BPI_Link_Factory;DOWN;SOFT;1;(Host check timed out after 30.01 seconds)
[1583454693] wproc: host=BPI_Link_Factory; service=BPI Process:Link_Factory;
[1583454693] Warning: Check of service 'BPI Process:Link_Factory' on host 'BPI_Link_Factory' timed out after 60.028s!
[1583454693] SERVICE ALERT: BPI_Link_Factory;BPI Process:Link_Factory;CRITICAL;HARD;1;(Service check timed out after 60.03 seconds)
[1583454722] HOST FLAPPING ALERT: BPI_Link_Factory;STARTED; Host appears to have started flapping (22.6% change > 20.0% threshold)
[1583454722] HOST ALERT: BPI_Link_Factory;UP;SOFT;1;OK - Group health is 100% with 0 problem(s).
[1583454930] SERVICE ALERT: BPI_Link_Factory;BPI Process:Link_Factory;OK;SOFT;1;OK - Group health is 100% with 0 problem(s).
[1583457412] HOST FLAPPING ALERT: BPI_Link_Factory;STOPPED; Host appears to have stopped flapping (3.9% change < 5.0% threshold)
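To find other occurrences in the archived logs, a script along the following lines could be used. It is only a sketch: it assumes the `HOST ALERT` / `SERVICE ALERT` line layout shown in the excerpt above, and it takes max_check_attempts as a parameter (5 here) instead of reading it from the object configuration. `find_premature_hard` is a hypothetical name.

```python
import re

# Matches "[timestamp] HOST ALERT: ..." and "[timestamp] SERVICE ALERT: ...";
# other entries such as "HOST FLAPPING ALERT" are skipped.
ALERT = re.compile(r"^\[(\d+)\] (HOST|SERVICE) ALERT: (.*)$")


def find_premature_hard(lines, max_check_attempts=5):
    """Return (timestamp, host, service) for non-OK SERVICE ALERTs that are HARD
    before max_check_attempts while the host's last logged alert was SOFT DOWN."""
    host_state = {}  # host -> (state, state_type) from the latest HOST ALERT
    hits = []
    for line in lines:
        m = ALERT.match(line)
        if not m:
            continue
        ts, kind, fields = int(m.group(1)), m.group(2), m.group(3).split(";")
        if kind == "HOST":
            host_state[fields[0]] = (fields[1], fields[2])
        else:
            host, service, state, state_type, attempt = fields[:5]
            if (state != "OK" and state_type == "HARD"
                    and int(attempt) < max_check_attempts
                    and host_state.get(host) == ("DOWN", "SOFT")):
                hits.append((ts, host, service))
    return hits


sample = [
    "[1583454663] HOST ALERT: BPI_Link_Factory;DOWN;SOFT;1;(Host check timed out after 30.01 seconds)",
    "[1583454693] SERVICE ALERT: BPI_Link_Factory;BPI Process:Link_Factory;CRITICAL;HARD;1;(Service check timed out after 60.03 seconds)",
    "[1583454722] HOST ALERT: BPI_Link_Factory;UP;SOFT;1;OK - Group health is 100% with 0 problem(s).",
    "[1583454930] SERVICE ALERT: BPI_Link_Factory;BPI Process:Link_Factory;OK;SOFT;1;OK - Group health is 100% with 0 problem(s).",
]
print(find_premature_hard(sample))
```

On the four alert lines from the excerpt this flags the SERVICE ALERT at 1583454693.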