
Wrong SOFT/HARD state logic

Posted: Tue Apr 28, 2020 3:23 pm
by nmatsunaga
Hi,

First, the required information for a fast resolution
1. Linux Distribution

Code:

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
2. 32 or 64bit? 64 bit

Code:

# uname -a
Linux arba-of16l 3.10.0-693.2.2.el7.x86_64 #1 SMP Sat Sep 9 03:55:24 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
3. Manual Install of XI
4. Special configurations: livestatus neb module
5. Nagios XI version: 5.6.10

Code:

[root@arba-of16l ~]# cat /usr/local/nagiosxi/var/xiversion
###################################
# DO NOT DELETE THIS FILE!
# Nagios XI version information
###################################
full=5.6.10
major=5
minor=6.10
releasedate=2020-01-16
release=5610
6. Nagios Core version: 4.4.5

Code:

[root@arba-of16l ~]# /usr/local/nagios/bin/nagios -v
 
Nagios Core 4.4.5
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2019-08-20
License: GPL
Description:
We have been observing several issues regarding state type transition of services whose host enters a SOFT DOWN state.
The following occurs:
- HOST enters SOFT DOWN (attempt 1 of 5)
- SERVICE enters immediate CRITICAL HARD at first attempt (despite having max_check_attempts = 5) (BUG)
- HOST check returns UP, it is a SOFT recovery
- SERVICE check returns OK, it is logged as a SOFT recovery (a second BUG if the HARD state were correct, which it is not)

Consequences:
- the availability report, which looks for HARD states, sees a CRITICAL HARD with no recovery, so a large amount of unavailable time is reported

There are other topics on this forum showing the same behaviour, but they focus on the wrong aspects (though with the good intention of achieving consistent behaviour):
1. Unexpected HARD/SOFT state changes - no recovery alert. https://support.nagios.com/forum/viewto ... 16&t=57147
- wrong aspect: it focuses on the missing recovery alert, when no notification should have been sent in the first place for a soft failure
Note: I don't have host_down_disable_service_checks set and I don't want to enable it.
2. Service recovery logged as soft instead of hard https://support.nagios.com/forum/viewto ... 16&t=56488
- wrong aspect: it focuses on the recovery not being HARD, when the failure was a soft failure in the first place.
- which refers to the issue https://github.com/NagiosEnterprises/na ... issues/651

In my view, the main issue is that the service must not enter a HARD state immediately, regardless of the current attempt count, while the host is SOFT DOWN. If this is a so-called "feature", it defeats the main notification-masking mechanism for when a HOST is HARD DOWN, namely: avoiding irrelevant service notifications.

Proper behaviour would be:
- HOST enters SOFT DOWN (attempt 1 of 5)
- SERVICE enters SOFT CRITICAL (attempt 1 of 5)
- HOST check returns UP, it is a SOFT recovery
- SERVICE check returns OK, it is SOFT recovery

For privacy I have masked host and service names.

Code:

[root@arba-of16l archives]# grep "BPI_Link_Factory" nagios-03-06-2020-00.log
[1583377200] CURRENT HOST STATE: BPI_Link_Factory;UP;HARD;1;OK - Group health is 100% with 0 problem(s).
[1583377200] CURRENT SERVICE STATE: BPI_Link_Factory;BPI Process:Link_Factory;OK;HARD;1;OK - Group health is 100% with 0 problem(s).
[1583454663] wproc:   host=BPI_Link_Factory; service=(null);
[1583454663] Warning: Check of host 'BPI_Link_Factory' timed out after 30.01 seconds
[1583454663] HOST ALERT: BPI_Link_Factory;DOWN;SOFT;1;(Host check timed out after 30.01 seconds)
[1583454693] wproc:   host=BPI_Link_Factory; service=BPI Process:Link_Factory;
[1583454693] Warning: Check of service 'BPI Process:Link_Factory' on host 'BPI_Link_Factory' timed out after 60.028s!
[1583454693] SERVICE ALERT: BPI_Link_Factory;BPI Process:Link_Factory;CRITICAL;HARD;1;(Service check timed out after 60.03 seconds)
[1583454722] HOST FLAPPING ALERT: BPI_Link_Factory;STARTED; Host appears to have started flapping (22.6% change > 20.0% threshold)
[1583454722] HOST ALERT: BPI_Link_Factory;UP;SOFT;1;OK - Group health is 100% with 0 problem(s).
[1583454930] SERVICE ALERT: BPI_Link_Factory;BPI Process:Link_Factory;OK;SOFT;1;OK - Group health is 100% with 0 problem(s).
[1583457412] HOST FLAPPING ALERT: BPI_Link_Factory;STOPPED; Host appears to have stopped flapping (3.9% change < 5.0% threshold)
According to the changelog https://github.com/NagiosEnterprises/na ... /Changelog this was fixed in 4.4.4, but I am still seeing it on 4.4.5 :shock:

This bug is a serious problem for us because we have to correct log files in order to obtain accurate availability figures.

Re: Wrong SOFT/HARD state logic

Posted: Tue Apr 28, 2020 4:25 pm
by cdienger
Thanks for all the information, but I think the underlying idea that a service should go into a SOFT NON-OK state if the host is in a SOFT DOWN or UNREACHABLE state is incorrect. As mentioned in one of the posts you provided, this is as designed and is documented at https://assets.nagios.com/downloads/nag ... types.html:
Hard States
Hard states occur for hosts and services in the following situations:

When a host or service check results in a non-UP or non-OK state and it has been (re)checked the number of times specified by the max_check_attempts option in the host or service definition. This is a hard error state.
When a host or service transitions from one hard error state to another error state (e.g. WARNING to CRITICAL).
When a service check results in a non-OK state and its corresponding host is either DOWN or UNREACHABLE.
When a host or service recovers from a hard error state. This is considered to be a hard recovery.
When a passive host check is received. Passive host checks are treated as HARD unless the passive_host_checks_are_soft option is enabled.

Re: Wrong SOFT/HARD state logic

Posted: Tue Apr 28, 2020 4:33 pm
by ssax
This is actually intended functionality per development. We discussed this at great length, and the functionality works as they intend it to. The workaround is to set host_down_disable_service_checks=1 in your /usr/local/nagios/etc/nagios.cfg (then restart the nagios service) so the services won't run at all if the host is in a problem state (hard or soft).
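For reference, the workaround looks like this (the path and directive are taken from the post above; how you restart the service depends on your install, e.g. `systemctl restart nagios` on a systemd-based system):

```
# /usr/local/nagios/etc/nagios.cfg
host_down_disable_service_checks=1
```

After editing, restart the nagios service for the change to take effect.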

The way it's supposed to work is that when a service check detects a problem, Nagios then checks the host; if the host is in a down state (hard or soft), the service goes into a hard problem state and does not pass through the soft states. I'm referring to this specifically:
When a service check results in a non-OK state, Nagios will check the host that the service is associated with to determine whether or not it is UP. If the host is not UP (i.e. it is either down or unreachable), Nagios will immediately put the service into a hard non-OK state and it will reset the current attempt number to 1. Since the service is in a hard non-OK state, the service check will be rescheduled at the normal frequency specified by the check_interval option instead of the retry_interval option.
Taken from here:

https://assets.nagios.com/downloads/nag ... uling.html

Re: Wrong SOFT/HARD state logic

Posted: Tue Apr 28, 2020 5:09 pm
by ssax
After talking with the devs a little more about this, it looks like there is indeed a bug with the service recovery showing only SOFT, as you're saying.

https://github.com/NagiosEnterprises/na ... issues/759

Re: Wrong SOFT/HARD state logic

Posted: Mon May 04, 2020 6:31 pm
by nmatsunaga
ssax wrote:After talking with the devs a little more about this, it looks like there is indeed a bug with the service recovery showing only SOFT, as you're saying.

https://github.com/NagiosEnterprises/na ... issues/759
I think it is wrong to conclude that the service should enter an immediate HARD state while the host state type is still SOFT.
- @cdienger argues that the service state type is correct and should be HARD, based on:
https://assets.nagios.com/downloads/nag ... types.html
When a service check results in a non-OK state and its corresponding host is either DOWN or UNREACHABLE.

In my opinion a HARD qualifier is missing from that documentation, and it should read:
When a service check results in a non-OK state and its corresponding host is either HARD DOWN or HARD UNREACHABLE.

Why do I think that? Based on:
https://assets.nagios.com/downloads/nag ... tions.html
If the host is in hard non-OK state, notifications for services on this host won't be sent out.
If the host is SOFT DOWN and a non-OK service check results in a non-OK HARD state, this notification-masking logic becomes pointless.
Suppose I did my homework and compared the worst-case time for the host to reach a non-OK HARD state against the best-case time for a service to reach a non-OK HARD state, tuning the intervals so that the host always goes HARD first and service notifications are avoided when the real problem is the host. In the scenario you are backing, I will always get the service notifications even when it is a host problem, so we are allowing irrelevant service notifications that were previously filtered. The only case where this filter still avoids service notifications is when max_check_attempts = 1 for the host check (which is not always desired).

Worst-case host HARD time: check_interval + (max_check_attempts - 1)*retry_interval
Best-case service HARD time: (max_check_attempts - 1)*retry_interval
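The timing argument above can be sketched numerically. This is a minimal illustration of the two formulas; the interval and attempt values are hypothetical examples, not taken from the poster's configuration:

```python
def worst_case_hard_time(check_interval, retry_interval, max_check_attempts):
    # Worst case: the failure starts just after a scheduled check, so a full
    # check_interval passes before attempt 1, then (max - 1) retries follow.
    return check_interval + (max_check_attempts - 1) * retry_interval

def best_case_hard_time(retry_interval, max_check_attempts):
    # Best case: attempt 1 runs immediately, then (max - 1) retries follow.
    return (max_check_attempts - 1) * retry_interval

# Hypothetical values, in minutes:
print(worst_case_hard_time(5, 1, 5))  # host:    5 + 4*1 = 9
print(best_case_hard_time(1, 5))      # service: 4*1 = 4
```

With these numbers the host is guaranteed to reach HARD within 9 minutes, after the service can first reach HARD at 4 minutes, so an admin can deliberately tune intervals the other way around to make the host go HARD first.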

Moreover, in https://github.com/NagiosEnterprises/na ... e/checks.c the comment on line 1345 supports your side:

Code:

/* service hard state change, because if host is down/unreachable
  the docs say we have a hard state change (but no notification) */
But let me present a case where this suppresses a needed service notification:
- HOST enters SOFT DOWN (attempt 1 of 5)
- SERVICE enters HARD CRITICAL (attempt 1 of 5) with NO NOTIFICATION
- HOST check returns UP, it is a SOFT recovery
- SERVICE check continues to return CRITICAL because it was a real service failure, yet no notification will be sent
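The trap in the sequence above can be illustrated with a deliberately simplified toy model. This is not the actual Nagios checks.c logic: soft retry counting, scheduling, and flapping are elided, and the function and event names are invented for illustration. It only encodes the two rules under discussion, namely that a non-OK service whose host is non-UP goes straight to HARD, and that service notifications are masked while the host is non-UP:

```python
def run(events):
    """Toy model: returns the list of service notifications that get sent."""
    host_up = True
    svc_hard_problem = False
    notifications = []
    for ev in events:
        if ev == "host_down":
            host_up = False
        elif ev == "host_up":
            host_up = True  # a SOFT recovery in the thread's example
        elif ev == "svc_critical":
            if not host_up:
                # Rule under discussion: immediate HARD, notification masked
                # because the host is down.
                svc_hard_problem = True
            elif not svc_hard_problem:
                svc_hard_problem = True  # soft states elided in this toy model
                notifications.append("SERVICE CRITICAL")
            # Already HARD: no state *change*, so no new notification is sent,
            # even though the host is back UP.
    return notifications

# Host goes SOFT DOWN first, service fails, host recovers, service keeps
# failing: the real service problem never produces a notification.
print(run(["host_down", "svc_critical", "host_up", "svc_critical"]))  # []
```

Under these two rules, a genuine service failure that happens to coincide with a transient host blip is silently swallowed, which is exactly the poster's objection.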

Furthermore, I have done some testing in a lab with Nagios Core 4.4.5 and 4.3.4 (the newest version that, as I recall, behaves properly).

Nagios Core 4.4.5: we can see exactly where the bug appears (when the host check fails before the service check)

Test case #1: service check fails before host check
notes: the service goes to an artificial 6/6 HARD state and then a HARD recovery occurs

Code:

[1588630964] SERVICE ALERT: test;Manual test service;CRITICAL;SOFT;1;CRITICAL test
[1588630964] HOST ALERT: test;DOWN;SOFT;1;CRITICAL test
[1588631024] SERVICE ALERT: test;Manual test service;CRITICAL;HARD;6;CRITICAL test
[1588631024] HOST ALERT: test;DOWN;SOFT;2;CRITICAL test
(checks back to normal)
[1588631084] SERVICE ALERT: test;Manual test service;OK;HARD;6;OK test
[1588631084] HOST ALERT: test;UP;SOFT;1;OK test
Test case #2: host check fails before service check
note: the service goes to an artificial 1/6 HARD state and then does a SOFT recovery (BUG!)

Code:

[1588631324] HOST ALERT: test;DOWN;SOFT;1;CRITICAL test
[1588631358] SERVICE ALERT: test;Manual test service;CRITICAL;HARD;1;CRITICAL test
(checks back to normal)
[1588631384] HOST ALERT: test;UP;SOFT;1;OK test
[1588631418] SERVICE ALERT: test;Manual test service;OK;SOFT;1;OK test
Nagios Core 4.3.4

Test case #1: service check fails before host check

Code:

[1588632579] SERVICE ALERT: test;Manual test service;CRITICAL;SOFT;1;CRITICAL test
[1588632579] HOST ALERT: test;DOWN;SOFT;1;CRITICAL test
(MISSING SERVICE ALERT line even though I have log_service_retries)
[1588632639] HOST ALERT: test;DOWN;SOFT;2;CRITICAL test
(checks back to normal)
[1588632699] SERVICE ALERT: test;Manual test service;OK;SOFT;3;OK test
[1588632699] HOST ALERT: test;UP;SOFT;3;OK test
Test case #2: host check fails before service check

Code:

[1588632902] HOST ALERT: test;DOWN;SOFT;1;CRITICAL test
[1588632929] SERVICE ALERT: test;Manual test service;CRITICAL;SOFT;1;CRITICAL test
[1588632929] HOST ALERT: test;DOWN;SOFT;2;CRITICAL test
(checks back to normal)
[1588632989] SERVICE ALERT: test;Manual test service;OK;SOFT;2;OK test
[1588632989] HOST ALERT: test;UP;SOFT;3;OK test
Test case #3
notes: here we can see that the service goes to a 3/6 HARD state (it goes to HARD immediately after the host goes HARD)

Code:

[1588633155] HOST ALERT: test;DOWN;SOFT;1;CRITICAL test
[1588633171] SERVICE ALERT: test;Manual test service;CRITICAL;SOFT;1;CRITICAL test
[1588633171] HOST ALERT: test;DOWN;SOFT;2;CRITICAL test
[1588633231] HOST ALERT: test;DOWN;HARD;3;CRITICAL test
[1588633231] HOST NOTIFICATION: nagiosadmin;test;DOWN;notify-host-by-email;CRITICAL test
[1588633291] SERVICE ALERT: test;Manual test service;CRITICAL;HARD;3;CRITICAL test
In summary, please roll back to the 4.3.4 behaviour: an immediate service HARD state should occur only if the host is non-OK HARD.

I’m eager to receive your feedback.

Re: Wrong SOFT/HARD state logic

Posted: Tue May 05, 2020 11:22 am
by ssax
We (the support team) have already had this discussion with development, and they said the current functionality is how they want it (minus the soft-recovery bug). While I agree with your opinion on it, development was clear about how they want the functionality to work going forward.

Please submit a request for them to change it and state your case; they will see it here:

https://github.com/NagiosEnterprises/nagioscore/issues

Please keep in mind that the decision to implement the enhancement is at the discretion of our development team.

Re: Wrong SOFT/HARD state logic

Posted: Wed May 13, 2020 9:19 am
by nmatsunaga
But the new logic that the dev team discussed and decided on poses a big risk of missing notifications, as in the current example:

- HOST enters SOFT DOWN (attempt 1 of 5)
- SERVICE enters HARD CRITICAL (attempt 1 of 5) with NO NOTIFICATION because the host is down
- HOST check returns UP, it is a SOFT recovery
- SERVICE check continues to return CRITICAL because it was a real failure, yet no notification will be sent

The underlying bug is that the service notification was suppressed by a host non-UP SOFT state.

This is not an enhancement/feature request; it is a bug-fix request.

I will open an issue on https://github.com/NagiosEnterprises/nagioscore/issues

Re: Wrong SOFT/HARD state logic

Posted: Thu May 14, 2020 7:25 am
by scottwilkerson
nmatsunaga wrote:I will open an issue on https://github.com/NagiosEnterprises/nagioscore/issues
Ok, sounds good.

Locking forum thread