HARD;OK plus notification without a HARD;CRITICAL

Post by **eloyd** » Wed Mar 13, 2019 8:19 am

It's been a while, but we finally have a customer worthy of my asking a question here.

Take a close look at this nagios.log file snippet (UNIX datestamp converted to human readable form). I've changed service name to ServiceName and host name to HostName to preserve some secret details. The important stuff is from 05:26:47 through 05:27:54:

Wed Mar 13 00:11:27 2019 - SERVICE ALERT: HostName;ServiceName;CRITICAL;SOFT;1;CRITICAL - Plugin timed out
Wed Mar 13 00:12:29 2019 - SERVICE ALERT: HostName;ServiceName;OK;SOFT;2;UK-BACKUP:true UK-LTS:true UK-VMtemplates:true
Wed Mar 13 05:25:36 2019 - SERVICE ALERT: HostName;ServiceName;CRITICAL;SOFT;1;CRITICAL - Plugin timed out
Wed Mar 13 05:26:47 2019 - SERVICE ALERT: HostName;ServiceName;CRITICAL;SOFT;2;CRITICAL - Plugin timed out
Wed Mar 13 05:27:54 2019 - SERVICE NOTIFICATION: HostName;ServiceName;OK;notify-service-by-email;UK-BACKUP:true UK-LTS:true UK-VMtemplates:true
Wed Mar 13 05:27:54 2019 - SERVICE ALERT: HostName;ServiceName;OK;HARD;3;UK-BACKUP:true UK-LTS:true UK-VMtemplates:true
Wed Mar 13 05:53:12 2019 - SERVICE ALERT: HostName;ServiceName;CRITICAL;SOFT;1;CRITICAL - Plugin timed out
Wed Mar 13 05:54:14 2019 - SERVICE ALERT: HostName;ServiceName;OK;SOFT;2;UK-BACKUP:true UK-LTS:true UK-VMtemplates:true

The recovery at 05:27:54 sent an OK;HARD notification. Yet, there was never a CRITICAL;HARD state and there was never a critical notification sent in the first place. This sounds suspiciously like a bug to me. Nagios XI 5.5.11 freshly rebuilt just last week.

This customer has LOTS of examples of improper notifications, but most of them stopped after the 5.5.11 upgrade (for reasons I can't go into here, they don't upgrade in place; they always reinstall from scratch). This is the first one since then that has sent an OK;HARD notification without having first sent a CRITICAL;HARD notification. Also, shouldn't the HARD;OK have been a SOFT;OK in the first place? After all, there was no HARD;CRITICAL first.
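
The pattern described above can be checked mechanically. Here is a minimal sketch in Python, assuming the human-readable line format quoted in this thread (timestamp, " - ", record type, then semicolon-separated fields). The function name `find_orphan_recoveries` and the rule that any OK alert closes an open problem are illustrative assumptions, not anything taken from Nagios itself.

```python
# Hypothetical helper: flag OK (recovery) notifications that were not preceded
# by a HARD non-OK alert for the same host/service.
STATES = {"OK", "WARNING", "CRITICAL", "UNKNOWN"}


def find_orphan_recoveries(lines):
    """Return (timestamp, host, service) for each suspicious OK notification."""
    hard_open = {}  # (host, service) -> True while a HARD problem is open
    orphans = []
    for raw in lines:
        ts, sep, rest = raw.strip().partition(" - ")
        kind, sep2, payload = rest.partition(": ")
        if not sep or not sep2:
            continue  # not a line in the expected format
        fields = payload.split(";")
        if kind in ("SERVICE ALERT", "CURRENT SERVICE STATE"):
            if len(fields) < 4:
                continue
            host, service, state, state_type = fields[:4]
            if state == "OK":
                # Assumption: any OK alert (SOFT or HARD) closes the problem.
                hard_open[(host, service)] = False
            elif state_type == "HARD":
                hard_open[(host, service)] = True
        elif kind == "SERVICE NOTIFICATION":
            # The contact field may or may not be present (the snippets here
            # are sanitized), so locate the state field and count backwards.
            idx = next(
                (i for i, f in enumerate(fields) if i >= 2 and f in STATES),
                None,
            )
            if idx is None:
                continue
            host, service, state = fields[idx - 2], fields[idx - 1], fields[idx]
            if state == "OK" and not hard_open.get((host, service), False):
                orphans.append((ts, host, service))
    return orphans
```

Feeding it a converted log, for example `find_orphan_recoveries(open("nagios-human.log"))` (the file name is hypothetical), lists every recovery notification that has no matching HARD problem before it. Host alerts, flapping, and acknowledgements are ignored in this sketch.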

Thanks.

Post by **eloyd** » Wed Mar 13, 2019 9:09 am

More. Same problem. This time, it's a HOST check. Notification occurs at 02:44:07 when the service should be in SOFT;OK, not HARD;OK state:

Wed Mar 13 00:00:00 2019 - CURRENT HOST STATE: HostName;UP;HARD;1;OK - 172.17.24.26: rta 118.413ms, lost 0%
Wed Mar 13 00:00:00 2019 - CURRENT SERVICE STATE: HostName;PING;OK;HARD;1;OK - 172.17.24.26: rta 121.112ms, lost 0%
Wed Mar 13 01:03:33 2019 - HOST ALERT: HostName;DOWN;SOFT;1;(Host check timed out after 30.01 seconds)
Wed Mar 13 01:05:05 2019 - HOST ALERT: HostName;UNREACHABLE;SOFT;2;CRITICAL - 172.17.24.26: rta nan, lost 100%
Wed Mar 13 01:05:11 2019 - SERVICE ALERT: HostName;PING;CRITICAL;HARD;1;CRITICAL - 172.17.24.26: rta nan, lost 100%
Wed Mar 13 01:05:39 2019 - HOST ALERT: HostName;UP;SOFT;1;OK - 172.17.24.26: rta 118.773ms, lost 0%
Wed Mar 13 01:09:44 2019 - SERVICE ALERT: HostName;PING;OK;SOFT;1;OK - 172.17.24.26: rta 132.757ms, lost 0%
Wed Mar 13 02:41:33 2019 - SERVICE ALERT: HostName;PING;CRITICAL;SOFT;1;CRITICAL - 172.17.24.26: rta nan, lost 100%
Wed Mar 13 02:43:04 2019 - SERVICE ALERT: HostName;PING;CRITICAL;SOFT;2;CRITICAL - 172.17.24.26: rta nan, lost 100%
Wed Mar 13 02:44:07 2019 - SERVICE NOTIFICATION: HostName;PING;OK;notify-service-by-email;OK - 172.17.24.26: rta 119.729ms, lost 0%
Wed Mar 13 02:44:07 2019 - SERVICE ALERT: HostName;PING;OK;HARD;3;OK - 172.17.24.26: rta 119.729ms, lost 0%
Wed Mar 13 04:05:41 2019 - SERVICE ALERT: HostName;PING;WARNING;SOFT;1;WARNING - 172.17.24.26: rta 118.980ms, lost 80%
Wed Mar 13 04:07:11 2019 - SERVICE ALERT: HostName;PING;OK;SOFT;2;OK - 172.17.24.26: rta 118.870ms, lost 20%
Wed Mar 13 04:16:19 2019 - HOST ALERT: HostName;UNREACHABLE;SOFT;1;CRITICAL - 172.17.24.26: rta nan, lost 100%
Wed Mar 13 04:16:21 2019 - HOST ALERT: HostName;UP;SOFT;1;OK - 172.17.24.26: rta 441.533ms, lost 0%

Post by **jomann** » Wed Mar 13, 2019 4:02 pm

This does look a little suspicious. I have a couple questions that might narrow it down.

So the services in question send these notifications when they go from SOFT -> HARD on the 3rd check attempt, and max check attempts is set to 3, right? If that is the case, then the second question is: do you know what kind of email it is sending? Is it an actual recovery message or a state change message?

I can do a little testing myself too to see if the last check causes this to happen, and if it does it sounds like it could be a bug.

Post by **eloyd** » Wed Mar 13, 2019 6:17 pm

Waiting to get actual copies of the notifications, but they were definitely recovery messages.

And yes, retry_limit is 3.

Post by **eloyd** » Thu Mar 14, 2019 7:38 am

Another example that happened this morning. You can see the forced status after the log rotate at midnight (GMT) and then, well, you know how to read logs.

Thu Mar 14 00:00:00 2019 - CURRENT SERVICE STATE: HostName;DisksUsage;OK;HARD;1;OK : (80%) D: 13% G: 49% C: 38% F: 6%
Thu Mar 14 00:10:51 2019 - SERVICE ALERT: HostName;DisksUsage;UNKNOWN;SOFT;1;ERROR: General time-out (Alarm signal)
Thu Mar 14 00:11:52 2019 - SERVICE ALERT: HostName;DisksUsage;OK;SOFT;2;OK : (80%) D: 13% G: 49% C: 44% F: 6%
Thu Mar 14 08:03:28 2019 - SERVICE ALERT: HostName;DisksUsage;UNKNOWN;SOFT;1;ERROR: General time-out (Alarm signal)
Thu Mar 14 08:04:42 2019 - SERVICE ALERT: HostName;DisksUsage;UNKNOWN;SOFT;2;ERROR: General time-out (Alarm signal)
Thu Mar 14 08:05:43 2019 - SERVICE NOTIFICATION: Worldwide_Inside;HostName;DisksUsage;OK;notify-service-by-email;OK : (80%) D: 13% G: 49% C: 48% F: 7%
Thu Mar 14 08:05:43 2019 - SERVICE ALERT: HostName;DisksUsage;OK;HARD;3;OK : (80%) D: 13% G: 49% C: 48% F: 7%

Post by **eloyd** » Thu Mar 14, 2019 10:13 am

Here's the original (sanitized) alert email:

From: [email protected] <[email protected]>
Sent: Wednesday, March 13, 2019 01:28
To: [email protected]
Subject: ** RECOVERY Service Alert: HostName/ServiceName is OK **

***** Nagios Monitor XI Alert *****

Notification Type: RECOVERY

Service: ServiceName
Host: HostName
Address: 172.16.26.35
State: OK

Date/Time: Wed Mar 13 05:27:54 UTC 2019

Additional Info:

UK-BACKUP:true UK-LTS:true UK-VMtemplates:true

ACK this problem: https://nagios-box/nagiosxi/includes/components/xicore/status.php?show=servicedetail&host=HostName&service=ServiceName
SMS slug: Ack RECOVERY HostName,ServiceName>

Post by **mbeebe** » Thu Mar 14, 2019 10:20 am

Hello,

We are also seeing this issue. I thought this was resolved in Core 4.4.2?

Our stats:

Nagios Core 4.4.3

Nagios XI version: 5.5.9
XI installed from: source

Red Hat Enterprise Linux Server release 7.6 (Maipo)

-- Mike Beebe

Post by **eloyd** » Thu Mar 14, 2019 10:39 am

For what it's worth, we cannot duplicate (or find records of) this on Nagios Core 4.4.1.

Post by **jomann** » Thu Mar 14, 2019 11:46 am

This sounds like a bug in the logic when the check returns OK on the last retry. We should move this over to Github so that I can take a closer look and get a fix in place. I believe the many changes that had to be made to fix everything for 4.4.3 likely broke the logic in this specific case, where the OK is received on the very last retry.
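
For reference, the behavior this thread expects on that last retry can be written down as a small model. This is only a sketch of the documented soft/hard state rules (a recovery from a soft problem is a soft recovery, with no notification), not the Core source. The function name `expected_events` is made up for illustration, and the model deliberately ignores host state, flapping, and changes between different non-OK states.

```python
def expected_events(results, max_check_attempts=3):
    """Return (state, state_type, notify) for each check result, starting
    from a steady HARD OK state.

    Simplified model: every non-OK result counts as one more attempt of the
    same problem; a problem is HARD once max_check_attempts is reached.
    """
    events = []
    attempt = 0              # consecutive non-OK results so far
    problem_is_hard = False  # has the current problem reached HARD?
    for state in results:
        if state != "OK":
            attempt += 1
            became_hard = attempt >= max_check_attempts and not problem_is_hard
            if attempt >= max_check_attempts:
                problem_is_hard = True
            state_type = "HARD" if problem_is_hard else "SOFT"
            # Notify only at the moment the problem first becomes HARD.
            events.append((state, state_type, became_hard))
        else:
            if attempt == 0:
                # Steady OK, nothing to report.
                events.append(("OK", "HARD", False))
            else:
                # Recovery type follows the type of the problem it ends.
                state_type = "HARD" if problem_is_hard else "SOFT"
                events.append(("OK", state_type, problem_is_hard))
            attempt = 0
            problem_is_hard = False
    return events
```

With `max_check_attempts=3`, the sequence CRITICAL, CRITICAL, OK ends in `("OK", "SOFT", False)` under this model, whereas the logs above show OK;HARD;3 plus a recovery notification for the same sequence.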

Post by **eloyd** » Thu Mar 14, 2019 11:54 am

Move away. I'll continue 'round the horn' when it's there.
