HARD;OK plus notification without a HARD;CRITICAL

eloyd
Cool Title Here
Posts: 2190
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY

HARD;OK plus notification without a HARD;CRITICAL

Post by eloyd »

It's been a while, but we finally have a customer worthy of my asking a question here. :-)

Take a close look at this nagios.log file snippet (UNIX datestamp converted to human readable form). I've changed service name to ServiceName and host name to HostName to preserve some secret details. The important stuff is from 05:26:47 through 05:27:54:


Wed Mar 13 00:11:27 2019 - SERVICE ALERT: HostName;ServiceName;CRITICAL;SOFT;1;CRITICAL - Plugin timed out
Wed Mar 13 00:12:29 2019 - SERVICE ALERT: HostName;ServiceName;OK;SOFT;2;UK-BACKUP:true UK-LTS:true UK-VMtemplates:true
Wed Mar 13 05:25:36 2019 - SERVICE ALERT: HostName;ServiceName;CRITICAL;SOFT;1;CRITICAL - Plugin timed out
Wed Mar 13 05:26:47 2019 - SERVICE ALERT: HostName;ServiceName;CRITICAL;SOFT;2;CRITICAL - Plugin timed out
Wed Mar 13 05:27:54 2019 - SERVICE NOTIFICATION: HostName;ServiceName;OK;notify-service-by-email;UK-BACKUP:true UK-LTS:true UK-VMtemplates:true
Wed Mar 13 05:27:54 2019 - SERVICE ALERT: HostName;ServiceName;OK;HARD;3;UK-BACKUP:true UK-LTS:true UK-VMtemplates:true
Wed Mar 13 05:53:12 2019 - SERVICE ALERT: HostName;ServiceName;CRITICAL;SOFT;1;CRITICAL - Plugin timed out
Wed Mar 13 05:54:14 2019 - SERVICE ALERT: HostName;ServiceName;OK;SOFT;2;UK-BACKUP:true UK-LTS:true UK-VMtemplates:true

The recovery at 05:27:54 sent an OK;HARD notification. Yet there was never a CRITICAL;HARD state, and no critical notification was ever sent in the first place. This sounds suspiciously like a bug to me. This is Nagios XI 5.5.11, freshly rebuilt just last week.

This customer has LOTS of examples of improper notifications, but most of them stopped after the move to 5.5.11 (for reasons I can't go into here, they don't upgrade; they always freshly reinstall). This is the first one since then that has sent an OK;HARD notification without having first sent a CRITICAL;HARD notification. Also, shouldn't the HARD;OK have been a SOFT;OK in the first place? After all, there was no HARD;CRITICAL first.
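
For readers following along, here is a minimal sketch of the behavior expected above: a recovery that arrives before the retries are exhausted should be a SOFT recovery with no notification, and only a recovery from a HARD problem should notify. This is a simplified, hypothetical model (the `ServiceModel` class and its fields are made up for illustration), not the Nagios Core check-handling code, and it ignores flapping, downtime, and notification options.

```python
from dataclasses import dataclass

OK = "OK"


@dataclass
class ServiceModel:
    """Simplified SOFT/HARD model for one service (illustrative only)."""
    max_check_attempts: int = 3
    state: str = OK
    state_type: str = "HARD"
    attempt: int = 1

    def process(self, result):
        """Apply one check result; return (state, state_type, attempt, notify)."""
        was_problem = self.state != OK
        was_hard = self.state_type == "HARD"

        if result != OK:
            if not was_problem:
                self.attempt = 1        # first failure after OK starts a retry cycle
            elif not was_hard:
                self.attempt += 1       # still inside the retry cycle
            reached_hard = self.attempt >= self.max_check_attempts
            # Notify when the problem first becomes HARD, or the hard problem state changes.
            same_hard_problem = was_problem and was_hard and self.state == result
            notify = reached_hard and not same_hard_problem
            self.state = result
            self.state_type = "HARD" if reached_hard else "SOFT"
            return (self.state, self.state_type, self.attempt, notify)

        # The result is OK from here on.
        if was_problem and not was_hard:
            event = (OK, "SOFT", self.attempt + 1, False)   # soft recovery, no notification
        elif was_problem:
            event = (OK, "HARD", self.attempt, True)        # hard recovery, notify
        else:
            event = (OK, "HARD", 1, False)                  # steady OK
        self.state, self.state_type, self.attempt = OK, "HARD", 1
        return event


if __name__ == "__main__":
    model = ServiceModel(max_check_attempts=3)
    for result in ["CRITICAL", "CRITICAL", "OK"]:
        print(model.process(result))
```

Under this model, the CRITICAL, CRITICAL, OK sequence from the log ends in a SOFT recovery on attempt 3 without a notification, which is what the question above expects instead of the logged OK;HARD;3.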

Thanks.
Eric Loyd • http://everwatch.global • 844.240.EVER • @EricLoyd
I'm a Nagios Fanatic! • Join our public Nagios Discord Server!

Re: HARD;OK plus notification without a HARD;CRITICAL

Post by eloyd »

More of the same problem. This time the log includes HOST checks as well. The notification occurs at 02:44:07, when the service should be in SOFT;OK, not HARD;OK, state:


Wed Mar 13 00:00:00 2019 - CURRENT HOST STATE: HostName;UP;HARD;1;OK - 172.17.24.26: rta 118.413ms, lost 0%
Wed Mar 13 00:00:00 2019 - CURRENT SERVICE STATE: HostName;PING;OK;HARD;1;OK - 172.17.24.26: rta 121.112ms, lost 0%
Wed Mar 13 01:03:33 2019 - HOST ALERT: HostName;DOWN;SOFT;1;(Host check timed out after 30.01 seconds)
Wed Mar 13 01:05:05 2019 - HOST ALERT: HostName;UNREACHABLE;SOFT;2;CRITICAL - 172.17.24.26: rta nan, lost 100%
Wed Mar 13 01:05:11 2019 - SERVICE ALERT: HostName;PING;CRITICAL;HARD;1;CRITICAL - 172.17.24.26: rta nan, lost 100%
Wed Mar 13 01:05:39 2019 - HOST ALERT: HostName;UP;SOFT;1;OK - 172.17.24.26: rta 118.773ms, lost 0%
Wed Mar 13 01:09:44 2019 - SERVICE ALERT: HostName;PING;OK;SOFT;1;OK - 172.17.24.26: rta 132.757ms, lost 0%
Wed Mar 13 02:41:33 2019 - SERVICE ALERT: HostName;PING;CRITICAL;SOFT;1;CRITICAL - 172.17.24.26: rta nan, lost 100%
Wed Mar 13 02:43:04 2019 - SERVICE ALERT: HostName;PING;CRITICAL;SOFT;2;CRITICAL - 172.17.24.26: rta nan, lost 100%
Wed Mar 13 02:44:07 2019 - SERVICE NOTIFICATION: HostName;PING;OK;notify-service-by-email;OK - 172.17.24.26: rta 119.729ms, lost 0%
Wed Mar 13 02:44:07 2019 - SERVICE ALERT: HostName;PING;OK;HARD;3;OK - 172.17.24.26: rta 119.729ms, lost 0%
Wed Mar 13 04:05:41 2019 - SERVICE ALERT: HostName;PING;WARNING;SOFT;1;WARNING - 172.17.24.26: rta 118.980ms, lost 80%
Wed Mar 13 04:07:11 2019 - SERVICE ALERT: HostName;PING;OK;SOFT;2;OK - 172.17.24.26: rta 118.870ms, lost 20%
Wed Mar 13 04:16:19 2019 - HOST ALERT: HostName;UNREACHABLE;SOFT;1;CRITICAL - 172.17.24.26: rta nan, lost 100%
Wed Mar 13 04:16:21 2019 - HOST ALERT: HostName;UP;SOFT;1;OK - 172.17.24.26: rta 441.533ms, lost 0%
jomann
Development Lead
Posts: 611
Joined: Mon Apr 22, 2013 10:06 am
Location: Nagios Enterprises

Re: HARD;OK plus notification without a HARD;CRITICAL

Post by jomann »

This does look a little suspicious. I have a couple of questions that might narrow it down.

First, the services sending these send them when the state goes from SOFT to HARD on the 3rd check attempt, and max check attempts is set to 3, right? If so, the second question: do you know what kind of email is being sent? Is it an actual recovery message or a state change message?

I can do a little testing myself too to see if the last check causes this to happen, and if it does it sounds like it could be a bug.
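
One way to narrow this down from the logs is to scan for the pattern directly. The following is a rough sketch, assuming log lines in the format shown in this thread; the function name `find_suspect_recoveries` and the regular expression are illustrative, not part of any Nagios tool. It flags every SERVICE ALERT logged as OK;HARD whose previous alert for the same host and service was a SOFT problem.

```python
import re

# Matches "SERVICE ALERT: host;service;state;SOFT|HARD;attempt;..." anywhere in a line.
ALERT_RE = re.compile(r"SERVICE ALERT: ([^;]+);([^;]+);([^;]+);(SOFT|HARD);(\d+);")


def find_suspect_recoveries(lines):
    """Return OK;HARD alert lines whose previous alert for that service was a SOFT problem."""
    last = {}       # (host, service) -> (state, state_type) of the previous alert
    suspects = []
    for line in lines:
        match = ALERT_RE.search(line)
        if not match:
            continue    # notifications, host alerts, and other entries are ignored
        host, service, state, state_type, _attempt = match.groups()
        prev = last.get((host, service))
        if (state == "OK" and state_type == "HARD"
                and prev is not None and prev[0] != "OK" and prev[1] == "SOFT"):
            suspects.append(line.rstrip("\n"))
        last[(host, service)] = (state, state_type)
    return suspects


# Sample lines copied from the first log snippet in this thread.
sample = [
    "Wed Mar 13 05:25:36 2019 - SERVICE ALERT: HostName;ServiceName;CRITICAL;SOFT;1;CRITICAL - Plugin timed out",
    "Wed Mar 13 05:26:47 2019 - SERVICE ALERT: HostName;ServiceName;CRITICAL;SOFT;2;CRITICAL - Plugin timed out",
    "Wed Mar 13 05:27:54 2019 - SERVICE NOTIFICATION: HostName;ServiceName;OK;notify-service-by-email;UK-BACKUP:true UK-LTS:true UK-VMtemplates:true",
    "Wed Mar 13 05:27:54 2019 - SERVICE ALERT: HostName;ServiceName;OK;HARD;3;UK-BACKUP:true UK-LTS:true UK-VMtemplates:true",
    "Wed Mar 13 05:53:12 2019 - SERVICE ALERT: HostName;ServiceName;CRITICAL;SOFT;1;CRITICAL - Plugin timed out",
    "Wed Mar 13 05:54:14 2019 - SERVICE ALERT: HostName;ServiceName;OK;SOFT;2;UK-BACKUP:true UK-LTS:true UK-VMtemplates:true",
]

if __name__ == "__main__":
    for hit in find_suspect_recoveries(sample):
        print(hit)
```

This is only a heuristic: it looks at logged alerts, so it cannot tell whether a notification was actually delivered, and it treats each host/service pair independently of host state.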
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Re: HARD;OK plus notification without a HARD;CRITICAL

Post by eloyd »

Waiting to get actual copies of the notifications, but they were definitely recovery messages.

And yes, the retry limit (max_check_attempts) is 3.

Re: HARD;OK plus notification without a HARD;CRITICAL

Post by eloyd »

Another example that happened this morning. You can see the forced status after the log rotate at midnight (GMT) and then, well, you know how to read logs. :-)


Thu Mar 14 00:00:00 2019 - CURRENT SERVICE STATE: HostName;DisksUsage;OK;HARD;1;OK : (80%) D: 13% G: 49% C: 38% F: 6%
Thu Mar 14 00:10:51 2019 - SERVICE ALERT: HostName;DisksUsage;UNKNOWN;SOFT;1;ERROR: General time-out (Alarm signal)
Thu Mar 14 00:11:52 2019 - SERVICE ALERT: HostName;DisksUsage;OK;SOFT;2;OK : (80%) D: 13% G: 49% C: 44% F: 6%
Thu Mar 14 08:03:28 2019 - SERVICE ALERT: HostName;DisksUsage;UNKNOWN;SOFT;1;ERROR: General time-out (Alarm signal)
Thu Mar 14 08:04:42 2019 - SERVICE ALERT: HostName;DisksUsage;UNKNOWN;SOFT;2;ERROR: General time-out (Alarm signal)
Thu Mar 14 08:05:43 2019 - SERVICE NOTIFICATION: Worldwide_Inside;HostName;DisksUsage;OK;notify-service-by-email;OK : (80%) D: 13% G: 49% C: 48% F: 7%
Thu Mar 14 08:05:43 2019 - SERVICE ALERT: HostName;DisksUsage;OK;HARD;3;OK : (80%) D: 13% G: 49% C: 48% F: 7%

Re: HARD;OK plus notification without a HARD;CRITICAL

Post by eloyd »

Here's the original (sanitized) alert email:


From: [email protected] <[email protected]>
Sent: Wednesday, March 13, 2019 01:28
To: [email protected]
Subject: ** RECOVERY Service Alert: HostName/ServiceName is OK **

***** Nagios Monitor XI Alert *****

Notification Type: RECOVERY

Service: ServiceName
Host: HostName
Address: 172.16.26.35
State: OK

Date/Time: Wed Mar 13 05:27:54 UTC 2019

Additional Info:

UK-BACKUP:true UK-LTS:true UK-VMtemplates:true

ACK this problem: https://nagios-box/nagiosxi/includes/components/xicore/status.php?show=servicedetail&host=HostName&service=ServiceName
SMS slug: Ack RECOVERY HostName,ServiceName>

mbeebe
Posts: 144
Joined: Thu Dec 20, 2018 5:12 pm

Re: HARD;OK plus notification without a HARD;CRITICAL

Post by mbeebe »

Hello,

We are also seeing this issue. I thought this was resolved in Core 4.4.2?

Our stats:

Nagios Core 4.4.3

Nagios XI version: 5.5.9
XI installed from: source

Red Hat Enterprise Linux Server release 7.6 (Maipo)

-- Mike Beebe

Re: HARD;OK plus notification without a HARD;CRITICAL

Post by eloyd »

For what it's worth, we cannot duplicate (or find records of) this on Nagios Core 4.4.1.

Re: HARD;OK plus notification without a HARD;CRITICAL

Post by jomann »

This sounds like a bug in the logic when the check returns OK on the last retry. We should move this over to GitHub so that I can take a closer look and get a fix in place. I believe the many changes that had to be made to the logic to fix everything for 4.4.3 likely caused this in the specific case where the OK is received on the very last retry.

Re: HARD;OK plus notification without a HARD;CRITICAL

Post by eloyd »

Move away. I'll continue 'round the horn' when it's there.