SOFT states look like they're resetting without an OK

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
eloyd
Cool Title Here
Posts: 2190
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY

SOFT states look like they're resetting without an OK

Post by eloyd »

We have a customer who says they didn't get notifications for a certain host being down (the host check is a ping, and the only service on the host is a ping as well). Below is the event log, which really confuses me: I see the SOFT errors (max check attempts is 10 for the host; we're changing that), but the attempt numbers look random. I expect to see "SOFT;1" followed by "SOFT;2", "SOFT;3", etc. Instead, I see SOFT 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 2, 2, 3, 4, 5, 6, 7, 8, 2, 3, 4, 5, 6, 7, 8, and then the host finally gets OK without the host check ever reaching a HARD problem state (the Ping service does go HARD CRITICAL at 10:10:14).

Can someone 'splain the non-sequential SOFT checks please? Why do they seem to be resetting without any OKs in the middle? This is on Nagios XI 5.2.9 and I can't upgrade to 5.3.3 just yet (though I hope to be able to by end of year).

Code:

	2016-12-08 10:39:46	SERVICE ALERT: hostname;Ping;OK;HARD;5;OK - 10.1.1.190: rta 2.613ms, lost 0%
Host Recovery	2016-12-08 10:36:26	HOST ALERT: hostname;UP;SOFT;9;OK - 10.1.1.190: rta 2.689ms, lost 0%
Host Unreachable	2016-12-08 10:35:16	HOST ALERT: hostname;UNREACHABLE;SOFT;8;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:30:06	HOST ALERT: hostname;UNREACHABLE;SOFT;7;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:28:56	HOST ALERT: hostname;UNREACHABLE;SOFT;6;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:27:46	HOST ALERT: hostname;UNREACHABLE;SOFT;5;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:26:36	HOST ALERT: hostname;UNREACHABLE;SOFT;4;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:25:26	HOST ALERT: hostname;UNREACHABLE;SOFT;3;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:24:37	HOST ALERT: hostname;UNREACHABLE;SOFT;2;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:23:58	HOST ALERT: hostname;UNREACHABLE;SOFT;8;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:22:48	HOST ALERT: hostname;UNREACHABLE;SOFT;7;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:21:38	HOST ALERT: hostname;UNREACHABLE;SOFT;6;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:20:28	HOST ALERT: hostname;UNREACHABLE;SOFT;5;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:19:58	HOST ALERT: hostname;UNREACHABLE;SOFT;4;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:18:48	HOST ALERT: hostname;UNREACHABLE;SOFT;3;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:17:38	HOST ALERT: hostname;UNREACHABLE;SOFT;2;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:16:26	HOST ALERT: hostname;UNREACHABLE;SOFT;2;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:15:16	HOST ALERT: hostname;UNREACHABLE;SOFT;5;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:14:26	HOST ALERT: hostname;UNREACHABLE;SOFT;4;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:13:07	HOST ALERT: hostname;UNREACHABLE;SOFT;3;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:11:46	HOST ALERT: hostname;UNREACHABLE;SOFT;2;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:10:33	HOST ALERT: hostname;UNREACHABLE;SOFT;7;CRITICAL - 10.1.1.190: rta nan, lost 100%
Service Critical	2016-12-08 10:10:14	SERVICE ALERT: hostname;Ping;CRITICAL;HARD;5;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:09:34	HOST ALERT: hostname;UNREACHABLE;SOFT;6;CRITICAL - 10.1.1.190: rta nan, lost 100%
Service Critical	2016-12-08 10:09:13	SERVICE ALERT: hostname;Ping;CRITICAL;SOFT;4;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:08:33	HOST ALERT: hostname;UNREACHABLE;SOFT;5;CRITICAL - 10.1.1.190: rta nan, lost 100%
Service Critical	2016-12-08 10:08:14	SERVICE ALERT: hostname;Ping;CRITICAL;SOFT;3;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:07:33	HOST ALERT: hostname;UNREACHABLE;SOFT;4;CRITICAL - 10.1.1.190: rta nan, lost 100%
Service Critical	2016-12-08 10:07:13	SERVICE ALERT: hostname;Ping;CRITICAL;SOFT;2;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:06:33	HOST ALERT: hostname;UNREACHABLE;SOFT;3;CRITICAL - 10.1.1.190: rta nan, lost 100%
Service Critical	2016-12-08 10:06:13	SERVICE ALERT: hostname;Ping;CRITICAL;SOFT;1;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:06:03	HOST ALERT: hostname;UNREACHABLE;SOFT;2;CRITICAL - 10.1.1.190: rta nan, lost 100%
Host Unreachable	2016-12-08 10:04:53	HOST ALERT: hostname;UNREACHABLE;SOFT;1;CRITICAL - 10.1.1.190: rta nan, lost 100%
Information	2016-12-08 00:00:00	CURRENT SERVICE STATE: hostname;Ping;OK;HARD;1;OK - 10.1.1.190: rta 2.505ms, lost 0%
Information	2016-12-07 23:59:59	CURRENT HOST STATE: hostname;UP;HARD;1;OK - 10.1.1.190: rta 2.379ms, lost 0%
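
To make the resets easier to see, here is a minimal sketch that pulls the attempt numbers out of the HOST ALERT lines above and flags every point where the counter does not simply go up by one. The function name `find_attempt_resets` and the regular expression are illustrative assumptions, not part of Nagios; it only assumes the tab-separated Event Log layout shown in this post.

```python
import re

# Matches the alert payload used in the Event Log above, e.g.
# "2016-12-08 10:10:33<TAB>HOST ALERT: hostname;UNREACHABLE;SOFT;7;CRITICAL - ..."
HOST_ALERT = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+HOST ALERT: "
    r"([^;]+);([^;]+);(SOFT|HARD);(\d+);"
)

def find_attempt_resets(lines):
    """Return (timestamp, previous_attempt, attempt) for every problem-state
    host alert whose attempt number is not previous_attempt + 1."""
    events = []
    for line in lines:
        m = HOST_ALERT.search(line)
        if m:
            ts, _host, state, _state_type, attempt = m.groups()
            events.append((ts, state, int(attempt)))
    events.sort()  # the Event Log is newest-first; sort oldest-first

    resets = []
    previous = None  # attempt number of the previous problem-state alert
    for ts, state, attempt in events:
        if state == "UP":
            previous = None  # a recovery legitimately restarts the count
            continue
        if previous is not None and attempt != previous + 1:
            resets.append((ts, previous, attempt))
        previous = attempt
    return resets

# Two lines copied from the log above.
sample = [
    "Host Unreachable\t2016-12-08 10:11:46\tHOST ALERT: hostname;UNREACHABLE;SOFT;2;CRITICAL - 10.1.1.190: rta nan, lost 100%",
    "Host Unreachable\t2016-12-08 10:10:33\tHOST ALERT: hostname;UNREACHABLE;SOFT;7;CRITICAL - 10.1.1.190: rta nan, lost 100%",
]
print(find_attempt_resets(sample))
# [('2016-12-08 10:11:46', 7, 2)]
```

Run over the full log above, this should flag four points with no UP in between: 10:11:46 (7 to 2), 10:16:26 (5 to 2), 10:17:38 (2 to 2) and 10:24:37 (8 to 2).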
Eric Loyd • http://everwatch.global • 844.240.EVER • @EricLoyd
I'm a Nagios Fanatic! • Join our public Nagios Discord Server!
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: SOFT states look like they're resetting without an OK

Post by avandemore »

If you look at the State History report for hostname with States set to Both, what does it show?
Can you also include the host definition for hostname? My best guess at this point is that the object is set to initial_state = on and Retain status information = off, and that Nagios was restarted in the middle of this.
Previous Nagios employee

Re: SOFT states look like they're resetting without an OK

Post by eloyd »

Nagios was not restarted in the middle. This was a test of their notification system, and it failed. They pulled the network cable on a switch and waited to see how long it would take for Nagios to notice. After ~30 minutes, they put the cable back in.

That output was from the Event Log, filtered for that particular date and host, but the State History shows the same thing. I also note that the last and penultimate SOFT CRITICALs for the host (10:30:06 and 10:35:16) were five minutes apart, not one minute apart as per the retry_check_interval you'll see in the host config.

Host definition, taken from a module which parses and displays objects.cache directly, is at the bottom.

Code:

Date / Time     Host    Service State   State Type      Attempt Information
2016-12-08 10:39:46     hostname        Ping    OK      HARD    5 of 5  OK - 10.1.1.190: rta 2.613ms, lost 0%
2016-12-08 10:36:26     hostname                UP      SOFT    9 of 10 OK - 10.1.1.190: rta 2.689ms, lost 0%
2016-12-08 10:35:16     hostname                UNREACHABLE     SOFT    8 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:30:06     hostname                UNREACHABLE     SOFT    7 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:28:56     hostname                UNREACHABLE     SOFT    6 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:27:46     hostname                UNREACHABLE     SOFT    5 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:26:36     hostname                UNREACHABLE     SOFT    4 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:25:26     hostname                UNREACHABLE     SOFT    3 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:24:37     hostname                UNREACHABLE     SOFT    2 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:23:58     hostname                UNREACHABLE     SOFT    8 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:22:48     hostname                UNREACHABLE     SOFT    7 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:21:38     hostname                UNREACHABLE     SOFT    6 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:20:28     hostname                UNREACHABLE     SOFT    5 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:19:58     hostname                UNREACHABLE     SOFT    4 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:18:48     hostname                UNREACHABLE     SOFT    3 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:17:38     hostname                UNREACHABLE     SOFT    2 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:16:26     hostname                UNREACHABLE     SOFT    2 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:15:16     hostname                UNREACHABLE     SOFT    5 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:14:26     hostname                UNREACHABLE     SOFT    4 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:13:07     hostname                UNREACHABLE     SOFT    3 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:11:46     hostname                UNREACHABLE     SOFT    2 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:10:33     hostname                UNREACHABLE     SOFT    7 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:10:14     hostname        Ping    CRITICAL        HARD    5 of 5  CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:09:34     hostname                UNREACHABLE     SOFT    6 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:09:13     hostname        Ping    CRITICAL        SOFT    4 of 5  CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:08:33     hostname                UNREACHABLE     SOFT    5 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:08:14     hostname        Ping    CRITICAL        SOFT    3 of 5  CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:07:33     hostname                UNREACHABLE     SOFT    4 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:07:13     hostname        Ping    CRITICAL        SOFT    2 of 5  CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:06:33     hostname                UNREACHABLE     SOFT    3 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:06:13     hostname        Ping    CRITICAL        SOFT    1 of 5  CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:06:03     hostname                UNREACHABLE     SOFT    2 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
2016-12-08 10:04:53     hostname                UNREACHABLE     SOFT    1 of 10 CRITICAL - 10.1.1.190: rta nan, lost 100%
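
The spacing between host check results can be checked the same way. This is a minimal sketch, assuming only the timestamp format used in the State History above; `check_gaps`, `gaps_over` and the 120-second cutoff (twice the one-minute retry interval) are arbitrary choices made for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def check_gaps(timestamps):
    """Return (earlier, later, seconds) for each pair of consecutive results."""
    parsed = sorted(datetime.strptime(t, FMT) for t in timestamps)
    return [
        (a.strftime(FMT), b.strftime(FMT), int((b - a).total_seconds()))
        for a, b in zip(parsed, parsed[1:])
    ]

def gaps_over(timestamps, limit_seconds):
    """Keep only the gaps longer than limit_seconds."""
    return [gap for gap in check_gaps(timestamps) if gap[2] > limit_seconds]

# Last four host rows from the State History above.
host_rows = [
    "2016-12-08 10:36:26",
    "2016-12-08 10:35:16",
    "2016-12-08 10:30:06",
    "2016-12-08 10:28:56",
]
print(gaps_over(host_rows, 120))
# [('2016-12-08 10:30:06', '2016-12-08 10:35:16', 310)]
```

The neighbouring results are 70 seconds apart, so the 310-second gap looks like a normal check interval slipping in where a retry was expected.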

Code:

Name    Value
Host Name       hostname
Alias/Description       hostname
Address 1.2.3.4
Importance (Host)       0
Importance (Host + Services)    0
Parent Hosts    parent-host
Max. Check Attempts     10
Check Interval  0h 5m 0s
Retry Interval  0h 1m 0s
Host Check Command      check-host-alive!!!!!!!!
Check Period    24x7
Obsess Over     Yes
Enable Active Checks    Yes
Enable Passive Checks   Yes
Check Freshness No
Freshness Threshold     Auto-determined value
Default Contacts/Groups cg-for-this-host
Notification Interval   0h 30m 0s
First Notification Delay        0h 0m 0s
Notification Options    Down, Unreachable, Recovery
Notification Period     24x7
Event Handler
Enable Event Handler    Yes
Stalking Options        None
Enable Flap Detection   Yes
Low Flap Threshold      Program-wide value
High Flap Threshold     Program-wide value
Flap Detection Options  Up, Down, Unreachable
Process Performance Data        Yes
Notes
Notes URL
Action URL
2-D Coords
3-D Coords
Statusmap Image
VRML Image
Logo Image
Image Alt
Retention Options       Status Information, Non-Status Information
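
For reference, here is a rough sketch of when the host should have gone HARD with the settings above. It assumes the attempt counter rises by one per failed check and that every retry is spaced exactly one Retry Interval apart, ignoring scheduling latency and check run time (the retries in the log are closer to 70 seconds apart). `expected_hard_time` is a name made up for this example.

```python
from datetime import datetime, timedelta

def expected_hard_time(first_failure, max_check_attempts, retry_interval):
    """Estimate when the final attempt (the HARD state) would be reached.

    Attempt 1 is the first failed check; each later attempt is one retry.
    """
    return first_failure + (max_check_attempts - 1) * retry_interval

first_failure = datetime(2016, 12, 8, 10, 4, 53)  # first SOFT;1 in the log
print(expected_hard_time(first_failure, 10, timedelta(minutes=1)))
# 2016-12-08 10:13:53
```

Under those assumptions the host would have gone HARD around 10:14, well before the recovery at 10:36:26.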

Re: SOFT states look like they're resetting without an OK

Post by avandemore »

Can you compress and send over /usr/local/nagios/var/nagios.log? PM if it's easier.
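
If the full log is too large to send, something like the following could trim it to the test window first. This is only a sketch: it assumes each nagios.log line starts with a Unix timestamp in square brackets, and `lines_in_window` and `write_compressed` are hypothetical helper names, not Nagios tools.

```python
import gzip
import re

# A nagios.log entry is assumed to look like "[<unix epoch>] <message>".
LOG_LINE = re.compile(r"^\[(\d+)\] ")

def lines_in_window(lines, start_epoch, end_epoch):
    """Return the log lines whose timestamp falls inside the window (inclusive)."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        m = LOG_LINE.match(line)
        if m and start_epoch <= int(m.group(1)) <= end_epoch:
            kept.append(line)
    return kept

def write_compressed(lines, path):
    """Write the selected lines to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
```

The window bounds can be built with `datetime(...).timestamp()` on the Nagios server so they use the same local time zone as the Event Log. Every entry in the window is kept, not just the ones for this host, so that any restart or reload messages stay visible.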

Re: SOFT states look like they're resetting without an OK

Post by eloyd »

Not for a while. I'm on vacation and the customer is a school. No urgency. Will play with it.

Re: SOFT states look like they're resetting without an OK

Post by avandemore »

Sure, I'll leave this open for your convenience.