Flexible Schedule Downtime Fails to Clear
Posted: Thu Jul 12, 2012 4:58 pm
by GreatWolfResorts
We ran into an issue yesterday and this morning where a server was placed into flexible scheduled downtime for 1 hour of maintenance. The downtime kicked in properly when the system was taken down for maintenance. However, after the server came back up and the 1-hour window lapsed, the host remained in a downtime state. This lasted throughout the night and, to my surprise, suppressed legitimate alerts when the server did crash late last night. This morning I ended up walking into a massive fire because we weren't aware of the issue.
The only way I could take it out of downtime state was to manually delete the instance in the schedule downtime section. Any idea what may be going on here?
Re: Flexible Schedule Downtime Fails to Clear
Posted: Fri Jul 13, 2012 9:45 am
by mguthrie
We did apply a patch to Core in 3.2 that could be related to this issue:
From 3.2 CHANGELOG
- Patched Nagios Core bug #338 where schedule downtime would not persist properly upon a restart of Nagios (Carlos Velasco) - MG
Here's the tracker item for the known bug. It's likely that what you are seeing is related to what we fixed in 3.2.
http://tracker.nagios.org/view.php?id=338
Re: Flexible Schedule Downtime Fails to Clear
Posted: Tue Jul 17, 2012 9:17 am
by GreatWolfResorts
I'll move forward with the 3.2 patch testing and implementation, and will give a quick update as to the results. Thank you sir!
Re: Flexible Schedule Downtime Fails to Clear
Posted: Tue Jul 17, 2012 9:19 am
by slansing
Sounds good, let us know how it works for you.
Re: Flexible Schedule Downtime Fails to Clear
Posted: Wed Jul 18, 2012 2:35 pm
by GreatWolfResorts
We have version 3.2 in place in production. It appears this did not resolve the issue for us. We placed a server in flexible scheduled downtime with a 15-minute duration. Maintenance was completed on the unit; however, it still appears to be in scheduled downtime status a couple of hours later.
Re: Flexible Schedule Downtime Fails to Clear
Posted: Thu Jul 19, 2012 2:28 pm
by scottwilkerson
Are we sure this was set to "flexible"? I know the default is fixed with a 2 hour time window.
We have been testing this and have only seen expected behavior.
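For anyone comparing the two modes: in Nagios Core the fixed/flexible choice is a single field of the `SCHEDULE_HOST_DOWNTIME` external command (`fixed=1` or `fixed=0`). With flexible downtime, the start and end times only define the window in which the downtime may begin, and the `duration` field sets how long it lasts once the host goes down. Below is a minimal Python sketch that builds such a command line. The timestamps, author, and comment are made-up illustration values, and the sketch deliberately does not write to the command file, since its path varies by install.

```python
import time

def flexible_host_downtime(host, window_start, window_end, duration_s,
                           author, comment, now=None):
    """Build a SCHEDULE_HOST_DOWNTIME external command line for flexible downtime.

    window_start / window_end are Unix timestamps bounding when the downtime
    may begin; duration_s is how long it lasts once it has been triggered.
    """
    now = int(time.time()) if now is None else now
    fixed = 0        # 0 = flexible, 1 = fixed
    trigger_id = 0   # 0 = not triggered by another downtime entry
    return (f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{window_start};{window_end};"
            f"{fixed};{trigger_id};{duration_s};{author};{comment}")

# Hypothetical values: a 2-hour window with a 15-minute flexible duration.
cmd = flexible_host_downtime("GWR-SPWFE", 1342633200, 1342640400, 15 * 60,
                             "admin", "maintenance", now=1342633000)
print(cmd)
```

Checking the fifth field of a submitted command (0 versus 1) is a quick way to confirm which mode was actually requested.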
Re: Flexible Schedule Downtime Fails to Clear
Posted: Thu Jul 19, 2012 4:35 pm
by GreatWolfResorts
That is correct, it was set as flexible (0 hours, 15 minutes). I attached a couple of snapshots of our scheduled host and service downtime screen, as well as of the hosts in question. Both hosts show that they are still in a downtime status, and that's a good 26 hours later. Here is the event log as well. The end of the flexible scheduled downtime should show up as a cancel entry, but we received nothing:
2012-07-18 13:10:07 SERVICE ALERT: GWR-SPWFE-IMM;IMM Health Check;OK;HARD;5;Health status: Normal
2012-07-18 13:08:31 SERVICE ALERT: GWR-SPWFE-IMM;IMM Fan Check;OK;HARD;2;Fan 1A Tach = 38%
2012-07-18 13:07:40 SERVICE ALERT: GWR-SPWFE-IMM;IMM Voltage Check;OK;HARD;1;Planar 3.3V = 3340
2012-07-18 13:04:22 SERVICE ALERT: GWR-SPWFE;Memory Usage;OK;HARD;1;Memory usage: total:24487.46 Mb - used: 1707.80 Mb (7%) - free: 22779.65 Mb (93%)
2012-07-18 13:03:30 SERVICE ALERT: GWR-SPWFE;Drive C: Disk Usage;OK;HARD;2;C: - total: 135.75 Gb - used: 42.31 Gb (31%) - free 93.43 Gb (69%)
2012-07-18 13:02:54 SERVICE ALERT: GWR-SPWFE;SharePoint Services;OK;HARD;1;OK: All services are in their appropriate state.
2012-07-18 13:02:51 SERVICE ALERT: GWR-SPWFE;CPU Usage;OK;HARD;1;CPU Load 0% (5 min average)
2012-07-18 13:00:54 HOST ALERT: GWR-SPWFE;UP;HARD;1;OK - : rta 0.579ms, lost 0%
2012-07-18 13:00:12 HOST ALERT: GWR-SPWFE-IMM;UP;HARD;1;OK - : rta 0.806ms, lost 0%
2012-07-18 12:55:15 SERVICE ALERT: GWR-SPWFE-IMM;IMM Health Check;UNKNOWN;HARD;5;Health status: Unknown
2012-07-18 12:55:09 HOST ALERT: GWR-SPWFE-IMM;DOWN;HARD;5;CRITICAL - : rta nan, lost 100%
2012-07-18 12:54:12 HOST ALERT: GWR-SPWFE-IMM;DOWN;SOFT;4;CRITICAL - : rta nan, lost 100%
2012-07-18 12:53:57 HOST ALERT: GWR-SPWFE-IMM;DOWN;SOFT;3;CRITICAL - : rta nan, lost 100%
2012-07-18 12:53:33 SERVICE ALERT: GWR-SPWFE-IMM;IMM Fan Check;UNKNOWN;HARD;2;No fans
2012-07-18 12:53:00 HOST ALERT: GWR-SPWFE-IMM;DOWN;SOFT;2;CRITICAL - : rta nan, lost 100%
2012-07-18 12:52:48 SERVICE ALERT: GWR-SPWFE-IMM;IMM Voltage Check;UNKNOWN;HARD;1;No voltages
2012-07-18 12:52:45 HOST DOWNTIME ALERT: GWR-SPWFE-IMM;STARTED; Host has entered a period of scheduled downtime
2012-07-18 12:52:45 HOST ALERT: GWR-SPWFE-IMM;DOWN;SOFT;1;CRITICAL - : rta nan, lost 100%
2012-07-18 12:52:33 SERVICE ALERT: GWR-SPWFE-IMM;IMM Fan Check;UNKNOWN;SOFT;1;No fans
2012-07-18 12:45:42 HOST ALERT: GWR-SPWFE;DOWN;HARD;5;CRITICAL - : Host unreachable @ . rta nan, lost 100%
2012-07-18 12:44:51 HOST ALERT: GWR-SPWFE;DOWN;SOFT;4;CRITICAL - : Host unreachable @ . rta nan, lost 100%
2012-07-18 12:44:39 HOST ALERT: GWR-SPWFE;DOWN;SOFT;3;CRITICAL - : Host unreachable @ . rta nan, lost 100%
2012-07-18 12:44:24 SERVICE ALERT: GWR-SPWFE;Memory Usage;CRITICAL;HARD;1;No route to host
2012-07-18 12:43:45 HOST ALERT: GWR-SPWFE;DOWN;SOFT;2;CRITICAL - : Host unreachable @ . rta nan, lost 100%
2012-07-18 12:43:30 SERVICE ALERT: GWR-SPWFE;Drive C: Disk Usage;CRITICAL;HARD;2;No route to host
2012-07-18 12:42:54 SERVICE ALERT: GWR-SPWFE;SharePoint Services;CRITICAL;HARD;1;No route to host
2012-07-18 12:42:54 SERVICE ALERT: GWR-SPWFE;CPU Usage;CRITICAL;HARD;1;No route to host
2012-07-18 12:42:39 HOST DOWNTIME ALERT: GWR-SPWFE;STARTED; Host has entered a period of scheduled downtime
2012-07-18 12:42:39 HOST ALERT: GWR-SPWFE;DOWN;SOFT;1;CRITICAL - : Host unreachable @ . rta nan, lost 100%
2012-07-18 12:42:33 SERVICE ALERT: GWR-SPWFE;Drive C: Disk Usage;CRITICAL;SOFT;1;No route to host
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE-IMM;IMM Voltage Check;OK;HARD;1;Planar 3.3V = 3340
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE-IMM;IMM Temperature Check;OK;HARD;1;Ambient Temp = 21
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE-IMM;IMM Health Check;CRITICAL;HARD;5;Health status: System level error
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE-IMM;IMM Fan Check;OK;HARD;1;Fan 1A Tach = 39%
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE;SharePoint Services;OK;HARD;1;OK: All services are in their appropriate state.
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE;Memory Usage;OK;HARD;1;Memory usage: total:24487.46 Mb - used: 3554.03 Mb (15%) - free: 20933.42 Mb (85%)
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE;Drive C: Disk Usage;OK;HARD;1;C: - total: 135.75 Gb - used: 42.29 Gb (31%) - free 93.46 Gb (69%)
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE;CPU Usage;OK;HARD;1;CPU Load 0% (5 min average)
2012-07-18 00:00:00 CURRENT HOST STATE: GWR-SPWFE-IMM;UP;HARD;1;OK - : rta 1.045ms, lost 0%
2012-07-18 00:00:00 CURRENT HOST STATE: GWR-SPWFE;UP;HARD;1;OK - : rta 0.806ms, lost 0%
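The pattern in the log above (a `HOST DOWNTIME ALERT ... STARTED` entry with no later `STOPPED` or `CANCELLED` entry for the same host) can be checked mechanically. The following is a rough Python sketch, not an official tool: it assumes log lines shaped like the export above (timestamp first, with or without a space before the event type) and simply reports hosts whose most recent downtime start was never closed. The three sample lines are copied from the log; the function name is made up for illustration.

```python
import re

# Timestamp, optional whitespace, then a host downtime event with its action.
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*"
    r"HOST DOWNTIME ALERT: ([^;]+);(STARTED|STOPPED|CANCELLED);"
)

def open_host_downtimes(lines):
    """Return {host: start_timestamp} for downtimes that started but never ended."""
    events = [m.groups() for m in map(PATTERN.match, lines) if m]
    # The export is newest-first; these timestamps sort correctly as strings.
    events.sort(key=lambda event: event[0])
    open_since = {}
    for stamp, host, action in events:
        if action == "STARTED":
            open_since.setdefault(host, stamp)
        else:  # STOPPED or CANCELLED closes the downtime
            open_since.pop(host, None)
    return open_since

sample = [
    "2012-07-18 13:00:54HOST ALERT: GWR-SPWFE;UP;HARD;1;OK - : rta 0.579ms, lost 0%",
    "2012-07-18 12:52:45HOST DOWNTIME ALERT: GWR-SPWFE-IMM;STARTED; Host has entered a period of scheduled downtime",
    "2012-07-18 12:42:39HOST DOWNTIME ALERT: GWR-SPWFE;STARTED; Host has entered a period of scheduled downtime",
]
result = open_host_downtimes(sample)
for host, stamp in sorted(result.items()):
    print(host, "in downtime since", stamp)
```

Run against the full log, both hosts would be listed, which matches what the status screen shows.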
Re: Flexible Schedule Downtime Fails to Clear
Posted: Fri Jul 20, 2012 10:05 am
by lmiltchev
I was looking at your screenshots yesterday, and I noticed that the scheduled host & service downtime was in fact "Fixed", not "Flexible", and it was "in the future". Please see the dates:
Host downtime: 07-25-2012
Service downtime: 07-20-2012
I am not sure what is going on. We were not able to recreate this issue. Could you please delete the scheduled host & service downtime and add a new (valid) schedule?
Re: Flexible Schedule Downtime Fails to Clear
Posted: Tue Jul 24, 2012 10:13 am
by GreatWolfResorts
Hosts "GWR-SPWFE" and "GWR-SPWFE-IMM" were actually the devices in question. That's why I included the screenshot listing them as being in downtime status. I included the screenshot of the scheduled downtime screen to show that although both GWR-SPWFE hosts are still in a scheduled downtime state, they were not displayed in the scheduled downtime screen. The only two displayed there were GWR-HELPDESK and VA-SPA's OpenCourse service check. Both of those are recurring and fixed, so they should be displaying future dates. Consider those items irrelevant to the situation. However, I will remove the recurring downtime items, test a flexible scheduled downtime, and report back here when complete.
Thanks!
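The mismatch described above (hosts flagged as in downtime with no matching entry on the scheduled downtime screen) could also be looked for in the Nagios status file. This is only a sketch under assumptions: it assumes the status file uses `hoststatus { ... }` blocks carrying `scheduled_downtime_depth` and `hostdowntime { ... }` blocks for scheduled entries, and the sample text is hand-written for illustration, not taken from our system. The function names are hypothetical.

```python
def parse_blocks(text):
    """Yield (block_type, fields) for each 'type { key=value ... }' block."""
    kind, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("{"):
            kind, fields = line[:-1].strip(), {}
        elif line == "}" and kind is not None:
            yield kind, fields
            kind = None
        elif kind is not None and "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value

def orphaned_downtime_hosts(text):
    """Hosts whose downtime depth is above zero but that have no downtime entry."""
    flagged, scheduled = set(), set()
    for kind, fields in parse_blocks(text):
        if kind == "hoststatus":
            if int(fields.get("scheduled_downtime_depth", "0")) > 0:
                flagged.add(fields["host_name"])
        elif kind == "hostdowntime":
            scheduled.add(fields["host_name"])
    return sorted(flagged - scheduled)

# Hand-written sample in the assumed format (not real output).
sample_status = """
hoststatus {
    host_name=GWR-SPWFE
    scheduled_downtime_depth=1
    }
hoststatus {
    host_name=GWR-SPWFE-IMM
    scheduled_downtime_depth=1
    }
hoststatus {
    host_name=GWR-HELPDESK
    scheduled_downtime_depth=0
    }
hostdowntime {
    host_name=GWR-HELPDESK
    downtime_id=12
    }
"""
orphans = orphaned_downtime_hosts(sample_status)
print(orphans)
```

Any host this reports would be in the same stuck state as the two GWR-SPWFE hosts: still suppressing alerts, with nothing left on the downtime screen to delete.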
Re: Flexible Schedule Downtime Fails to Clear
Posted: Tue Jul 24, 2012 10:57 am
by lmiltchev
Sounds good. Let us know how it went. Meanwhile, we will do some more digging and testing. Thanks!