Flexible Schedule Downtime Fails to Clear

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
GreatWolfResorts
Posts: 48
Joined: Tue Mar 15, 2011 11:12 am
Location: Madison, WI

Post by GreatWolfResorts »

We ran into an issue yesterday and this morning where a server was placed into a flexible scheduled downtime of 1 hour. The downtime kicked in properly when the system was taken down for maintenance. However, after the server came back up and the 1 hour window lapsed, the host remained in a downtime state. This lasted throughout the night and, to my surprise, suppressed legitimate alerts when the server did crash late last night. This morning I ended up walking into a massive fire because we weren't aware of the issue.

The only way I could take it out of downtime state was to manually delete the instance in the schedule downtime section. Any idea what may be going on here?
Nagios XI 5.2.5 | CentOS6.3 x86_64 | Virtual Instance on VMware vSphere 6
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Flexible Schedule Downtime Fails to Clear

Post by mguthrie »

We did apply a patch to Core in 3.2 that could be related to this issue:
From 3.2 CHANGELOG
- Patched Nagios Core bug #338 where schedule downtime would not persist properly upon a restart of Nagios (Carlos Velasco) - MG
Here's the tracker item for the known bug; it's likely related to what we fixed in 3.2.
http://tracker.nagios.org/view.php?id=338

Re: Flexible Schedule Downtime Fails to Clear

Post by GreatWolfResorts »

I'll move forward with the 3.2 patch testing and implementation, and will give a quick update as to the results. Thank you sir!
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Flexible Schedule Downtime Fails to Clear

Post by slansing »

Sounds good, let us know how it works for you.

Re: Flexible Schedule Downtime Fails to Clear

Post by GreatWolfResorts »

We have version 3.2 in place in production, but it does not appear to have resolved the issue for us. We placed a server in a flexible scheduled downtime of 15 minutes. Maintenance on the unit was completed; however, it still appears to be in scheduled downtime status a couple of hours later.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Flexible Schedule Downtime Fails to Clear

Post by scottwilkerson »

Are we sure this was set to "flexible"? I know the default is fixed with a 2 hour time window.

We have been testing this and have only seen the expected behavior.
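
For anyone following along, the difference between the two modes can be sketched roughly as below. This is a simplified model of the documented behavior (a fixed downtime covers exactly the start/end window; a flexible one starts when the host first goes into a problem state inside that window and then runs for the configured duration). It is not Nagios Core's actual code, and the names `Downtime` and `effective_period` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Downtime:
    """Hypothetical, simplified stand-in for a scheduled downtime entry."""
    start: int      # window start, epoch seconds
    end: int        # window end, epoch seconds
    fixed: bool     # True = fixed, False = flexible
    duration: int   # seconds; only meaningful for flexible downtime


def effective_period(dt: Downtime,
                     problem_time: Optional[int]) -> Optional[Tuple[int, int]]:
    """Return (begin, finish) of the period in which alerts are suppressed.

    problem_time is when the host first went down, or None if it never did.
    Returns None when a flexible downtime is never triggered.
    """
    if dt.fixed:
        # Fixed downtime always covers the whole window.
        return (dt.start, dt.end)
    if problem_time is None or not (dt.start <= problem_time <= dt.end):
        # Flexible downtime only triggers on a problem inside the window.
        return None
    # Once triggered, it lasts for `duration`, regardless of the window end.
    return (problem_time, problem_time + dt.duration)
```

Under this model, a flexible 15 minute downtime triggered at 12:42 would be expected to end around 12:57, so a host still flagged as in downtime hours later would not match the expected behavior.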

Re: Flexible Schedule Downtime Fails to Clear

Post by GreatWolfResorts »

That is correct, it was set as flexible (0 hours, 15 minutes). I attached a couple of snapshots of our scheduled host and service downtime screen as well as the hosts in question. Both hosts still show as being in downtime status, a good 26 hours later. Here is the event log as well. The flexible scheduled downtime should show up as a cancel, but we received nothing:

2012-07-18 13:10:07 SERVICE ALERT: GWR-SPWFE-IMM;IMM Health Check;OK;HARD;5;Health status: Normal
2012-07-18 13:08:31 SERVICE ALERT: GWR-SPWFE-IMM;IMM Fan Check;OK;HARD;2;Fan 1A Tach = 38%
2012-07-18 13:07:40 SERVICE ALERT: GWR-SPWFE-IMM;IMM Voltage Check;OK;HARD;1;Planar 3.3V = 3340
2012-07-18 13:04:22 SERVICE ALERT: GWR-SPWFE;Memory Usage;OK;HARD;1;Memory usage: total:24487.46 Mb - used: 1707.80 Mb (7%) - free: 22779.65 Mb (93%)
2012-07-18 13:03:30 SERVICE ALERT: GWR-SPWFE;Drive C: Disk Usage;OK;HARD;2;C: - total: 135.75 Gb - used: 42.31 Gb (31%) - free 93.43 Gb (69%)
2012-07-18 13:02:54 SERVICE ALERT: GWR-SPWFE;SharePoint Services;OK;HARD;1;OK: All services are in their appropriate state.
2012-07-18 13:02:51 SERVICE ALERT: GWR-SPWFE;CPU Usage;OK;HARD;1;CPU Load 0% (5 min average)
2012-07-18 13:00:54 HOST ALERT: GWR-SPWFE;UP;HARD;1;OK - : rta 0.579ms, lost 0%
2012-07-18 13:00:12 HOST ALERT: GWR-SPWFE-IMM;UP;HARD;1;OK - : rta 0.806ms, lost 0%
2012-07-18 12:55:15 SERVICE ALERT: GWR-SPWFE-IMM;IMM Health Check;UNKNOWN;HARD;5;Health status: Unknown
2012-07-18 12:55:09 HOST ALERT: GWR-SPWFE-IMM;DOWN;HARD;5;CRITICAL - : rta nan, lost 100%
2012-07-18 12:54:12 HOST ALERT: GWR-SPWFE-IMM;DOWN;SOFT;4;CRITICAL - : rta nan, lost 100%
2012-07-18 12:53:57 HOST ALERT: GWR-SPWFE-IMM;DOWN;SOFT;3;CRITICAL - : rta nan, lost 100%
2012-07-18 12:53:33 SERVICE ALERT: GWR-SPWFE-IMM;IMM Fan Check;UNKNOWN;HARD;2;No fans
2012-07-18 12:53:00 HOST ALERT: GWR-SPWFE-IMM;DOWN;SOFT;2;CRITICAL - : rta nan, lost 100%
2012-07-18 12:52:48 SERVICE ALERT: GWR-SPWFE-IMM;IMM Voltage Check;UNKNOWN;HARD;1;No voltages
2012-07-18 12:52:45 HOST DOWNTIME ALERT: GWR-SPWFE-IMM;STARTED; Host has entered a period of scheduled downtime
2012-07-18 12:52:45 HOST ALERT: GWR-SPWFE-IMM;DOWN;SOFT;1;CRITICAL - : rta nan, lost 100%
2012-07-18 12:52:33 SERVICE ALERT: GWR-SPWFE-IMM;IMM Fan Check;UNKNOWN;SOFT;1;No fans
2012-07-18 12:45:42 HOST ALERT: GWR-SPWFE;DOWN;HARD;5;CRITICAL - : Host unreachable @ . rta nan, lost 100%
2012-07-18 12:44:51 HOST ALERT: GWR-SPWFE;DOWN;SOFT;4;CRITICAL - : Host unreachable @ . rta nan, lost 100%
2012-07-18 12:44:39 HOST ALERT: GWR-SPWFE;DOWN;SOFT;3;CRITICAL - : Host unreachable @ . rta nan, lost 100%
2012-07-18 12:44:24 SERVICE ALERT: GWR-SPWFE;Memory Usage;CRITICAL;HARD;1;No route to host
2012-07-18 12:43:45 HOST ALERT: GWR-SPWFE;DOWN;SOFT;2;CRITICAL - : Host unreachable @ . rta nan, lost 100%
2012-07-18 12:43:30 SERVICE ALERT: GWR-SPWFE;Drive C: Disk Usage;CRITICAL;HARD;2;No route to host
2012-07-18 12:42:54 SERVICE ALERT: GWR-SPWFE;SharePoint Services;CRITICAL;HARD;1;No route to host
2012-07-18 12:42:54 SERVICE ALERT: GWR-SPWFE;CPU Usage;CRITICAL;HARD;1;No route to host
2012-07-18 12:42:39 HOST DOWNTIME ALERT: GWR-SPWFE;STARTED; Host has entered a period of scheduled downtime
2012-07-18 12:42:39 HOST ALERT: GWR-SPWFE;DOWN;SOFT;1;CRITICAL - : Host unreachable @ . rta nan, lost 100%
2012-07-18 12:42:33 SERVICE ALERT: GWR-SPWFE;Drive C: Disk Usage;CRITICAL;SOFT;1;No route to host
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE-IMM;IMM Voltage Check;OK;HARD;1;Planar 3.3V = 3340
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE-IMM;IMM Temperature Check;OK;HARD;1;Ambient Temp = 21
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE-IMM;IMM Health Check;CRITICAL;HARD;5;Health status: System level error
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE-IMM;IMM Fan Check;OK;HARD;1;Fan 1A Tach = 39%
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE;SharePoint Services;OK;HARD;1;OK: All services are in their appropriate state.
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE;Memory Usage;OK;HARD;1;Memory usage: total:24487.46 Mb - used: 3554.03 Mb (15%) - free: 20933.42 Mb (85%)
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE;Drive C: Disk Usage;OK;HARD;1;C: - total: 135.75 Gb - used: 42.29 Gb (31%) - free 93.46 Gb (69%)
2012-07-18 00:00:00 CURRENT SERVICE STATE: GWR-SPWFE;CPU Usage;OK;HARD;1;CPU Load 0% (5 min average)
2012-07-18 00:00:00 CURRENT HOST STATE: GWR-SPWFE-IMM;UP;HARD;1;OK - : rta 1.045ms, lost 0%
2012-07-18 00:00:00 CURRENT HOST STATE: GWR-SPWFE;UP;HARD;1;OK - : rta 0.579ms, lost 0%
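
A quick way to spot this pattern in a log dump like the one above is to pair each `HOST DOWNTIME ALERT` `STARTED` entry with a later `STOPPED` or `CANCELLED` entry for the same host. The sketch below is only an illustration: `open_host_downtimes` is a made-up helper name, and the regex assumes the timestamp layout shown in this post (with or without a space before the entry type).

```python
import re
from typing import Dict, Iterable

# Matches e.g. "2012-07-18 12:42:39HOST DOWNTIME ALERT: GWR-SPWFE;STARTED; ..."
DOWNTIME_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*"
    r"HOST DOWNTIME ALERT: ([^;]+);(STARTED|STOPPED|CANCELLED);"
)


def open_host_downtimes(lines: Iterable[str]) -> Dict[str, str]:
    """Map each host to the timestamp of a downtime start that never ended."""
    events = []
    for line in lines:
        match = DOWNTIME_RE.match(line.strip())
        if match:
            events.append(match.groups())  # (timestamp, host, action)

    open_since: Dict[str, str] = {}
    # The log is listed newest first, so sort into chronological order.
    for stamp, host, action in sorted(events):
        if action == "STARTED":
            open_since[host] = stamp
        else:
            # STOPPED or CANCELLED closes the downtime for that host.
            open_since.pop(host, None)
    return open_since
```

Run against the entries above, this would report both GWR-SPWFE and GWR-SPWFE-IMM as started with no matching stop or cancel.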
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Flexible Schedule Downtime Fails to Clear

Post by lmiltchev »

I was looking at your screenshots yesterday, and I noticed that in fact the scheduled host & service downtime was "Fixed", not "Flexible", and it was "in the future". Please, see the dates:

Host downtime: 07-25-2012
Service downtime: 07-20-2012

I am not sure what is going on. We were not able to recreate this issue. Could you please delete the scheduled host & service downtime and add a new (valid) schedule?
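
If it helps to rule out the web interface while testing, the same delete and re-add can be expressed as Nagios Core external commands. The sketch below only builds the command strings using the documented `SCHEDULE_HOST_DOWNTIME` and `DEL_HOST_DOWNTIME` formats; the function names are made up for illustration, and the author, comment, and downtime ID values are placeholders. Writing the resulting line to the Nagios command file is left out, since that path depends on the installation.

```python
import time
from typing import Optional


def schedule_host_downtime_cmd(host: str, start: int, end: int, duration: int,
                               author: str, comment: str,
                               fixed: bool = False, trigger_id: int = 0,
                               now: Optional[int] = None) -> str:
    """Build a SCHEDULE_HOST_DOWNTIME external command line.

    start/end are epoch seconds; duration (seconds) applies when fixed=False.
    """
    stamp = int(time.time()) if now is None else now
    return (f"[{stamp}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
            f"{int(fixed)};{trigger_id};{duration};{author};{comment}")


def del_host_downtime_cmd(downtime_id: int, now: Optional[int] = None) -> str:
    """Build a DEL_HOST_DOWNTIME external command line for a downtime ID."""
    stamp = int(time.time()) if now is None else now
    return f"[{stamp}] DEL_HOST_DOWNTIME;{downtime_id}"
```

For example, a flexible 15 minute downtime inside a one hour window would use `fixed=False`, `duration=900`, and an `end` that is 3600 seconds after `start`.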

Re: Flexible Schedule Downtime Fails to Clear

Post by GreatWolfResorts »

Hosts "GWR-SPWFE" and "GWR-SPWFE-IMM" were actually the devices in question. That's why I included the screenshot listing them as being in downtime status. I included the screenshot of the scheduled downtime screen to show that, although both GWR-SPWFE hosts are still in a scheduled downtime state, they were not displayed in the scheduled downtime screen. The only two displayed there were GWR-HELPDESK and VA-SPA's OpenCourse service check. Both of those checks are recurring and fixed, so they should be displaying future dates. Consider those items irrelevant to the situation; however, I will remove the recurring downtime items, test a flexible scheduled downtime, and report back here when complete.

Thanks!

Re: Flexible Schedule Downtime Fails to Clear

Post by lmiltchev »

Sounds good. Let us know how it went. Meanwhile, we will do some more digging and testing. Thanks!