Hi all,
This weekend I had a group of hosts going down for maintenance.
I added them to a temporary hostgroup and dropped a downtime schedule on that group.
After rebooting, some hosts and some services (not all related to each other) did not return to an ok state.
The downtime period expired, so I expected that Nagios would send out notifications of these non-ok states. It didn't.
Our techs had to find out by looking at the XI operations centre that some services were still not OK.
By default, all hosts have Notifications disabled. This was mentioned by some as the possible culprit.
Looking through manuals, I cannot find anything to confirm this statement.
Note: My setup for a scheduled downtime for a host is adding schedules for both the host and its services, so basically there are two schedule periods active.
I read that scheduling only the host is enough to also hush its associated services. Is that correct? Could that have anything to do with it?
notification troubles after downtime end.
Re: notification troubles after downtime end.
Here is a good writeup in the online Core documentation for notifications:
http://nagios.sourceforge.net/docs/nagi ... tions.html
From one of your points regarding the service checks:
As a side note, notifications for services are suppressed if the host they're associated with is in a period of scheduled downtime.
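To make the two scheduling styles concrete, here is a minimal sketch of the external command lines that would put a whole hostgroup into fixed downtime. The function name `downtime_commands`, the hostgroup name and the author are made up for illustration. The command names and argument order are the standard Core external commands as I understand them, so please check them against your own version. Given the side note above, the second line (services) should be redundant for notification purposes, but it mirrors the setup described in the original post.

```python
import time


def downtime_commands(hostgroup, start, end, author, comment, now=None):
    """Build external-command lines for fixed downtime on a hostgroup.

    Returns one line for the hosts and one for their services.
    All names passed in are illustrative; nothing is sent to Nagios here.
    """
    now = int(time.time()) if now is None else int(now)
    start, end = int(start), int(end)
    duration = end - start
    # Argument order: name;start;end;fixed;trigger_id;duration;author;comment
    args = f"{hostgroup};{start};{end};1;0;{duration};{author};{comment}"
    return [
        f"[{now}] SCHEDULE_HOSTGROUP_HOST_DOWNTIME;{args}",
        f"[{now}] SCHEDULE_HOSTGROUP_SVC_DOWNTIME;{args}",
    ]


if __name__ == "__main__":
    begin = int(time.time())
    # Hypothetical two-hour maintenance window for a temporary hostgroup.
    for line in downtime_commands("tmp-maintenance", begin, begin + 7200,
                                  "nagiosadmin", "Weekend maintenance"):
        print(line)
```

Each printed line would then be appended to the Nagios command file (often `/usr/local/nagios/var/rw/nagios.cmd`, but that path is an assumption about your install).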
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
Re: notification troubles after downtime end.
MichielvM wrote: After rebooting, some hosts and some services (not all related to each other) did not return to an ok state.
The downtime period expired, so I expected that Nagios would send out notifications of these non-ok states. It didn't.
If these host and service objects were acknowledged, then no notifications will be sent from that point on. This is regardless of whether a downtime period is currently in effect.
Also, if you have host and service escalations defined that only run for x amount of notifications, once these pass then no more notifications will be sent.
I hope some of this helps.
- Fred Kroeger
- Posts: 588
- Joined: Wed Oct 19, 2011 11:36 pm
- Location: Perth, Western Australia
Re: notification troubles after downtime end.
I have had the same issue where we schedule downtime for server patching. Unfortunately, if a server is still down at the end of the downtime because it didn't restart, Nagios *does not* send a notification.
I have logged a case for this as it now means that I can't rely on Nagios for a lights out operation. Someone needs to check the Nagios screen after the patching to ensure that all servers came up again.
Unfortunately, this doesn't seem to be a priority as there has been no response to this since I logged this 2 months ago.
http://tracker.nagios.org/view.php?id=660
Fred
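Until the underlying behaviour is sorted out, the manual "check the screen after patching" step could in principle be scripted. The sketch below is only an illustration, not a fix: it parses text in the layout of Core's `status.dat` and lists hosts and services that are in a non-OK state, not in downtime and not acknowledged. The field names (`current_state`, `scheduled_downtime_depth`, `problem_has_been_acknowledged`) are the ones I believe the status file uses; the function names and the sample data are invented for the example.

```python
SAMPLE = """
hoststatus {
    host_name=web1
    current_state=1
    scheduled_downtime_depth=0
    problem_has_been_acknowledged=0
    }
hoststatus {
    host_name=web2
    current_state=0
    scheduled_downtime_depth=0
    problem_has_been_acknowledged=0
    }
servicestatus {
    host_name=web2
    service_description=HTTP
    current_state=2
    scheduled_downtime_depth=0
    problem_has_been_acknowledged=0
    }
servicestatus {
    host_name=web2
    service_description=SSH
    current_state=2
    scheduled_downtime_depth=1
    problem_has_been_acknowledged=0
    }
"""


def parse_status(text):
    """Yield (block_type, fields) for each hoststatus/servicestatus block."""
    kind, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("{") and "=" not in line:
            # Start of a block such as "hoststatus {"
            kind, fields = line[:-1].strip(), {}
        elif line == "}":
            if kind in ("hoststatus", "servicestatus"):
                yield kind, fields
            kind = None
        elif kind and "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value


def unhandled_problems(text):
    """Return names of non-OK objects that are neither in downtime nor acknowledged."""
    problems = []
    for kind, f in parse_status(text):
        if f.get("current_state", "0") == "0":
            continue  # UP / OK
        if f.get("scheduled_downtime_depth", "0") != "0":
            continue  # still inside a downtime window
        if f.get("problem_has_been_acknowledged", "0") != "0":
            continue  # somebody already acknowledged it
        name = f.get("host_name", "?")
        if kind == "servicestatus":
            name += "/" + f.get("service_description", "?")
        problems.append(name)
    return problems


if __name__ == "__main__":
    print(unhandled_problems(SAMPLE))
```

In real use the text would come from the status file itself (commonly `/usr/local/nagios/var/status.dat`, which is an assumption about the install path), run shortly after the downtime window ends.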
Re: notification troubles after downtime end.
Fred Kroeger wrote: I have had the same issue where we schedule downtime for server patching. Unfortunately, if a server is still down at the end of the downtime because it didn't restart, Nagios *does not* send a notification.
I have logged a case for this as it now means that I can't rely on Nagios for a lights out operation. Someone needs to check the Nagios screen after the patching to ensure that all servers came up again.
Unfortunately, this doesn't seem to be a priority as there has been no response to this since I logged this 2 months ago.
http://tracker.nagios.org/view.php?id=660
Fred
Seems logical to me that if a host does not check OK after downtime ends, there should be bells and whistles all over the damn place.
Hint to Nagios: prioritize Fred's case!
Re: notification troubles after downtime end.
What XI and Core versions are you on? I tried to replicate but could not. Here was my setup:
1.) Create a dummy service attached to localhost that does check_dummy 0 with flapping off, notifications on a 5-minute repeat, and checks on 1 minute for both OK and non-OK states
2.) Force some checks to get a history going
3.) Change it to check_dummy 2 to produce a critical state
4.) Force more checks, receive critical email
5.) Schedule 5-minute downtime
6.) Force even more checks during that downtime, do not receive email
7.) Downtime ends, within 5 minutes I have an email notifying me of a critical state
Former Nagios employee
- jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: notification troubles after downtime end.
I was able to reproduce the issue in XI; however, I was working with a host, not a service. One potential workaround would be to enable notifications for scheduled downtime events, as those will show you the status of the host at both the beginning and the end of the downtime, so you can react to any hosts that remain down. This is far from ideal though, I would agree... especially if you have 1000 hosts affected by the downtime but only 1 doesn't recover.
I will bring this up with the developers and see if there is some underlying logic that we're overlooking. If they can't provide any we'll indicate the severity of the bug to them and hopefully it gets pushed up the list.
- Fred Kroeger
- Posts: 588
- Joined: Wed Oct 19, 2011 11:36 pm
- Location: Perth, Western Australia
Re: notification troubles after downtime end.
All notification intervals are set to 0, so only 1 notification is sent out. We send an email to a ticketing system, so sending more than 1 notification is not an option.
Regards Fred
Re: notification troubles after downtime end.
Core is: 4.0.8
XI is: 2014R2.6
I'm gonna set up a sandbox to play with and post results back here. In the meantime I would appreciate it if your development team could shed some light on this.
I really have no clue what I could have overlooked when scheduling this downtime.
The host/services that failed had no active acknowledgements.
One point to add: the history graphs show nothing between the end of the downtime and the time our tech dept. fixed the host.
Re: notification troubles after downtime end.
In my OP I mentioned that by default all hosts have a no-notify profile. We only react to service checks.
Is it possible that this has something to do with it?