Critical notifications not being sent from XI
Posted: Thu May 30, 2019 5:08 pm
by rferebee
Hello,
We experienced an issue this month with a service check for one of our Windows servers. There is a service check configured for the C: drive which is set to go Critical when the drive space used exceeds 56.9GB. The service check went Critical May 1st at 22:30:27 and proceeded to send out a notification to the contact group assigned to it.
The issue is that the service remained Critical for 10 days, and it is configured to send an alert every 24 hours to the same contact group. For whatever reason, during the days the service was Critical, the 24-hour notifications went only to the Nagios Admin contact group.
When the status changed on May 11th back to Warning, there was no notification sent to the contact group. The status then went back to Critical on May 19th and remained there for 3 days and no notifications were sent.
The server ended up cratering because no one was made aware of the issue via Nagios notifications. My superiors would like me to figure out why this happened to prevent it from happening in the future. Please see attached notification log and the graphical representation of the service status. I can also provide a System Profile if necessary.
Thank you.
Re: Critical notifications not being sent from XI
Posted: Fri May 31, 2019 10:33 am
by cdienger
Please PM me a profile (Admin > System Config > System Profile > Download Profile) along with May's logs found in /usr/local/nagios/var/archives/. Please compress them if they are not already.
What is the name of the other contact group it should have notified?
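For anyone gathering the same data, here is a minimal sketch of one way to bundle a month of archived logs into a single compressed file. It assumes the archives follow the nagios-MM-DD-YYYY-HH.log naming pattern; the function name and the output filename are hypothetical, so adjust them to match what is actually in the directory.

```python
import tarfile
from pathlib import Path


def compress_month_logs(archive_dir, month, year, out_path):
    """Bundle one month's archived Nagios logs into a gzip-compressed tarball.

    Returns the names of the files that were added, in sorted order.
    """
    # Assumption: archived logs are named nagios-MM-DD-YYYY-HH.log
    pattern = f"nagios-{month:02d}-*-{year}-*.log"
    log_files = sorted(Path(archive_dir).glob(pattern))

    with tarfile.open(out_path, "w:gz") as tar:
        for log_file in log_files:
            # Store only the file name, not the full path, inside the tarball
            tar.add(log_file, arcname=log_file.name)

    return [log_file.name for log_file in log_files]


# Example call (hypothetical output name):
# compress_month_logs("/usr/local/nagios/var/archives", 5, 2019,
#                     "nagios-may-2019-logs.tar.gz")
```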
Re: Critical notifications not being sent from XI
Posted: Fri May 31, 2019 4:00 pm
by cdienger
Data received and being reviewed. Are you certain about the dates and do you have emails from those days that can be provided? I'm just trying to line things up as best as possible and what I'm seeing is:
-The service is only configured to send recovery and critical alerts and not warnings.
-The service is currently configured to be escalated. Its first escalated notification is the 5th notification sent.
-The service is currently configured to escalate a second time. The second escalation occurs at the 9th notification.
-The escalations are configured to notify a select group (probably the Admin group you refer to).
This should help explain some of the behavior, but I do still see some behavior I can't quite explain (a gap between the 10th and 22nd and a "custom" dispatcher).
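To illustrate how escalation ranges can change who gets notified, here is a rough sketch. It assumes that once a notification number falls inside an escalation's range, the escalation's contact groups replace the service's own groups unless they are listed in the escalation as well. The group names and the last_notification values are hypothetical, since neither is stated in this thread, and this is not meant to reproduce the exact behavior seen in the logs.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    first_notification: int
    last_notification: int  # 0 is treated as "no upper bound"
    contact_groups: tuple


def recipients(notification_number, base_groups, escalations):
    """Return the set of contact groups notified for a given notification number."""
    matching = [
        e for e in escalations
        if e.first_notification <= notification_number
        and (e.last_notification == 0 or notification_number <= e.last_notification)
    ]
    if not matching:
        # No escalation applies, so the service's own contact groups are used
        return set(base_groups)
    # Assumption: escalation groups replace the base groups rather than add to them
    groups = set()
    for escalation in matching:
        groups.update(escalation.contact_groups)
    return groups


# Hypothetical setup: escalations starting at the 5th and 9th notifications
base = ("service_contacts",)
escalations = [
    Escalation(5, 8, ("nagios_admins",)),  # upper bound of 8 is assumed
    Escalation(9, 0, ("nagios_admins",)),  # open-ended range is assumed
]

for number in range(1, 11):
    print(number, sorted(recipients(number, base, escalations)))
```

Under these assumptions the base group is only notified for notifications 1 through 4. If it should keep receiving alerts after that, it would need to be added to each escalation as well.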
It does look like there have been changes to the contactgroup memberships and possibly the notification handler. Are you aware of any changes that were made during the month to either of these or escalations?
Re: Critical notifications not being sent from XI
Posted: Fri May 31, 2019 5:45 pm
by rferebee
I am fairly certain about the dates of the events. I can PM you emails that my team received during the time this issue was happening.
Escalations are configured for most of our Service Checks. If no one acknowledges or resolves the service alert after 5 days, an escalation email is sent to our Nagios Admin group every day for 5 days and once more on the 9th day. Those emails seemed to go out just fine during this time.
The ones I'm worried about are the ones that were supposed to be sent to the ServerSupportContact group. It appears, looking at the notification log, that only one notification was sent to that contact group on May 1st and then never again despite the Alert Settings being configured to send a notification every 1440 minutes. See screen shot attached.
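As a rough cross-check against the notification log, the expected re-notification times for a 1440-minute interval can be worked out like this. It is only a sketch: the end time of May 11th at midnight is an assumption (the exact time the status changed to Warning is not given here), and it ignores notification periods, acknowledgements, and check scheduling.

```python
from datetime import datetime, timedelta


def expected_notifications(first_sent, interval_minutes, until):
    """List the times a notification would be due, starting at first_sent."""
    times = []
    current = first_sent
    while current <= until:
        times.append(current)
        current += timedelta(minutes=interval_minutes)
    return times


first = datetime(2019, 5, 1, 22, 30, 27)  # first Critical notification
until = datetime(2019, 5, 11, 0, 0, 0)    # assumed end of the Critical period
schedule = expected_notifications(first, 1440, until)

for number, when in enumerate(schedule, start=1):
    print(number, when.isoformat(sep=" "))
```

Under these assumptions there would be 10 notifications due, with the 5th falling on May 5th at 22:30:27.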
Where are you seeing that only Recovery and Critical alerts are sent out?
The ServerSupportContact group is added to multiple service checks weekly, basically whenever we add new devices for monitoring it's possible that group will be added to alerting. I'm not aware of any changes to the notification handler for the group.
Re: Critical notifications not being sent from XI
Posted: Mon Jun 03, 2019 12:40 pm
by cdienger
Thanks for confirming. I'm labbing this up to see if I can reproduce and will keep you updated.
Re: Critical notifications not being sent from XI
Posted: Mon Jun 03, 2019 2:45 pm
by rferebee
The last notification that my XI environment sent out was 5/24/2019 at 10:15:00 and hasn't sent one out since.
Something happened, I'm not sure what.
Re: Critical notifications not being sent from XI
Posted: Mon Jun 03, 2019 4:17 pm
by rferebee
Wait. Somehow my failover server's backup was written to my Production server, which disabled notifications.
Re: Critical notifications not being sent from XI
Posted: Mon Jun 03, 2019 4:33 pm
by rferebee
I don't know what the heck happened, but I think I have it fixed. Somehow the settings in my Prod environment were overwritten by the settings in my failover environment sometime on or before May 24th.
My failover environment is configured to not send out notifications, so my Prod environment hasn't been sending out any alerts since the 24th... damn.
Having 3 different environments that all back up to and from each other is a real pain sometimes.
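One quick sanity check after a restore like this is to see what the main config file says about notifications. The sketch below assumes the setting lives in the enable_notifications directive of nagios.cfg, commonly found at /usr/local/nagios/etc/nagios.cfg. The function name is hypothetical, and the running state can differ from the file if program state is retained across restarts, so treat it as a first look only.

```python
def notifications_enabled(cfg_text):
    """Return True or False for the last enable_notifications directive found.

    Returns None if the directive does not appear in the text.
    """
    value = None
    for raw_line in cfg_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == "enable_notifications":
            value = val.strip() == "1"
    return value


# Example call (path is an assumption about a typical install):
# with open("/usr/local/nagios/etc/nagios.cfg") as fh:
#     print(notifications_enabled(fh.read()))
```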
Re: Critical notifications not being sent from XI
Posted: Tue Jun 04, 2019 11:16 am
by cdienger
Thanks for the update! Are we okay to lock this one up or did you have any further questions/concerns about this?
Re: Critical notifications not being sent from XI
Posted: Tue Jun 04, 2019 11:20 am
by rferebee
Hold on, sorry for the confusion.
The original issue I opened this thread for is NOT resolved. We still need to figure out why the ServerSupportContact group didn't get their notifications from May 11th-19th.
The issue I was talking about yesterday was something different entirely. Honestly, I shouldn't even have mentioned it in this thread.