alert self-healing issue
Posted: Sun Apr 15, 2018 10:46 am
by DataAssure
Hi Support,
We are testing alert escalation and notice something weird is happening in our lab. Here is the situation:
A Windows host went down and we received an AD Domain Service alert. Even after that host resumed normal operation, including the AD Domain Service, Nagios kept reporting that the Windows host was down; the host only returned to normal status a day or two later. How do we troubleshoot this self-healing symptom, and where should we look for the root cause? TIA
Re: alert self-healing issue
Posted: Mon Apr 16, 2018 3:11 pm
by tgriep
I just want to verify that a Windows system went down and that you received a Host Notification and a Service Notification for a service called AD Domain Service.
Then the host was brought back up, but you continued to receive Host and Service notifications saying it was down until a day or two went by, at which point both returned to a normal state. Is that what happened?
Did you have the same symptoms for other Hosts and Services on the system?
Are you checking any other services on that host and did they act the same?
Can you post the Email notification you were receiving at that time?
Re: alert self-healing issue
Posted: Mon Apr 16, 2018 3:15 pm
by cdienger
First I would have a look at https://assets.nagios.com/downloads/nag ... ations.pdf to help make sure the escalation config is correct. You can also check the host, service, and escalation config by reviewing /usr/local/nagios/etc/hosts/*hostcfg*, /usr/local/nagios/etc/services/*servicecfg*, /usr/local/nagios/etc/hostescalations.cfg, and /usr/local/nagios/etc/serviceescalations.cfg.
While it is in the misbehaving state, I would look at /usr/local/nagios/var/nagios.log, which should contain an entry each time the host/service changes state.
status.dat would be a good place to check as well - it will contain status information and a count of notifications sent.
nagios.log can roll over and have a lot of noise in it, so being ready to run grep against it will be an advantage.
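If plain grep gets unwieldy, the same filtering can be sketched in Python. This is only a rough sketch: it assumes the usual Nagios Core log line layout of "[epoch] TYPE: field;field;...", where ALERT lines begin with the host name and NOTIFICATION lines begin with the contact name. The function names and the host name in the usage note are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Assumed log layout: "[<epoch>] <TYPE>: <semicolon-separated fields>"
LINE_RE = re.compile(
    r"^\[(\d+)\] "
    r"(HOST ALERT|SERVICE ALERT|HOST NOTIFICATION|SERVICE NOTIFICATION): (.*)$"
)

def filter_events(lines, host):
    """Yield (utc_time, event_type, fields) for alert/notification lines about host."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # skip everything that is not an alert or notification
        epoch, kind, rest = m.groups()
        fields = rest.split(";")
        # ALERT lines start with the host name; NOTIFICATION lines start
        # with the contact, so the host name is the second field there.
        idx = 1 if kind.endswith("NOTIFICATION") else 0
        if len(fields) > idx and fields[idx] == host:
            when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
            yield when, kind, fields

def print_events(path, host):
    """Print the matching events from one log file in chronological order."""
    with open(path, errors="replace") as fh:
        for when, kind, fields in filter_events(fh, host):
            print(when.isoformat(), kind, ";".join(fields))
```

Calling print_events("/usr/local/nagios/var/nagios.log", "win-dc01") (win-dc01 is a placeholder host name) would list every state change and notification for that host, which makes it easier to see whether an UP recovery was ever logged between the DOWN alerts.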
status.dat is also updated frequently, so unless the host is still in the bad state it probably won't be able to tell you much now.
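For the next time it happens, here is a rough Python sketch for pulling the state and notification count out of status.dat while the problem is live. It assumes the usual block layout ("hoststatus {" / "servicestatus {" followed by key=value lines and a closing "}") and the current_state, current_notification_number and last_state_change fields. The function names are made up for illustration, and the status.dat location depends on your install.

```python
def parse_status(lines):
    """Parse status.dat-style text into a list of (block_type, {key: value})."""
    blocks = []
    kind, data = None, None
    for raw in lines:
        line = raw.strip()
        if kind is None:
            # Outside a block: only a line like "hoststatus {" opens one.
            if line.endswith("{"):
                kind, data = line[:-1].strip(), {}
        elif line == "}":
            blocks.append((kind, data))
            kind, data = None, None
        elif "=" in line:
            # Split on the first "=" only; plugin output may contain more.
            key, _, value = line.partition("=")
            data[key] = value
    return blocks

def host_summary(blocks, host):
    """Return state and notification count for a host and each of its services."""
    rows = []
    for kind, data in blocks:
        if kind in ("hoststatus", "servicestatus") and data.get("host_name") == host:
            rows.append({
                "object": data.get("service_description", "(host)"),
                "current_state": data.get("current_state"),
                "notifications_sent": data.get("current_notification_number"),
                "last_state_change": data.get("last_state_change"),
            })
    return rows
```

Feed it the open status.dat file, e.g. host_summary(parse_status(open(path)), "win-dc01") with your own path and host name. For hosts, current_state 0 is UP, 1 is DOWN and 2 is UNREACHABLE; for services, 0 is OK, 1 WARNING, 2 CRITICAL and 3 UNKNOWN. If the host shows a non-zero state while the machine is actually up, compare last_state_change with the entries in nagios.log.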