alert self-healing issue

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
DataAssure
Posts: 34
Joined: Thu Jul 31, 2014 8:36 am

Post by DataAssure »

Hi Support,
We are testing alert escalation and noticed something odd happening in our lab. Here is the situation:
A Windows host went down, and we received an AD Domain Service alert. Even after that Windows host resumed normal operation, including the AD Domain Service, Nagios kept reporting that the host was down. The host only returned to a normal status a day or two later. How do we troubleshoot this self-healing symptom, and where should we look for the root cause? TIA
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: alert self-healing issue

Post by tgriep »

I just want to verify that a Windows system went down and that you received a Host Notification and a Service Notification for a service called AD Domain Service.
Then the host was brought back up, but you continued to receive Host and Service notifications saying it was down until a day or two went by, at which point both returned to a normal state. Is that what happened?
Did you see the same symptoms for other hosts and services on the system?
Are you checking any other services on that host, and did they act the same?

Can you post the Email notification you were receiving at that time?
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: alert self-healing issue

Post by cdienger »

First, I would have a look at https://assets.nagios.com/downloads/nag ... ations.pdf to help make sure the escalation config is correct. You can also review the host, service, and escalation configs in /usr/local/nagios/etc/hosts/*hostcfg*, /usr/local/nagios/etc/services/*servicecfg*, /usr/local/nagios/etc/hostescalations.cfg, and /usr/local/nagios/etc/serviceescalations.cfg.
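
If it helps when reviewing the escalation files, here is a rough Python sketch that pulls the directives out of `define serviceescalation { ... }` blocks so they can be compared side by side. The helper name `parse_definitions`, the host name `win-dc01`, and the sample values are all made up for illustration; point it at the text of your own serviceescalations.cfg instead.

```python
import re

def parse_definitions(text, object_type):
    """Return one dict of directives per `define <object_type> { ... }` block."""
    pattern = re.compile(
        r"define\s+" + re.escape(object_type) + r"\s*\{(.*?)\}", re.DOTALL
    )
    results = []
    for body in pattern.findall(text):
        directives = {}
        for line in body.splitlines():
            # A ';' starts an inline comment in Nagios object definitions.
            line = line.split(";", 1)[0].strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        results.append(directives)
    return results

# Hypothetical sample; real files live under /usr/local/nagios/etc/.
sample = """
define serviceescalation {
    host_name               win-dc01            ; hypothetical host
    service_description     AD Domain Service
    first_notification      3
    last_notification       0
    notification_interval   60
    contact_groups          admins
}
"""

escalations = parse_definitions(sample, "serviceescalation")
for esc in escalations:
    print(esc.get("host_name"), esc.get("service_description"),
          esc.get("first_notification"), esc.get("last_notification"),
          esc.get("notification_interval"))
```

This only reads the files; it does not validate them the way Nagios itself does, so treat it as a quick way to eyeball the notification range and interval for the affected host.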

While it is in the misbehaving state, I would look at /usr/local/nagios/var/nagios.log, which should contain an entry each time the host or service changes state. status.dat would be a good place to check as well - it contains status information and a count of notifications sent. nagios.log can roll over and have a lot of noise in it, so being ready to run grep against it will be an advantage. status.dat is updated frequently, so unless the host is currently in the bad state it probably won't be able to tell you much now.
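
As an alternative to plain grep, a small Python sketch along these lines can pull just the state changes and notifications for one host out of nagios.log and convert the epoch timestamps to readable times. The function name `events_for_host`, the host `win-dc01`, and the sample log lines are invented for illustration; the sketch assumes the usual `[epoch] TYPE: field;field;...` layout of nagios.log entries.

```python
import re
from datetime import datetime, timezone

LINE_RE = re.compile(r"^\[(\d+)\] ([A-Z ]+): (.*)$")
WANTED = {"HOST ALERT", "SERVICE ALERT", "HOST NOTIFICATION", "SERVICE NOTIFICATION"}

def events_for_host(lines, host):
    """Return (time, entry type, details) tuples for entries that mention `host`."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        epoch, kind, rest = m.groups()
        if kind not in WANTED:
            continue
        fields = rest.split(";")
        # ALERT entries start with the host name; NOTIFICATION entries
        # start with the contact name, followed by the host name.
        if kind.endswith("NOTIFICATION") and len(fields) > 1:
            host_field = fields[1]
        else:
            host_field = fields[0]
        if host_field != host:
            continue
        when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
        events.append((when, kind, rest))
    return events

# Made-up sample lines; in practice read them from the log, e.g.
#   with open("/usr/local/nagios/var/nagios.log") as f: lines = f.readlines()
sample = [
    "[1500000000] HOST ALERT: win-dc01;DOWN;HARD;5;CRITICAL - Host Unreachable",
    "[1500000060] SERVICE NOTIFICATION: nagiosadmin;win-dc01;AD Domain Service;CRITICAL;notify-service-by-email;NTDS: Stopped",
    "[1500000120] HOST ALERT: other-host;UP;HARD;1;PING OK",
    "[1500003600] HOST ALERT: win-dc01;UP;HARD;1;PING OK",
]

events = events_for_host(sample, "win-dc01")
for when, kind, rest in events:
    print(when.isoformat(), kind, rest)
```

Comparing the time of the last DOWN alert, the recovery alert, and the notifications in between should show whether Nagios really kept seeing the host as down or only kept notifying.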