Acknowledgement Refresh

Posted: Fri Mar 16, 2012 2:39 pm
by joe1871
We have technicians who use the Acknowledgement feature of Nagios to quiet an alert, but then do not follow up on the problem. This reared its very ugly head today when a disk capacity warning that had been acknowledged turned into a full drive on a key machine. We are now recovering from that minor disaster. I am surprised that an Ack of an alert would suppress that alert indefinitely. I would expect Nagios to have some logic that says if the condition persists for x period of time after an Ack, then re-alert. Does anybody know if this is in there? Thanks.

Re: Acknowledgement Refresh

Posted: Sun Mar 18, 2012 5:07 pm
by jsmurphy
I don't think it has that functionality... at least not that I am aware of (I've never really looked). We had the same issue early on and we deemed it to be an education issue more than an application issue, because chances are that if they just acknowledged and ignored it the first time, they will probably do the same thing when it re-alerts. We made sure our users understood the importance of using the right type of downtime. A couple of them were resistant to doing things the right way, but after the first business-visible failure, management bore down on them for misusing the system after they had been taught how to use it, and the problem pretty much went away.

If larger numbers of people are still doing it... it could indicate that your Nagios is being too verbose and your engineers are struggling to work out what's urgent/legitimate and what's just white noise. Just ask them and they will tell you if they are struggling with your current alerting regime... ultimately the monitoring is there to help them prevent failures, and if they aren't finding it useful then it's not doing its job.

Re: Acknowledgement Refresh

Posted: Mon Mar 19, 2012 9:46 am
by mguthrie
I would second what jsmurphy said. I know of a user with a large installation who ran a cron job to automatically delete comments and acknowledgments older than X days, but I think re-tuning the notifications and addressing the personnel issues around how problems are being handled is the real fix on this one...
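
For anyone who wants to try the cron job approach, here is a rough sketch of what such a script might look like. It is not the script that user ran. It assumes you already have a list of current acknowledgements with their timestamps (how you collect them, for example by parsing the status file, is left out), and the `Ack` record, the function names, and the command file path are all hypothetical. The only Nagios-specific parts are the `REMOVE_SVC_ACKNOWLEDGEMENT` and `REMOVE_HOST_ACKNOWLEDGEMENT` external commands; once an acknowledgement is removed, normal notifications for the problem should resume.

```python
import time
from typing import Iterable, List, NamedTuple, Optional


class Ack(NamedTuple):
    """Hypothetical record for one acknowledged problem."""
    host: str
    service: Optional[str]  # None means a host acknowledgement
    entry_time: int         # Unix time the acknowledgement was made


def expire_commands(acks: Iterable[Ack], max_age_days: float,
                    now: Optional[int] = None) -> List[str]:
    """Build external command lines that remove acks older than max_age_days."""
    now = int(time.time()) if now is None else now
    cutoff = now - max_age_days * 86400
    commands = []
    for ack in acks:
        if ack.entry_time >= cutoff:
            continue  # still fresh, leave it alone
        if ack.service is None:
            commands.append(f"[{now}] REMOVE_HOST_ACKNOWLEDGEMENT;{ack.host}")
        else:
            commands.append(
                f"[{now}] REMOVE_SVC_ACKNOWLEDGEMENT;{ack.host};{ack.service}")
    return commands


def submit(commands: Iterable[str],
           command_file: str = "/usr/local/nagios/var/rw/nagios.cmd") -> None:
    """Write the commands to the Nagios command pipe.

    The default path is an assumption; check command_file in your nagios.cfg.
    """
    with open(command_file, "w") as pipe:
        for line in commands:
            pipe.write(line + "\n")
```

Run from cron, something like `submit(expire_commands(current_acks, max_age_days=7))` would clear anything acknowledged more than a week ago, where `current_acks` is whatever list your own collection step produces. Keeping the command building separate from the write makes it easy to dry-run the script and print what it would remove before pointing it at a live command pipe.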