Hi,
I'm currently setting up some active checks that monitor batch processing jobs to see whether they have failed.
Typically, when they fail, they'll be re-run, either automatically or manually.
I'm concerned about Nagios missing a failure in the following case: a job fails (the Nagios alert is then acknowledged) and is re-run, but fails again almost immediately, before Nagios has had a chance to see it as OK (job re-running). Because the previous and current states are the same (critical), the state wouldn't have changed and so no alert would be generated.
As an example:
Job starts at 11PM
Job fails at 11:30PM
At 11:35PM Nagios checks and goes critical, our monitoring team acknowledge the alert and kick the job off again.
At 11:37PM, the running job fails again, but Nagios isn't due to re-check the alert for another few minutes.
At 11:40PM, Nagios checks the job again, sees the state hasn't changed (still critical) and so will not re-alert.
Is there any logic I can build to get around this (short of asking the teams not to acknowledge alerts)?
My only thought is to amend my plugin to check whether the alert is acknowledged, remove the acknowledgement if so, and then send a critical status. Can removing acknowledgements be done via the API?
Remove Acknowledgement via API
Re: Remove Acknowledgement via API
I would assert that the monitoring team shouldn't be "acknowledging" a problem unless they intend to see it through to resolution. In which case, fresh alerts wouldn't matter because a human is paying specific attention to the problem state waiting for it to recover.
There are 2 external commands for removing acknowledgements for a particular check:
REMOVE_HOST_ACKNOWLEDGEMENT
REMOVE_SVC_ACKNOWLEDGEMENT
If your plugin is running locally relative to your Nagios XI machine, those external commands are one option.
If this plugin is executed remotely (via NRPE, NCPA, NSClient++, etc) I don't believe the API for XI currently supports those external commands, but an NRDP endpoint could. More info on NRDP within Nagios XI (see Page 4 for command examples):
https://assets.nagios.com/downloads/nag ... erview.pdf
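For the local case, a minimal sketch of submitting REMOVE_SVC_ACKNOWLEDGEMENT from a Python plugin might look like the following. The command file path is an assumption (a common default; check the command_file setting in your nagios.cfg), and the function names, host name, and service description are hypothetical placeholders. The line format is the standard external command format: a Unix timestamp in square brackets, the command name, then semicolon-separated arguments.

```python
import time

# Assumed default location of the external command file (a named pipe).
# Verify against the command_file directive in your own nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_remove_svc_ack(host, service, now=None):
    """Build the external command line that removes a service acknowledgement."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] REMOVE_SVC_ACKNOWLEDGEMENT;{host};{service}\n"


def submit_external_command(line, command_file=COMMAND_FILE):
    """Write one command line to the Nagios external command file."""
    # Opening the pipe for writing blocks until Nagios has it open for
    # reading, so this only returns promptly while Nagios is running.
    with open(command_file, "w") as handle:
        handle.write(line)


# Example call (hypothetical host and service names):
# submit_external_command(format_remove_svc_ack("batch01", "Nightly Batch Job"))
```

The user running the plugin needs write permission on the command file, which usually means being in the same group as the web server and Nagios users.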
Former Nagios employee
https://www.mcapra.com/
npolovenko
Re: Remove Acknowledgement via API
@JGCG, I agree with @mcapra. Perhaps you should increase the notification interval for this service? That way, if a team member receives an email notification, they can restart the job, and by the time the next email notification is due, the job will either have recovered to an OK state or still be critical and send another notification (as it should). Not sure if I 100% understand your setup.
Re: Remove Acknowledgement via API
Thanks guys. I'll take a look at the solutions mentioned; this can be resolved.