Remove Acknowledgement via API
Posted: Mon Jan 21, 2019 9:38 am
Hi,
I'm currently setting up some active checks that monitor batch processing jobs to see whether they have failed.
Typically, when they fail, they'll be re-run, either automatically or manually.
I'm concerned about Nagios missing a failure in the following case: a job fails (the Nagios alert is then acknowledged) and is re-run, but fails again almost immediately, before Nagios has had a chance to see it as OK (job re-running). Because the previous and current states would be the same (critical), Nagios would see no state change and no new alert would be generated.
As an example:
Job starts at 11PM
Job fails at 11:30PM
At 11:35PM Nagios checks and goes critical, our monitoring team acknowledge the alert and kick the job off again.
At 11:37PM, the running job fails again, but Nagios isn't due to re-check the alert for another few minutes.
At 11:40PM, Nagios checks the job again, sees the state hasn't changed (still critical) and so will not re-alert.
Is there any logic I can build to get around this (short of asking the teams not to acknowledge alerts)?
My only thought is to amend my plugin to check whether the alert is acknowledged, remove the acknowledgement if so, and then send a critical status. Can removing acknowledgements be done via the API?
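For reference, here is a minimal sketch of the "remove the acknowledgement" step. It uses the Nagios Core external command REMOVE_SVC_ACKNOWLEDGEMENT, written to the external command file, rather than a REST API; whether the API exposes the same operation is not something this sketch assumes. The command file path, host name and service description are placeholders, and the calling user would need write access to the command file.

```python
import time

# Assumed default location of the Nagios Core external command file;
# the real path is the command_file setting in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def build_remove_ack_command(host, service, timestamp=None):
    """Format a REMOVE_SVC_ACKNOWLEDGEMENT external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"[{timestamp}] REMOVE_SVC_ACKNOWLEDGEMENT;{host};{service}\n"


def submit_command(line, command_file=COMMAND_FILE):
    """Write one command line to the external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Hypothetical usage from inside the plugin, before returning CRITICAL:
#   submit_command(build_remove_ack_command("batch01", "Nightly Batch Job"))
```

The plugin would still need its own way to tell that the job has failed a second time (for example by comparing the job's last failure time with the one it saw on the previous check), otherwise it would clear every acknowledgement on every critical result.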