Nagios Support Forum

Posted: **Mon Jan 21, 2019 9:38 am**

Hi,

I'm currently setting up some active checks that monitor batch processing jobs to see whether they have failed.
Typically, when a job fails, it is re-run, either automatically or manually.

I'm concerned about Nagios missing a repeat failure: a job fails (the Nagios alert is then acknowledged) and is re-run, but it fails again almost immediately, before Nagios has had a chance to see it in an OK state (job re-running). Because the previous and current states are the same (critical), there is no state change, so no new alert would be generated.

As an example:

Job starts at 11PM
Job fails at 11:30PM
At 11:35PM Nagios checks and goes critical; our monitoring team acknowledges the alert and kicks the job off again.
At 11:37PM, the re-run job fails again, but Nagios isn't due to re-check the alert for another few minutes.
At 11:40PM, Nagios checks the job again, sees the state hasn't changed (still critical) and so will not re-alert.

Is there any logic I can build to get around this (short of asking the teams not to acknowledge alerts)?
My only thought is to amend my plugin to check whether the alert is acknowledged, remove the acknowledgement if so, and then send a critical status. Can removing acknowledgements be done via the API?

Posted: **Mon Jan 21, 2019 1:07 pm**

I would assert that the monitoring team shouldn't be "acknowledging" a problem unless they intend to see it through to resolution. In that case, fresh alerts wouldn't matter because a human is paying specific attention to the problem state and waiting for it to recover.

There are 2 external commands for removing acknowledgements for a particular check:
REMOVE_HOST_ACKNOWLEDGEMENT
REMOVE_SVC_ACKNOWLEDGEMENT

If your plugin is running locally relative to your Nagios XI machine, those external commands are one option.
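
As a rough sketch of the local option, a plugin or wrapper script could write the command straight to the Nagios external command file. The command file path below is an assumption (it is the usual default, but check `command_file` in your `nagios.cfg`), and the function names and the host/service names are made up for illustration:

```python
import time

# Assumed default location; confirm against command_file in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_remove_svc_ack(host_name, service_description, now=None):
    """Build one external command line: [epoch] COMMAND;host;service."""
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] REMOVE_SVC_ACKNOWLEDGEMENT;"
        f"{host_name};{service_description}\n"
    )


def remove_svc_ack(host_name, service_description, command_file=COMMAND_FILE):
    """Write the command to the command file (a named pipe on a live system)."""
    line = format_remove_svc_ack(host_name, service_description)
    with open(command_file, "w") as pipe:
        pipe.write(line)


if __name__ == "__main__":
    # Hypothetical host and service names.
    print(format_remove_svc_ack("batch01", "Nightly Job"), end="")
```

The user running the script needs write access to the command file, which normally means being in the same group as the Nagios process.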

If this plugin is executed remotely (via NRPE, NCPA, NSClient++, etc) I don't believe the API for XI currently supports those external commands, but an NRDP endpoint could. More info on NRDP within Nagios XI (see Page 4 for command examples):
https://assets.nagios.com/downloads/nag ... erview.pdf
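
For the remote case, something along these lines might work as a starting point. It assumes NRDP's `submitcmd` request takes `cmd`, `token` and `command` query parameters, which is my reading of the doc linked above, so please verify it there. The server URL, token, and host/service names are placeholders, and NRDP has to be configured with that token and allowed to accept external commands:

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_nrdp_command_url(base_url, token, command):
    """Build an NRDP submitcmd URL (parameter names assumed from the NRDP doc)."""
    query = urlencode({"cmd": "submitcmd", "token": token, "command": command})
    return f"{base_url.rstrip('/')}/?{query}"


def submit_nrdp_command(base_url, token, command, timeout=10):
    """Send the command to NRDP and return the raw response body."""
    url = build_nrdp_command_url(base_url, token, command)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Placeholder server, token, host and service names.
    print(build_nrdp_command_url(
        "http://xi.example.com/nrdp/",
        "secret",
        "REMOVE_SVC_ACKNOWLEDGEMENT;batch01;Nightly Job",
    ))
```

The plugin would call `submit_nrdp_command` just before returning its critical status, so that the next critical result is treated as an unacknowledged problem.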

Posted: **Mon Jan 21, 2019 5:44 pm**

@JGCG, I agree with @mcapra. Perhaps you should increase the notification interval for this service? That way, if a team member receives an email notification, they can restart the job, and by the time the next email notification is due the job will either have recovered to an OK state or still be critical and send another notification (as it should). I'm not sure I 100% understand your setup.

Posted: **Tue Jan 22, 2019 5:45 am**

Thanks guys. I'll take a look at the solutions mentioned; this can be resolved.

Remove Acknowledgement via API
