Re: [Nagios-devel] [PATCH] notifications: Fix


On 12/10/2012 06:27 PM, Jochen Bern wrote:
> On 10.12.2012 16:44, Andreas Ericsson wrote:
>> AFAIR, the original use-case [of first_notification_delay] was
>> to allow operators to react to HARD alerts and acknowledge or
>> fix them before notifications were sent out.
>
> At least, that's how some organizations *did and do* use it. And some
> probably also add the on-turning-HARD event handler execution to the mix
> of things that hopefully might make notifications unnecessary in the
> last second.
>
> FWIW, from a rather principles-oriented point of view, the sequence of
> SOFT non-OK --> HARD non-OK --> first notification (with the different
> degrees of visibility these states imply) is as much a part of the
> system of escalations as the part *called* "escalations" is. I wonder
> whether a long-term consolidation of terms and mechanisms might prove
> beneficial.
>
>> * If delaying the notification causes it to end up in a time where
>> notifications should be sent, it should be sent even if the time of
>> the alert happened during a period when no notifications should have
>> been sent.
>> * If delaying the check causes it to switch to a state which should
>> not result in a notification, no notification should be sent out.
>
> (That's how escalations *already* behave WRT earlier non- or
> lesser-escalated notifications, isn't it? Hence, The Right Thing To Do
> (tm) in my books.)
>

AFAICS, yes.
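
A minimal sketch of the two rules quoted above, assuming the decision is
made when the delay expires rather than when the alert happens. All names
here (PendingNotification, should_send, in_notification_period) are
hypothetical and are not taken from the Nagios Core source; times are plain
seconds for simplicity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PendingNotification:
    # Hypothetical fields, not Nagios Core identifiers.
    alert_time: int  # when the HARD problem state was entered (seconds)
    delay: int       # first_notification_delay, expressed in seconds here


def should_send(pending: PendingNotification,
                now: int,
                state_is_problem: bool,
                in_notification_period: Callable[[int], bool]) -> bool:
    """Decide at time `now` whether a delayed notification goes out."""
    if now < pending.alert_time + pending.delay:
        return False  # the delay has not elapsed yet
    if not state_is_problem:
        return False  # state changed during the delay: nothing to notify
    # The notification period is evaluated at send time, not at alert time.
    return in_notification_period(now)


if __name__ == "__main__":
    # Assumed period: notifications are allowed from t = 1000 onwards.
    allowed = lambda t: t >= 1000
    pending = PendingNotification(alert_time=900, delay=200)
    # The alert happened outside the period, but the delay ends inside it.
    print(should_send(pending, 1100, True, allowed))
```

The alert at t = 900 falls outside the assumed period, yet the notification
is still sent at t = 1100 because only the send time is checked.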

>> * Delaying a notification should not increase its notification_number,
>> and will, as such, affect both regular and escalated notifications.
>
> *Most definitely* agreed! I know several organizations which would be
> confused to no end if I had to tell them that, under certain
> circumstances, there *just was no* notification #n preceding the #n+1
> they received and try to figure out.
>
>> * Custom-, downtime-, acknowledgement and flapping notifications will
>> never be delayed (flapping is arguable, but matches current code).
>
> I am not aware, off the top of my head, of how Acknowledgment and
> Flapping notifications are supposed to behave WRT earlier notifications
> (as in "RECOVERYs are only sent to contacts who also had the PROBLEM
> sent to them"). If such a dependency does/should/will exist, whether or
> not to exempt them from first_notification_delay translates into
> potentially different sets of recipients.
>

Recovery notifications are sent only to the contacts that supposedly got
the problem notification. It doesn't always work for escalations, though:
only the current tier of escalation will get the recovery notification.
One could argue about whether that's correct, but that's how it is today
at any rate.

> For acknowledgments, sending the notification early (and to the
> *restricted* set of recipients) is likely what the person acknowledging
> the problem *wants* to happen. FWIW, same thing for Downtimes, which are
> technically prophetic acknowledgments. ;-)
>
> Customs can probably lean both ways, depending on what you use them for.
>

There's also an extra option for custom notifications, which is to let the
notifier select how forcefully the notification should be pushed through,
but I think that's an unnecessary complication. Right now, custom
notifications always go to the primary contacts and the escalated ones
aren't considered.

> Flapping ....... I'll have to pass on that. The things I monitor do not
> really flap, and flapping detection is typically disabled.
>

Flapping is the only real issue, actually. It's a problem, of sorts, but
one where, going by empirical evidence, we might as well flip a coin to
decide whether to notify. I think the correct thing to do would be to wait
until the flapping ends and, if it comes out in a problem state, add the
time the node was flapping to problem_duration, which is measured against
the notify_delay to see if it's time to notify or not.
That means that a service that starts flapping at 15:05, stops flapping at
15:08 and goes into non-flapping hard critical state should get the three
minutes it spent flapping discounted from the first_notification_delay.
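
The arithmetic of that example can be sketched as follows. This is only an
illustration: remaining_delay and clock are hypothetical helpers, times are
in seconds, and the five-minute first_notification_delay is an assumed
value, not one given above.

```python
def clock(hours: int, minutes: int) -> int:
    """Convert a wall-clock time to seconds since midnight."""
    return (hours * 60 + minutes) * 60


def remaining_delay(first_notification_delay: int,
                    flap_start: int,
                    flap_stop: int) -> int:
    """Seconds still to wait once flapping ends in a HARD problem state.

    The time spent flapping counts towards the delay, so it is
    subtracted; the result never goes below zero.
    """
    flapping_time = flap_stop - flap_start
    return max(0, first_notification_delay - flapping_time)


if __name__ == "__main__":
    delay = 5 * 60  # assumed first_notification_delay of five minutes
    # Flapping from 15:05 to 15:08 uses up three of the five minutes.
    print(remaining_delay(delay, clock(15, 5), clock(15, 8)))  # prints 120
```

With a delay shorter than the flapping period, the result is 0 and the
notification would be due as soon as the flapping ends.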

In terms of

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: ae@op5.se