Hi,
we have a problem with notifications in Nagios. We use a service template that is set to send an SMS notification for CRITICAL status only. So, if the status is WARNING for a longer period, the state changes to HARD, which is okay. If then ONE check attempt is CRITICAL (for example, a network timeout), the notification is sent out immediately. How can we prevent Nagios from sending this notification? If the status remains CRITICAL, the SMS notification should of course be sent! But not until there is a second check that returns CRITICAL.
Can anybody help me do this for ALL services? I tried to configure this for one service, and it seems to work. I set up a serviceescalation that sets the notification_interval to 2 minutes; after that, the next step is to inform the admins via SMS:
define serviceescalation{
    hostgroup_name          HOSTGROUPNAME
    contact_groups          admins
    service_description     CHECKNAME
    first_notification      1
    last_notification       1
    notification_interval   2
}
define serviceescalation{
    hostgroup_name          HOSTGROUPNAME
    contact_groups          admins_sms
    service_description     CHECKNAME
    first_notification      2
}
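As a side note, the SMS escalation could also be limited to CRITICAL states with the standard escalation_options directive. A minimal sketch, using the same placeholder names as above:

    define serviceescalation{
        hostgroup_name          HOSTGROUPNAME
        contact_groups          admins_sms
        service_description     CHECKNAME
        first_notification      2
        escalation_options      c    ; only escalate when the service is CRITICAL
    }

With escalation_options set to c, the admins_sms group is notified only for CRITICAL notifications, never for WARNING ones.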
prevent first critical notification
Re: prevent first critical notification
Rather than looking at this through the lens of limiting notifications, it's more appropriate to better define what is a "HARD" vs a "SOFT" state for a given check. Nagios Core only sends notifications on a "HARD" state, so what you should instead be doing is adjusting the max_check_attempts, check_interval, and retry_interval directives of your host/service definitions to be more fine-tuned for the behavior you'd like to see.
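For illustration (the host name, description, and check command here are placeholders, not from the original post), a service tuned this way might look like:

    define service{
        host_name           myhost
        service_description /var Disk Usage
        check_command       check_local_disk!20%!10%!/var
        max_check_attempts  2    ; require two consecutive non-OK results before going HARD
        check_interval      5    ; minutes between checks while OK
        retry_interval      1    ; minutes between rechecks while SOFT
        contacts            nagiosadmin
    }

With max_check_attempts set to 2, a single CRITICAL result only produces a SOFT state; the notification is dispatched only if a second consecutive check also returns a problem state.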
More explanation on HARD vs SOFT states:
https://assets.nagios.com/downloads/nag ... types.html
Former Nagios employee
https://www.mcapra.com/
Re: prevent first critical notification
Yes, thank you. But I really think this is a problem. For example:
The partition /var is 85% full, which means a WARNING status is generated. This happens in the evening at 11pm. Then at 4am (while everyone sleeps), ONE check fails with CRITICAL because of a timeout, and the admin immediately gets an SMS. I think this is not a reason to wake somebody up. It would be better to wait for a second CRITICAL return; then it makes sense to wake the admin up.
Re: prevent first critical notification
I agree completely with the proposed scenario and outcomes, but I'm saying that it's better to tell Nagios Core what really is a "problem" state (a "HARD" state) than to try to adjust anything happening on the notification end of the process. If you tell Nagios Core "this is only *really* a problem when the critical happens twice", then the notification won't be dispatched, because the state is not a "HARD CRITICAL"; it is a "SOFT CRITICAL" until 2 criticals have occurred back-to-back.
For example, the following service will only be in a "HARD" state when 5 problem states were found in a row (as per the max_check_attempts directive):

    define service {
        host_name               192.168.67.99
        service_description     Drive C: Disk Usage
        check_command           check_xi_service_wmiplus!'admin'!'welcome123'!checkdrivesize!-a 'C': -w '80' -c '95'
        max_check_attempts      5
        check_interval          5
        retry_interval          1
        contacts                nagiosadmin
    }

This means that, when a problem is first detected, it will require roughly 5 minutes (retry_interval * max_check_attempts) of a SOFT problem state before the state becomes "HARD" and a notification is dispatched. If I wanted to double that time before a "HARD" state (and thus a notification) is triggered, I could simply double the original max_check_attempts:

    define service {
        host_name               192.168.67.99
        service_description     Drive C: Disk Usage
        check_command           check_xi_service_wmiplus!'admin'!'welcome123'!checkdrivesize!-a 'C': -w '80' -c '95'
        max_check_attempts      10
        check_interval          5
        retry_interval          1
        contacts                nagiosadmin
    }

This is effectively the same thing as saying "ignore the first critical, only let me know if it's really, really a problem and I get a second critical".
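Since the original question asked how to do this for ALL services, the same directives could live in a shared service template that every service inherits. A sketch under that assumption (the template name and the CHECKNAME/CHECKCOMMAND placeholders are illustrative):

    define service{
        name                    soft-critical-template    ; hypothetical template name
        max_check_attempts      2    ; a single CRITICAL stays SOFT; the second makes it HARD
        check_interval          5
        retry_interval          1
        register                0    ; template only, never registered as a real service
    }

    define service{
        use                     soft-critical-template
        host_name               myhost
        service_description     CHECKNAME
        check_command           CHECKCOMMAND
        contacts                nagiosadmin
    }

Any service that sets "use soft-critical-template" then picks up the retry behavior without repeating the directives.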
Former Nagios employee
https://www.mcapra.com/
Re: prevent first critical notification
Thank you very much. But I think you did not understand 100% of my problem, or maybe I explained it wrong!
I even completely agree with your post. But what I meant was:
11pm: a WARNING HARD state is shown in Nagios because at 10:56pm the service became WARNING, five retries were made, HARD status was reached, and an email notification was sent (that is OF COURSE desired, because it makes sense to check this later, even if the problem recovers in the night when nobody sees it)
0am: still HARD WARNING, nobody has seen the problem, that's okay, the admin can handle this problem at 8am when he's at work
1am: still HARD WARNING, nobody has seen the problem, that's okay, the admin can handle this problem at 8am when he's at work
2am: still HARD WARNING, nobody has seen the problem, that's okay, the admin can handle this problem at 8am when he's at work
3am: still HARD WARNING, nobody has seen the problem, that's okay, the admin can handle this problem at 8am when he's at work
Now the problem:
4am: one check fails with a CRITICAL status (network timeout), and immediately the admin gets an SMS. The next check is again a WARNING, so it would be better for the admin to get an SMS only when there is really, really a problem, not just because one check fails...
Or did I get you wrong? If so, I am sorry, but then I really did not understand what you mean. I think in your example the same thing would happen, wouldn't it?
Re: prevent first critical notification
How about first_notification_delay?
https://assets.nagios.com/downloads/nag ... .html#host
This will impose a delay, measured from the last time an OK was recorded for the object, before the first problem notification goes out.
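A sketch of how that might look in a service definition (the host, check command, and values are illustrative; first_notification_delay is expressed in interval_length units, typically minutes):

    define service{
        host_name                   myhost
        service_description         CHECKNAME
        check_command               CHECKCOMMAND
        max_check_attempts          5
        check_interval              5
        retry_interval              1
        first_notification_delay    10    ; wait 10 time units after the problem begins before the first notification
        contacts                    nagiosadmin
    }

A short CRITICAL blip that recovers within the delay window would then never trigger the SMS.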
Previous Nagios employee