Hi,
we have a problem with notifications in Nagios. We use a service template that is set to send an SMS notification for CRITICAL status only. So, if the status is WARNING for a longer period, the state changes to HARD, which is okay. If then ONE check attempt is CRITICAL (for example, a network timeout), the notification is sent out immediately. How can we prevent Nagios from sending this notification? If the status remains CRITICAL, the SMS notification should of course be sent! But not until there is a second check that returns CRITICAL.
Can anybody help me do this for ALL services? I tried to configure this for one service, and it seems to work. I set up a serviceescalation that sets the notification_interval to 2 minutes; after that, the next step is to inform the admins via SMS:
define serviceescalation{
    hostgroup_name          HOSTGROUPNAME
    contact_groups          admins
    service_description     CHECKNAME
    first_notification      1
    last_notification       1
    notification_interval   2
}
define serviceescalation{
    hostgroup_name          HOSTGROUPNAME
    contact_groups          admins_sms
    service_description     CHECKNAME
    first_notification      2
}
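As a side note, the SMS escalation could also be limited to CRITICAL states with the standard escalation_options directive. A minimal sketch, using the same placeholder names as above:

    define serviceescalation{
        hostgroup_name          HOSTGROUPNAME
        contact_groups          admins_sms
        service_description     CHECKNAME
        first_notification      2
        escalation_options      c    ; only escalate when the service is CRITICAL
    }

With escalation_options set to c, the admins_sms group is notified only for CRITICAL notifications, never for WARNING ones.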
prevent first critical notification
Re: prevent first critical notification
Rather than looking at this through the lens of limiting notifications, it's more appropriate to better define what is a "HARD" vs a "SOFT" state for a given check. Nagios Core only sends notifications on a "HARD" state, so what you should instead be doing is adjusting the max_check_attempts, check_interval, and retry_interval directives of your host/service definitions to be more fine-tuned for the behavior you'd like to see.
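For illustration (the host name, description, and check command here are placeholders, not from the original post), a service tuned this way might look like:

    define service{
        host_name           myhost
        service_description /var Disk Usage
        check_command       check_local_disk!20%!10%!/var
        max_check_attempts  2    ; require two consecutive non-OK results before going HARD
        check_interval      5    ; minutes between checks while OK
        retry_interval      1    ; minutes between rechecks while SOFT
        contacts            nagiosadmin
    }

With max_check_attempts set to 2, a single CRITICAL result only produces a SOFT state; the notification is dispatched only if a second consecutive check also returns a problem state.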
More explanation on HARD vs SOFT states:
https://assets.nagios.com/downloads/nag ... types.html
Former Nagios employee
https://www.mcapra.com/
Re: prevent first critical notification
Yes, thank you. But I really think this is a problem. For example:
The partition /var is 85% full, which means a WARNING status is generated. This happens in the evening at 11pm. Then at 4am (while everyone sleeps), ONE check fails with CRITICAL because of a timeout, and the admin immediately gets an SMS. I think this is not a reason to wake somebody up. It would be better to wait for a second CRITICAL return; then it makes sense to wake the admin up.
Re: prevent first critical notification
I agree completely with the proposed scenario and outcomes, but I'm saying that it's better to tell Nagios Core what really is a "problem" state (a "HARD" state) than to try to adjust anything happening on the notification end of the process. If you tell Nagios Core "this is only *really* a problem when the critical happens twice", then the notification won't be dispatched, because the state is not a "HARD CRITICAL"; it is a "SOFT CRITICAL" until 2 criticals have occurred back-to-back.
For example, the following service will only be in a "HARD" state when 5 problem states were found in a row (as per the max_check_attempts directive):

    define service {
        host_name               192.168.67.99
        service_description     Drive C: Disk Usage
        check_command           check_xi_service_wmiplus!'admin'!'welcome123'!checkdrivesize!-a 'C': -w '80' -c '95'
        max_check_attempts      5
        check_interval          5
        retry_interval          1
        contacts                nagiosadmin
    }

This means that, when a problem is first detected, it will require roughly 5 minutes (retry_interval * max_check_attempts) of a SOFT problem state before the state becomes "HARD" and a notification is dispatched. If I wanted to double that time before a "HARD" state (and thus a notification) is triggered, I could simply double the original max_check_attempts:

    define service {
        host_name               192.168.67.99
        service_description     Drive C: Disk Usage
        check_command           check_xi_service_wmiplus!'admin'!'welcome123'!checkdrivesize!-a 'C': -w '80' -c '95'
        max_check_attempts      10
        check_interval          5
        retry_interval          1
        contacts                nagiosadmin
    }

This is effectively the same thing as saying "ignore the first critical, only let me know if it's really, really a problem and I get a second critical".
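Since the original question asked how to do this for ALL services, the same directives could live in a shared service template that every service inherits. A sketch under that assumption (the template name and the CHECKNAME/CHECKCOMMAND placeholders are illustrative):

    define service{
        name                    soft-critical-template    ; hypothetical template name
        max_check_attempts      2    ; a single CRITICAL stays SOFT; the second makes it HARD
        check_interval          5
        retry_interval          1
        register                0    ; template only, never registered as a real service
    }

    define service{
        use                     soft-critical-template
        host_name               myhost
        service_description     CHECKNAME
        check_command           CHECKCOMMAND
        contacts                nagiosadmin
    }

Any service that sets "use soft-critical-template" then picks up the retry behavior without repeating the directives.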
Former Nagios employee
https://www.mcapra.com/
Re: prevent first critical notification
Thank you very much. But I think you did not understand 100% of my problem, or maybe I explained it wrong!
I even completely agree with your post. But what I meant was:
11pm: a WARNING HARD state is shown in Nagios because at 10:56pm the service became WARNING, five retries were made, HARD status was reached, and an email notification was sent (that is OF COURSE desired, because it makes sense to check this later, even if the problem recovers in the night when nobody sees it)
0am: still HARD WARNING, nobody has seen the problem, that's okay, the admin can handle this problem at 8am when he's at work
1am: still HARD WARNING, nobody has seen the problem, that's okay, the admin can handle this problem at 8am when he's at work
2am: still HARD WARNING, nobody has seen the problem, that's okay, the admin can handle this problem at 8am when he's at work
3am: still HARD WARNING, nobody has seen the problem, that's okay, the admin can handle this problem at 8am when he's at work
Now the problem:
4am: one check fails with a CRITICAL status (network timeout), and immediately the admin gets an SMS. The next check is again a WARNING, so it would be better for the admin to get an SMS only when there is really, really a problem, not just because one check fails...
Or did I get you wrong? If so, I am sorry, but then I really did not understand what you mean. I think in your example the same thing would happen, wouldn't it?
Re: prevent first critical notification
How about first_notification_delay?
https://assets.nagios.com/downloads/nag ... .html#host
This will impose a delay, measured from the last time an OK was recorded for the object, before the first problem notification goes out.
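A sketch of how that might look in a service definition (the host, check command, and values are illustrative; first_notification_delay is expressed in interval_length units, typically minutes):

    define service{
        host_name                   myhost
        service_description         CHECKNAME
        check_command               CHECKCOMMAND
        max_check_attempts          5
        check_interval              5
        retry_interval              1
        first_notification_delay    10    ; wait 10 time units after the problem begins before the first notification
        contacts                    nagiosadmin
    }

A short CRITICAL blip that recovers within the delay window would then never trigger the SMS.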
Previous Nagios employee