Notifications not triggering on hard fail

Posted: Wed Jul 22, 2020 11:20 am
by nickap
I'm trying to figure out why one of my service checks does not send a notification after it hard fails. I've cross-checked the alert config against other services, and it's configured the same way. Any ideas on what to check?


Service Ok[07-22-2020 11:32:16] SERVICE ALERT: servername;Server Memory Usage;OK;SOFT;4;OK: physical: Total: 24GB - Used: 16.767GB (69%) - Free: 7.233GB (30%)
Service Critical[07-22-2020 11:31:17] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;3;CRITICAL: physical: Total: 24GB - Used: 23.819GB (99%) - Free: 184.332MB (0%)
Service Critical[07-22-2020 11:30:18] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;2;CRITICAL: physical: Total: 24GB - Used: 23.83GB (99%) - Free: 173.965MB (0%)
Service Critical[07-22-2020 11:29:19] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;1;CRITICAL: physical: Total: 24GB - Used: 23.826GB (99%) - Free: 177.676MB (0%)
Service Ok[07-22-2020 11:24:21] SERVICE ALERT: servername;Server Memory Usage;OK;HARD;5;OK: physical: Total: 24GB - Used: 18.628GB (77%) - Free: 5.371GB (22%)
Service Critical[07-22-2020 11:14:23] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;HARD;5;CRITICAL: physical: Total: 24GB - Used: 23.873GB (99%) - Free: 129.809MB (0%)
Service Critical[07-22-2020 11:13:25] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;4;CRITICAL: physical: Total: 24GB - Used: 23.936GB (99%) - Free: 65.438MB (0%)
Service Critical[07-22-2020 11:12:26] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;3;CRITICAL: physical: Total: 24GB - Used: 23.918GB (99%) - Free: 83.621MB (0%)
Service Critical[07-22-2020 11:11:27] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;2;CRITICAL: physical: Total: 24GB - Used: 23.924GB (99%) - Free: 77.359MB (0%)
Service Critical[07-22-2020 11:10:28] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;1;CRITICAL: physical: Total: 24GB - Used: 22.92GB (95%) - Free: 1.08GB (4%)

Re: Notifications not triggering on hard fail

Posted: Thu Jul 23, 2020 8:50 am
by lmiltchev
There could be many reasons why a notification is not being sent. Was the host down at the time the service was in a critical state? Were notifications disabled in the GUI? Is this service assigned to a "contact only" or to an "XI user who is also a contact"? If it is an XI user, what are that user's notification preferences? Perhaps the "Service Critical" checkbox was not selected.

I would recommend that you start troubleshooting the issue by following the steps outlined in the KB article below:

https://support.nagios.com/kb/article/n ... ms-36.html

Let us know if this helped. Thank you!

Re: Notifications not triggering on hard fail

Posted: Thu Jul 23, 2020 11:57 am
by nickap
Notifications were not disabled in the GUI and it's being sent to a contact group that other service checks use. I'm wondering if it has to do with the delay on first notification and the timing of the issue because it hard failed but then recovered. If the notification delay is set for 5 minutes, does it wait another 5 min (next check interval) before sending the alert?

Re: Notifications not triggering on hard fail

Posted: Thu Jul 23, 2020 12:19 pm
by scottwilkerson
If you have a notification delay set and the host/service recovers before the notification delay is reached, then no notification is sent. That is the whole purpose of the delay.

Re: Notifications not triggering on hard fail

Posted: Thu Jul 23, 2020 12:46 pm
by swolf
Hi @nickap,

If I'm not mistaken, the lines you're confused about are

Service Ok[07-22-2020 11:24:21] SERVICE ALERT: servername;Server Memory Usage;OK;HARD;5;OK: physical: Total: 24GB - Used: 18.628GB (77%) - Free: 5.371GB (22%)
Service Critical[07-22-2020 11:14:23] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;HARD;5;CRITICAL: physical: Total: 24GB - Used: 23.873GB (99%) - Free: 129.809MB (0%)
with the confusion coming from the fact that there's ~10 minutes between the CRITICAL HARD and the OK HARD recovery, when the notification delay was only set to 5 minutes.

first_notification_delay sets a minimum delay of however many minutes you configure, and the notification then goes out on the first check that runs after that delay has elapsed. In this case, it looks to me like the checks were run just slightly less than 5 minutes apart, i.e.:
11:14:23 - First CRITICAL HARD occurs, notification delay begins.
11:19:22 - Second CRITICAL HARD occurs. Notification delay was 5 minutes, this happened 4:59 later, so no notification was sent.
11:24:21 - Service recovers to OK HARD. The notification delay was passed, but this check is OK, so no notification was sent.
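
To make the arithmetic in that timeline explicit, here is a small Python sketch. It is a simplified model of the behaviour described above, not Nagios Core's actual logic. The 5-minute delay comes from this thread, the 11:19:22 check is inferred rather than taken from the posted log, and the names `evaluate`, `FMT`, and `checks` are made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%m-%d-%Y %H:%M:%S"
delay = timedelta(minutes=5)  # assumed first_notification_delay of 5 minutes

# First CRITICAL HARD from the posted log; the delay is counted from here.
first_hard = datetime.strptime("07-22-2020 11:14:23", FMT)

# Later hard-state checks. The 11:19:22 entry is inferred, not in the log.
checks = [
    ("07-22-2020 11:19:22", "CRITICAL"),
    ("07-22-2020 11:24:21", "OK"),
]

def evaluate(first_hard, checks, delay):
    """Return (timestamp, state, elapsed, notify) for each check."""
    rows = []
    for stamp, state in checks:
        elapsed = datetime.strptime(stamp, FMT) - first_hard
        # A problem notification needs a non-OK state and an elapsed delay.
        notify = state != "OK" and elapsed >= delay
        rows.append((stamp, state, elapsed, notify))
    return rows

rows = evaluate(first_hard, checks, delay)
for stamp, state, elapsed, notify in rows:
    print(stamp, state, elapsed, "notify" if notify else "no notification")
```

The first check lands 4 minutes 59 seconds after the hard failure, one second short of the delay, and the second is already a recovery, so neither produces a problem notification in this model.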

For now, I'd say this is working as intended - Nagios Core is allowed to shift check times slightly (within a few seconds) in order to spread out the CPU load on the system. If you want to make sure it always sends the notification on the second CRITICAL HARD alert, you can change the notification delay to be slightly shorter (e.g. 3 or 4), or if you want an immediate notification on a CRITICAL HARD alert, you can set it to 0.
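
As a rough illustration of how the delay value changes the outcome for this particular incident, the sketch below replays the same three hard-state checks with delays of 0, 4, and 5 minutes. It is a simplified model under the assumptions stated in this thread (the delay starts at the first CRITICAL HARD, and the 11:19:22 check is inferred), and `first_problem_notification` is a hypothetical helper, not part of Nagios.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"

# Hard-state check results from the thread; the 11:19:22 check is inferred.
CHECKS = [("11:14:23", "CRITICAL"), ("11:19:22", "CRITICAL"), ("11:24:21", "OK")]

def first_problem_notification(checks, delay_minutes):
    """Return the time of the first problem notification, or None."""
    delay = timedelta(minutes=delay_minutes)
    problem_start = None
    for stamp, state in checks:
        when = datetime.strptime(stamp, FMT)
        if state == "OK":
            problem_start = None  # recovery ends the current problem window
            continue
        if problem_start is None:
            problem_start = when  # first hard problem starts the delay
        if when - problem_start >= delay:
            return stamp
    return None

results = {d: first_problem_notification(CHECKS, d) for d in (0, 4, 5)}
print(results)
```

In this model a delay of 0 notifies at 11:14:23, a delay of 4 notifies at 11:19:22, and a delay of 5 never notifies because the service recovers first.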

Please let us know if you need further clarification, or have any further questions or concerns.