I'm trying to figure out why one of my service checks does not send a notification after it hard fails. I've cross-checked the alert config against other services, and it's configured the same. Any ideas on what to check?
Service Ok[07-22-2020 11:32:16] SERVICE ALERT: servername;Server Memory Usage;OK;SOFT;4;OK: physical: Total: 24GB - Used: 16.767GB (69%) - Free: 7.233GB (30%)
Service Critical[07-22-2020 11:31:17] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;3;CRITICAL: physical: Total: 24GB - Used: 23.819GB (99%) - Free: 184.332MB (0%)
Service Critical[07-22-2020 11:30:18] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;2;CRITICAL: physical: Total: 24GB - Used: 23.83GB (99%) - Free: 173.965MB (0%)
Service Critical[07-22-2020 11:29:19] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;1;CRITICAL: physical: Total: 24GB - Used: 23.826GB (99%) - Free: 177.676MB (0%)
Service Ok[07-22-2020 11:24:21] SERVICE ALERT: servername;Server Memory Usage;OK;HARD;5;OK: physical: Total: 24GB - Used: 18.628GB (77%) - Free: 5.371GB (22%)
Service Critical[07-22-2020 11:14:23] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;HARD;5;CRITICAL: physical: Total: 24GB - Used: 23.873GB (99%) - Free: 129.809MB (0%)
Service Critical[07-22-2020 11:13:25] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;4;CRITICAL: physical: Total: 24GB - Used: 23.936GB (99%) - Free: 65.438MB (0%)
Service Critical[07-22-2020 11:12:26] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;3;CRITICAL: physical: Total: 24GB - Used: 23.918GB (99%) - Free: 83.621MB (0%)
Service Critical[07-22-2020 11:11:27] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;2;CRITICAL: physical: Total: 24GB - Used: 23.924GB (99%) - Free: 77.359MB (0%)
Service Critical[07-22-2020 11:10:28] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;SOFT;1;CRITICAL: physical: Total: 24GB - Used: 22.92GB (95%) - Free: 1.08GB (4%)
Notifications not triggering on hard fail
Re: Notifications not triggering on hard fail
There could be many reasons why a notification is not being sent. Was the host down at the time the service was in a critical state? Were notifications disabled in the GUI? Is this service assigned to a "contact only" or a "xi user, who is also a contact"? If this is a "xi user", what are the user's notification preferences? Perhaps the "Service Critical" check-box was not selected.
I would recommend that you start troubleshooting the issue by following the steps outlined in the KB article below:
https://support.nagios.com/kb/article/n ... ms-36.html
Let us know if this helped. Thank you!
Re: Notifications not triggering on hard fail
Notifications were not disabled in the GUI and it's being sent to a contact group that other service checks use. I'm wondering if it has to do with the delay on first notification and the timing of the issue because it hard failed but then recovered. If the notification delay is set for 5 minutes, does it wait another 5 min (next check interval) before sending the alert?
scottwilkerson
Re: Notifications not triggering on hard fail
If you have a notification delay set and the host/service recovers before the notification delay is reached, then no notification is sent. That is the whole purpose of the delay.
swolf
Re: Notifications not triggering on hard fail
Hi @nickap,
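If I'm not mistaken, the lines you're confused about are:
Service Ok[07-22-2020 11:24:21] SERVICE ALERT: servername;Server Memory Usage;OK;HARD;5;OK: physical: Total: 24GB - Used: 18.628GB (77%) - Free: 5.371GB (22%)
Service Critical[07-22-2020 11:14:23] SERVICE ALERT: servername;Server Memory Usage;CRITICAL;HARD;5;CRITICAL: physical: Total: 24GB - Used: 23.873GB (99%) - Free: 129.809MB (0%)
with the confusion coming from the fact that there are ~10 minutes between the CRITICAL HARD and the OK HARD recovery, when the notification delay was only set to 5 minutes.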
The way first_notification_delay works is that it causes a minimum delay of however many minutes you set, and it will notify on the next check after the interval. In this case, it looks to me like the checks were run just slightly less than 5 minutes apart, i.e.:
11:14:23 - First CRITICAL HARD occurs, notification delay begins.
11:19:22 - Second CRITICAL HARD occurs. The notification delay was 5 minutes and this check ran 4:59 after the first, so no notification was sent.
11:24:21 - Service recovers to OK HARD. The notification delay had passed by then, but this check is OK, so no notification was sent.
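If you want to double-check those gaps yourself, here is a minimal Python sketch that pulls the timestamps out of the two SERVICE ALERT lines and subtracts them. The regex and the `parse_alert` helper are my own illustration, not anything shipped with Nagios, and the 11:19:22 check is the inferred one from the timeline above (it does not appear in the log excerpt, since only state changes are logged there).

```python
import re
from datetime import datetime

# Hypothetical pattern for the SERVICE ALERT lines quoted in this thread:
# [timestamp] SERVICE ALERT: host;service;state;state_type;attempt;output
ALERT_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]*);(?P<service>[^;]*);(?P<state>[^;]*);"
    r"(?P<type>[^;]*);(?P<attempt>\d+);(?P<output>.*)"
)


def parse_alert(line):
    """Return the alert fields as a dict, or None if the line does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%m-%d-%Y %H:%M:%S")
    fields["attempt"] = int(fields["attempt"])
    return fields


lines = [
    "Service Critical[07-22-2020 11:14:23] SERVICE ALERT: servername;Server Memory Usage;"
    "CRITICAL;HARD;5;CRITICAL: physical: Total: 24GB - Used: 23.873GB (99%) - Free: 129.809MB (0%)",
    "Service Ok[07-22-2020 11:24:21] SERVICE ALERT: servername;Server Memory Usage;"
    "OK;HARD;5;OK: physical: Total: 24GB - Used: 18.628GB (77%) - Free: 5.371GB (22%)",
]

hard_critical, hard_ok = (parse_alert(line) for line in lines)

# Gap between the CRITICAL HARD alert and the OK HARD recovery.
gap = hard_ok["ts"] - hard_critical["ts"]

# Gap to the inferred second CRITICAL check (assumed time, not in the log).
inferred_check = datetime(2020, 7, 22, 11, 19, 22)
inferred_gap = inferred_check - hard_critical["ts"]

print("CRITICAL HARD -> OK HARD:", gap)
print("CRITICAL HARD -> inferred 2nd check:", inferred_gap)
```

The first gap comes out just under 10 minutes and the second just under 5, which is the one-second shortfall described above.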
For now, I'd say this is working as intended - Nagios Core is allowed to shift check times slightly (within a few seconds) in order to spread out the CPU load on the system. If you want to make sure it always sends the notification on the second CRITICAL HARD alert, you can change the notification delay to be slightly shorter (e.g. 3 or 4), or if you want an immediate notification on a CRITICAL HARD alert, you can set it to 0.
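To see how the different delay values would have played out on this timeline, here is a rough Python sketch. It is a simplified model based on the description above, not Nagios Core's actual scheduling logic: the function name `first_notification_time` is hypothetical, the delay timer is assumed to start at the first non-OK HARD check, and the 11:19:22 check time is inferred rather than taken from the log.

```python
from datetime import datetime, timedelta


def first_notification_time(checks, delay_minutes):
    """Return the time of the first check that would notify, or None.

    Simplified model (an assumption, not Nagios Core's real code): the delay
    starts at the first non-OK HARD check, and a notification goes out on the
    first check that is still non-OK once the delay has fully elapsed.
    A recovery resets the timer.
    """
    delay = timedelta(minutes=delay_minutes)
    problem_start = None
    for when, state in checks:
        if state == "OK":
            problem_start = None  # recovered before any notification
            continue
        if problem_start is None:
            problem_start = when
        if when - problem_start >= delay:
            return when
    return None


# HARD-state checks from the timeline; 11:19:22 is the inferred check.
checks = [
    (datetime(2020, 7, 22, 11, 14, 23), "CRITICAL"),
    (datetime(2020, 7, 22, 11, 19, 22), "CRITICAL"),
    (datetime(2020, 7, 22, 11, 24, 21), "OK"),
]

for delay in (5, 4, 0):
    print(f"delay={delay} min ->", first_notification_time(checks, delay))
```

Under these assumptions, a delay of 5 never notifies (the second check is one second short), a delay of 4 notifies on the 11:19:22 check, and a delay of 0 notifies immediately at 11:14:23.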
Please let us know if you need further clarification, or have any further questions or concerns.