Alerts not triggered

altsysrq
Posts: 17
Joined: Thu Feb 26, 2015 12:35 pm

Alerts not triggered

Post by altsysrq »

We have a few services that do not trigger alerts when the service they monitor on a host has failed. These service monitors were fully functioning at some point and would send alerts when there was a failure. We discovered the issue recently when a host was completely down and alerts came in for only some of that host's service monitors. If we delete a non-alerting service monitor and recreate it, the new monitor sends alerts correctly when there is a failure.

My concern is whether other service monitors that we have not discovered could have a similar issue. We do not want to go through the process of testing/deleting/creating all of them.

Tests:
  • Submitting a passive check result of CRITICAL on any service produces an alert
  • Forcing a failure on the node the service monitor is checking does not trigger an alert (after the configured number of check attempts before an alert is generated)
  • Some services for the same host will send an alert when there is a failure
  • Services that are able to send alerts show attempts to do so in the Nagios and email logs
  • Services that are not able to send alerts do not show attempts to do so in the Nagios and email logs
  • Enabling debug for Nagios did not reveal any attempt by the problem service to trigger an alert (using tail -F /usr/local/nagios/var/nagios.debug)
  • Deleting a service monitor and recreating it produces a fully functioning service monitor that alerts us when there is a failure
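For reference, the passive check test in the first item can also be done from the command line. This is only a minimal sketch: it assumes the Nagios external command file lives at /usr/local/nagios/var/rw/nagios.cmd, and the function names and the NAGIOS_CMD_FILE override are hypothetical helpers made up for illustration.

```shell
#!/bin/sh
# Build a PROCESS_SERVICE_CHECK_RESULT external command line.
# Arguments: epoch timestamp, host name, service description,
# return code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN), plugin output.
build_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Append the command to the Nagios command file.
# The default path is an assumption; NAGIOS_CMD_FILE is a hypothetical override.
# Arguments: host name, service description, return code, plugin output.
submit_passive_result() {
    cmd_file="${NAGIOS_CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"
    build_passive_result "$(date +%s)" "$1" "$2" "$3" "$4" >> "$cmd_file"
}

# Example (placeholder host and service names):
# submit_passive_result "node1" "Web Page Content" 2 "CRITICAL - manual test"
```

Pointing NAGIOS_CMD_FILE at a scratch file first is a cheap way to look at the generated line before sending anything to the live command file.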
Configuration:
  • Services that send alerts and services that do not appear to be configured the same
  • Service monitors are set to check every 5 minutes
  • When a problem is detected, check every 1 minute for 5 minutes before sending an alert
  • Notifications go to the same groups when comparing service monitors that send alerts and those that do not
Services Core configuration (same for alerting and non-alerting):
  • Manage hosts are the same and use the same host
  • Templates: xiwizard_website_http_content_service
  • Manage host groups: 0
  • Manage service groups: 1 (tried removing this so it was 0, thinking the group might block alerts, no luck when testing non-alerting service monitors)
  • Active: checked
  • Initial State: <none selected>
  • Check interval: 5
  • Retry interval: 1
  • Max attempts: 5
  • Active checks enabled: Skip
  • Passive checks enabled: Skip
  • Check period: xi_timeperiod_24x7
  • Freshness threshold: <blank>
  • Check freshness: Skip
  • Obsess over service: Skip
  • Event handler: <blank>
  • Event handler enabled: Skip
  • Low flap threshold: <blank>
  • High flap threshold: <blank>
  • Flap detection enabled: Skip
  • Flap detection options: <none selected>
  • Retain status information: Skip
  • Retain non-status information: Skip
  • Process perf data: Skip
  • Is volatile: Skip
  • Manage Contacts: same for both services
  • Manage contact groups: none
  • Notification period: xi_timeperiod_24x7
  • Notification options: Warning, Critical, Unknown, Recovery, Flapping, Scheduled Downtime
  • Notification interval: 60
  • First notification delay: 0
  • Notification enabled: On (we have tested Skip here and services that send alerts continue to function, those that do not continue to not send alerts)
  • Stalking options: <none selected>
Software and operating system:
  • Nagios XI 5.5.8
  • RedHat 7.6
I have walked through and followed most of the information here: Is there any more troubleshooting we can do to find the issue? Is there a way to provide a fix for all unknown problem services?
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Alerts not triggered

Post by npolovenko »

@altsysrq, when a service doesn't send the notification, does it go into a hard state or does it stay in a soft state? Please generate a state history report for the service covering the time when the notification was supposed to go out, and see whether the service was stuck in a soft state and never reached a hard state.
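The same soft/hard question can also be checked straight from the Nagios log. The following is a rough sketch, assuming the log is at /usr/local/nagios/var/nagios.log and uses the standard SERVICE ALERT and SERVICE NOTIFICATION line formats; service_log_events is a hypothetical helper name.

```shell
#!/bin/sh
# Print SERVICE ALERT and SERVICE NOTIFICATION log lines for one host/service.
# Arguments: host name, service description, log file (default path is an assumption).
service_log_events() {
    log_file="${3:-/usr/local/nagios/var/nagios.log}"
    awk -F';' -v h="$1" -v s="$2" '
        # Alert lines: [epoch] SERVICE ALERT: host;service;state;SOFT|HARD;attempt;output
        /^\[[0-9]+\] SERVICE ALERT: / {
            split($1, a, "SERVICE ALERT: ")
            if (a[2] == h && $2 == s) print
        }
        # Notification lines: [epoch] SERVICE NOTIFICATION: contact;host;service;state;command;output
        /^\[[0-9]+\] SERVICE NOTIFICATION: / {
            if ($2 == h && $3 == s) print
        }
    ' "$log_file"
}

# Example (placeholder names):
# service_log_events "node1" "Web Page Content"
```

If the alert lines reach HARD but no SERVICE NOTIFICATION line follows, the check logic is doing its part and the problem is on the notification side.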
altsysrq
Posts: 17
Joined: Thu Feb 26, 2015 12:35 pm

Re: Alerts not triggered

Post by altsysrq »

No new items appear in the Service State History for the service when I create a failure. There are leftover items from my previous day's testing that correlate with some of the passive check results I submitted for the service. Every item that is reported shows HARD as the state type, whether the state is OK or CRITICAL.

2018-12-18 15:22:43 node1 Web Page Content CRITICAL HARD 1 of 5 connect to address 192.168.1.88 and port 80: No route to host
2018-12-18 14:20:41 node1 Web Page Content OK HARD 1 of 1 HTTP OK: HTTP/1.1 200 OK - 5141 bytes in 0.009 second response time
2018-12-18 14:20:26 node1 Web Page Content CRITICAL HARD 1 of 1 NOK
2018-12-18 14:12:33 node1 Web Page Content OK HARD 1 of 1 HTTP OK: HTTP/1.1 200 OK - 5141 bytes in 0.006 second response time
2018-12-18 14:12:02 node1 Web Page Content CRITICAL HARD 1 of 1 CRITICAL
2018-12-18 14:08:46 node1 Web Page Content OK HARD 1 of 1 HTTP OK: HTTP/1.1 200 OK - 5141 bytes in 0.007 second response time

One thing worth noting, which I failed to mention, is that plenty of indexes were repaired when I ran /usr/local/nagiosxi/scripts/repair_databases.sh the first time, which resulted in REPAIR COMPLETE. When I ran it again after that, it output the same messages each time: it found indexes that needed repair, fixed them, and then reported repair complete. I have attached the output of the repair_databases.sh script.
Attachments
repair_database.txt
(19.12 KiB) Downloaded 121 times
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Alerts not triggered

Post by tgriep »

In earlier versions of XI 5.5.x, there was an issue that affected the notification counters that Nagios uses to determine whether a notification should be sent.
If you had an older version of XI 5.5.x running on the server, that would cause the issue you are having.
From the way you describe how the notifications are behaving, it sounds like you were running an older version and the system had the counter bug.

You posted that you are running the latest version of XI, which is good, but the older retention.dat file needs to be removed so the counters can be rebuilt.

To remove the file, run the following as root

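# Stop Nagios before touching the retention file
service nagios stop
# Remove the old retention file so the notification counters are rebuilt
rm /usr/local/nagios/var/retention.dat
service nagios start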
Warning: doing this will remove all of the notes and manually scheduled downtimes, and will cause the system to recheck all of the hosts and services.
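Because the removal discards that information, it may be worth keeping a copy of the file first. A minimal sketch, meant to be run after Nagios is stopped and before the rm in the commands above; backup_retention is a hypothetical helper and the backup naming scheme is just an illustration.

```shell
#!/bin/sh
# Copy the retention file to a timestamped backup and print the backup path.
# Argument: path to retention.dat (defaults to the path used in this thread).
backup_retention() {
    src="${1:-/usr/local/nagios/var/retention.dat}"
    dest="${src}.$(date +%Y%m%d%H%M%S).bak"
    # -p keeps ownership, mode and timestamps where possible
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}

# Example:
# backup_retention /usr/local/nagios/var/retention.dat
```

Keep in mind that putting the old file back would presumably bring the stale counters back with it, so the copy is mainly useful as a reference for re-entering notes and downtimes.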

Try that out and let us know if the notifications start to work.
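To look for other services that might be affected without recreating each one, something like the following could be run against the status file. It is only a heuristic sketch: it assumes /usr/local/nagios/var/status.dat (or retention.dat) stores each service as a block of key=value lines with the fields current_state, state_type, notifications_enabled and current_notification_number, and find_silent_problems is a hypothetical helper name. It lists services that are in a hard non-OK state with notifications enabled but a notification count of zero. Services that are acknowledged, in downtime, or on a down host will show up as false positives.

```shell
#!/bin/sh
# List "host;service" for services in a HARD non-OK state that have
# notifications enabled but have not sent any notification yet.
# Argument: status or retention file (default path is an assumption).
find_silent_problems() {
    awk '
        # Start of a service block ("servicestatus {" or "service {")
        /^[ \t]*service(status)?[ \t]*\{[ \t]*$/ { in_block = 1; split("", f); next }
        # End of the block: evaluate the collected fields
        in_block && /^[ \t]*\}[ \t]*$/ {
            if ((f["current_state"] + 0) != 0 && (f["state_type"] + 0) == 1 &&
                (f["notifications_enabled"] + 0) == 1 &&
                (f["current_notification_number"] + 0) == 0)
                print f["host_name"] ";" f["service_description"]
            in_block = 0
            next
        }
        # Inside a block: store key=value pairs, splitting on the first "="
        in_block {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq > 0) f[substr(line, 1, eq - 1)] = substr(line, eq + 1)
        }
    ' "${1:-/usr/local/nagios/var/status.dat}"
}

# Example:
# find_silent_problems /usr/local/nagios/var/status.dat
```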


The index fix message is expected: the command that repairs the tables also re-indexes them, so as long as the repair runs without any errors, that message is OK.
altsysrq
Posts: 17
Joined: Thu Feb 26, 2015 12:35 pm

Re: Alerts not triggered

Post by altsysrq »

tgriep,

Thank you. That worked very well.

Sorry for the delay. I was on vacation for a few weeks and actually ended up drastically breaking our Nagios server for a few hours yesterday (separate issue).
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Alerts not triggered

Post by tgriep »

You're welcome. Glad the server is working now. I'll close and lock the post for you, but feel free to open a new ticket in the future if you have any further questions.