Alerts not triggered

Post by **altsysrq** » Wed Dec 19, 2018 11:27 am

We have a few services that do not trigger alerts when the host service has failed. These service monitors were fully functioning at some point and would send alerts when there was a failure. We recently discovered the issue when a host was completely down and alerts came in for only some of the service monitors on that host. If we delete the service monitors for the non-alerting services and recreate them, they send alerts on failure again.

My concern is whether other service monitors that we have not discovered could have a similar issue. We do not want to go through the process of testing/deleting/creating all of them.

Tests:

Submitting a passive check result of CRITICAL on any service does produce an alert
Forcing a failure on the node that the service monitor checks does not trigger an alert (even after the configured number of check attempts)
Some services for the same host will send an alert when there is a failure
Services that are able to send alerts show attempts to do so in the Nagios and email logs
Services that are not able to send alerts do not show attempts to do so in the Nagios and email logs
Enabling debug for Nagios did not reveal any attempt for the problem service to trigger an alert (using tail -F /usr/local/nagios/var/nagios.debug)
Deleting a service monitor and recreating it will create a fully functioning service monitor that alerts us when there is a failure
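
For reference, the passive-result test in the list above can be reproduced from the command line. This is a minimal sketch, assuming the standard Nagios Core external command `PROCESS_SERVICE_CHECK_RESULT` and a command file at /usr/local/nagios/var/rw/nagios.cmd (that path is an assumption; check `command_file` in nagios.cfg). The function name `passive_result_line` is made up for illustration.

```shell
#!/bin/sh
# Build one external-command line for a passive service check result.
# Format: [epoch] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return code>;<output>
# Return codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
passive_result_line() {
    # $1=epoch timestamp, $2=host name, $3=service description,
    # $4=return code, $5=plugin output
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example (left commented out): submit a CRITICAL result for one service.
# CMD_FILE is an assumed path, see the note above.
# CMD_FILE=/usr/local/nagios/var/rw/nagios.cmd
# passive_result_line "$(date +%s)" node1 "Web Page Content" 2 "forced test" > "$CMD_FILE"
```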

Configuration:

Services that send alerts and services that do not appear to be configured identically
Service monitors are set to check every 5 minutes
When a problem is detected, check every 1 minute for 5 minutes before sending an alert
Notifications go to the same groups when comparing service monitors that send alerts and those that do not
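
With these settings, the delay before a real failure can produce an alert is easy to work out. A rough sketch, assuming the intervals are in minutes (the Nagios default interval length of 60 seconds) and ignoring check latency and execution time; `worst_case_minutes` is a hypothetical helper name.

```shell
#!/bin/sh
# Worst-case minutes between a failure and the HARD state that triggers
# the first notification:
#   up to one check interval until the first failing check (attempt 1),
#   then (max attempts - 1) retries spaced by the retry interval.
worst_case_minutes() {
    # $1=check interval, $2=retry interval, $3=max check attempts
    echo $(( $1 + ($3 - 1) * $2 ))
}

worst_case_minutes 5 1 5   # prints 9
```

So a forced failure has to stay in place for roughly ten minutes before it is safe to conclude that no alert is coming.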

Services Core configuration (same for alerting and non-alerting):

Manage hosts are the same and use the same host
Templates: xiwizard_website_http_content_service
Manage host groups: 0
Manage service groups: 1 (tried removing this so it was 0, thinking the group might block alerts, no luck when testing non-alerting service monitors)
Active: checked
Initial State: <none selected>
Check interval: 5
Retry interval: 1
Max attempts: 5
Active checks enabled: Skip
Passive checks enabled: Skip
Check period: xi_timeperiod_24x7
Freshness threshold: <blank>
Check freshness: Skip
Obsess over service: Skip
Event handler: <blank>
Event handler enabled: Skip
Low flap threshold: <blank>
High flap threshold: <blank>
Flap detection enabled: Skip
Flap detection options: <none selected>
Retain Status information: Skip
Retain non-status information: Skip
Process perf data: Skip
Is volatile: Skip
Manage Contacts: same for both services
Manage contact groups: none
Notification period: xi_timeperiod_24x7
Notification options: Warning, Critical, Unknown, Recovery, Flapping, Scheduled Downtime
Notification interval: 60
First notification delay: 0
Notification enabled: On (we have tested Skip here and services that send alerts continue to function, those that do not continue to not send alerts)
Stalking options: <none selected>

Software and operating system:

Nagios XI 5.5.8
RedHat 7.6

I have walked through and followed most of the information here:

https://support.nagios.com/forum/viewto ... cb177af7a2
https://assets.nagios.com/downloads/nag ... tabase.pdf

Is there any more troubleshooting we can do to find the issue? Is there a way to provide a fix for all unknown problem services?

Post by **npolovenko** » Wed Dec 19, 2018 4:37 pm

@altsysrq, when a service doesn't send the notification, does it go into a hard state or does it stay in a soft state? Please generate a state history report for the service covering the time when the notification was supposed to go out, and see whether the service was stuck in a soft state and never reached a hard state.

Post by **altsysrq** » Wed Dec 19, 2018 5:49 pm

No entries seem to be added to the Service State History for the service when I create a failure. There are leftover entries from my previous day's testing that correlate with some of the passive check results I submitted for the service. All of the reported entries show HARD as the state type, whether the state is OK or CRITICAL.

2018-12-18 15:22:43 node1 Web Page Content CRITICAL HARD 1 of 5 connect to address 192.168.1.88 and port 80: No route to host
2018-12-18 14:20:41 node1 Web Page Content OK HARD 1 of 1 HTTP OK: HTTP/1.1 200 OK - 5141 bytes in 0.009 second response time
2018-12-18 14:20:26 node1 Web Page Content CRITICAL HARD 1 of 1 NOK
2018-12-18 14:12:33 node1 Web Page Content OK HARD 1 of 1 HTTP OK: HTTP/1.1 200 OK - 5141 bytes in 0.006 second response time
2018-12-18 14:12:02 node1 Web Page Content CRITICAL HARD 1 of 1 CRITICAL
2018-12-18 14:08:46 node1 Web Page Content OK HARD 1 of 1 HTTP OK: HTTP/1.1 200 OK - 5141 bytes in 0.007 second response time

One thing worth noting, which I failed to mention, is that plenty of indexes were repaired when I ran /usr/local/nagiosxi/scripts/repair_databases.sh the first time, which resulted in REPAIR COMPLETE. When I ran it again after that, it output the same messages each time: it found indexes that needed repair, fixed them, and then reported repair complete. I included the output of the repair_databases.sh script.

Post by **tgriep** » Thu Dec 20, 2018 1:00 pm

In earlier versions of XI 5.5.x, there was an issue that affected the notification counters that Nagios uses to determine whether a notification should be sent.
If you had an older version of XI 5.5.x running on the server, that would cause the issue you are having.
From the way you describe how the notifications are behaving, it sounds like you were running an older version and the system had the counter bug.
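
One way to look at those counters directly is to read them out of the retention file. This is a minimal sketch, assuming retention.dat stores each service as a `service { ... }` block of `key=value` lines with `host_name` and `service_description` entries, and that the counter fields have `notif` in their names (for example `current_notification_number`); exact field names may differ between versions. `show_notification_fields` is a hypothetical helper, not part of Nagios.

```shell
#!/bin/sh
# Print every notification-related key=value line stored for one service.
show_notification_fields() {
    # $1=retention file, $2=host name, $3=service description
    awk -v host="$2" -v svc="$3" '
        # Start of a service block: reset what we have collected so far.
        /^service[ \t]*[{]/ { in_block = 1; n = 0; h = ""; s = ""; next }
        # End of a block: print the collected lines if it is the one we want.
        in_block && /^[ \t]*[}]/ {
            if (h == host && s == svc)
                for (i = 1; i <= n; i++) print buf[i]
            in_block = 0; next
        }
        in_block {
            line = $0
            sub(/^[ \t]+/, "", line)            # drop leading indentation
            if (index(line, "host_name=") == 1) h = substr(line, 11)
            else if (index(line, "service_description=") == 1) s = substr(line, 21)
            else if (line ~ /^[^=]*notif/) buf[++n] = line   # key contains "notif"
        }
    ' "$1"
}

# Example (left commented out):
# show_notification_fields /usr/local/nagios/var/retention.dat node1 "Web Page Content"
```

Running it for one service that alerts and one that does not, on the same host, makes it easy to compare their stored counters side by side.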

You posted that you are running the latest version of XI, which is good, but the older retention.dat file needs to be removed so the counters can be rebuilt.

To remove the file, run the following as root:

```shell
service nagios stop
rm /usr/local/nagios/var/retention.dat
service nagios start
```

Warning: doing this will remove all of the Notes and manual Downtime schedules, and will cause the system to retest all of the hosts and services.
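
Given that warning, it may be worth keeping a copy of the file before deleting it. A small sketch, with `backup_retention` as a hypothetical helper name and the backup suffix chosen arbitrarily:

```shell
#!/bin/sh
# Copy the retention file next to itself with a timestamp suffix and
# print the path of the copy. Returns non-zero if the copy fails.
backup_retention() {
    # $1=path to the retention file
    src="$1"
    dest="${src}.bak.$(date +%Y%m%d%H%M%S)"
    cp -p "$src" "$dest" && echo "$dest"
}

# Example (left commented out), run as root with Nagios stopped:
# service nagios stop
# backup_retention /usr/local/nagios/var/retention.dat
# rm /usr/local/nagios/var/retention.dat
# service nagios start
```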

Try that out and let us know if the notifications start to work.

The index fix message is OK: the command that repairs the tables re-indexes them, and as long as the repair runs without any errors, that message is expected.

Post by **altsysrq** » Wed Jan 09, 2019 12:12 pm

tgriep,

Thank you. That worked very well.

Sorry for the delay. I was on vacation for a few weeks and actually ended up drastically breaking our Nagios server for a few hours yesterday (separate issue).

Post by **tgriep** » Wed Jan 09, 2019 1:42 pm

You're welcome. Glad the server is working now. I'll close and lock the post for you, but feel free to open a new ticket in the future for any further questions.

Nagios Support Forum