Last night a host, and a service on another host, went down and we did not receive notifications for either from Nagios XI. I only found out because another host was affected, and for that one Nagios XI sent notifications as it should have. I checked at the time: both were definitely in Unhandled status, had notifications enabled, and had contact groups assigned. Looking at the Notifications log in the XI web interface, it is clear that Nagios XI wasn't even trying to send notifications to anyone for this host or the service in question. We have far fewer notifications in that log than normal since the upgrade to 5.5.1 from 5.4. Our Linux admin also noticed the day before that he wasn't receiving notifications for something that showed the same behavior.
Could you tell me what the next troubleshooting step should be? I'm sure there's a log somewhere I need to check to find out what's going on. Thanks!
Some notifications not firing after upgraded to 5.5.1
=================================================
Nagios XI 5.6.5 Enterprise
CentOS 6.10 (64-bit) VMware image
SSL implemented and forced, with exception for localhost
Re: Some notifications not firing after upgraded to 5.5.1
You should definitely check the event log and search for the host/service that did not send a notification. Find the alert and see whether Nagios tried to notify; if it had, though, it would have shown up in the Notifications section. The next most obvious thing to check is that the host/service actually has settings that allow it to alert. To view the flattened definition, look up the host/service in objects.cache (/usr/local/nagios/var/objects.cache) and check its notification_options. There was no scheduled downtime or anything similar during this, right?
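A minimal sketch of that objects.cache lookup, for anyone following along. It assumes the usual cache layout (one `define service {` block per service, one tab-separated directive per line); the function name `show_notification_settings` is made up for this example and is not part of Nagios.

```shell
# Print the notification- and contact-related directives of one service
# definition from a Nagios objects.cache file.
# Assumption: blocks look like "define service {" ... "}" with one
# whitespace-separated "directive value" pair per line.
show_notification_settings() {
    cache_file="$1"; host="$2"; service="$3"
    awk -v host="$host" -v svc="$service" '
        # Start of a service block: reset everything collected so far
        /^define service[ \t]*\{/ { in_block = 1; h = ""; s = ""; n = 0; next }
        # End of a block: print the collected lines only if it is the one we want
        in_block && /^[ \t]*\}/ {
            if (h == host && s == svc)
                for (i = 1; i <= n; i++) print lines[i]
            in_block = 0; next
        }
        in_block {
            line = $0
            sub(/^[ \t]+/, "", line)                    # strip leading indentation
            key = line; sub(/[ \t].*$/, "", key)        # directive name
            val = line; sub(/^[^ \t]+[ \t]+/, "", val)  # directive value
            if (key == "host_name") h = val
            if (key == "service_description") s = val
            if (key ~ /notification|^contact/) lines[++n] = key "=" val
        }
    ' "$cache_file"
}
```

It could be called as `show_notification_settings /usr/local/nagios/var/objects.cache "Kane" "Svc - SCCM"`, and the output compared against the same call for a service that does notify.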
Re: Some notifications not firing after upgraded to 5.5.1
There is clearly an issue after checking the event log.
Here is a service that sends notifications properly and is configured identically from what I can see:
SERVICE ALERT: Baubo;Svc - Apache Tomcat 8;OK;HARD;1;Tomcat8=running (auto)
SERVICE ALERT: Baubo;Svc - Apache Tomcat 8;CRITICAL;HARD;5;critical(Tomcat8=stopped (auto))
SERVICE ALERT: Baubo;Svc - Apache Tomcat 8;UNKNOWN;HARD;5;Failed to open service manager: 1115: A system shutdown is in progress.
SERVICE ALERT: Baubo;Svc - Apache Tomcat 8;CRITICAL;SOFT;1;critical(Tomcat8=stopping (auto))
And here is a service that does not trigger notifications to be sent:
SERVICE ALERT: Kane;Svc - SCCM;CRITICAL;SOFT;1;critical(SMS_EXECUTIVE=stopped (auto))
It never gets past that first service alert. The same goes for another service I tried. All of these services and hosts were sending notifications right before we upgraded from 5.4.13 to 5.5.1. I have evidence of this because a temporary network misconfiguration on a core router brought every single host and service down, resulting in about 14,500 notifications. So notifications worked great before the upgrade and are not triggering after it. Some of them still work, though, and I can't see a difference in the config on those.
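To look for other services in the same situation, something like the following could scan the Nagios log for services whose most recent alert is a non-OK SOFT state. This is only a sketch: it assumes a chronological log (oldest entry first, as in nagios.log, whereas the excerpts above are newest first) with lines of the form `[timestamp] SERVICE ALERT: host;service;state;type;attempt;output`, and the function name `find_stuck_soft_services` is invented here. A service that is simply in the middle of its retries will also show up, so the result is a list of candidates, not proof.

```shell
# List services whose latest SERVICE ALERT entry is a non-OK SOFT state.
# Assumption: the log is in chronological order, oldest entry first.
find_stuck_soft_services() {
    log_file="$1"
    awk -F';' '
        /SERVICE ALERT: / {
            host = $1
            sub(/^.*SERVICE ALERT: /, "", host)      # drop timestamp and label
            key = host ";" $2                        # host;service
            if (!(key in last)) order[++n] = key     # remember first-seen order
            last[key] = $3 ";" $4                    # latest state;state type
        }
        END {
            for (i = 1; i <= n; i++) {
                split(last[order[i]], st, ";")
                if (st[1] != "OK" && st[2] == "SOFT")
                    print order[i] ";" last[order[i]]
            }
        }
    ' "$log_file"
}
```

For example, `find_stuck_soft_services /usr/local/nagios/var/nagios.log` (the path is the default location on an XI install, so adjust it if yours differs).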
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Some notifications not firing after upgraded to 5.5.1
Are the hosts down for the services that are not sending?
Re: Some notifications not firing after upgraded to 5.5.1
No, the hosts were left operational. I literally just picked a random service from a random host that was previously sending email notifications days before to test. The other service I tested was the one that I mentioned that was down recently when another related host (a shared database server) was completely down.
Re: Some notifications not firing after upgraded to 5.5.1
I have a feeling this may be due to a reported bug about some settings not getting refactored properly in Core:
https://github.com/NagiosEnterprises/na ... issues/557
I added this thread to that ticket so that they can notify this thread when a fix is available.
Re: Some notifications not firing after upgraded to 5.5.1
I believe I found the cause in Core, and it is fixed in the maint branch on GitHub:
https://github.com/NagiosEnterprises/na ... ee/maint
To build and install it:

wget https://github.com/NagiosEnterprises/nagioscore/archive/maint.tar.gz
tar xzf maint.tar.gz
cd nagioscore-maint
configureflags="--with-command-group=nagcmd"
# Use the SysV init type when systemd is absent or an init script already exists
if ! command -v systemctl >/dev/null 2>&1 || [ -f /etc/init.d/nagios ]; then
    configureflags="--with-init-type=sysv $configureflags"
fi
# Left unquoted on purpose so each flag reaches configure as a separate argument
./configure $configureflags
make -j 2 all
make install
service nagios restart

After this, once the services stuck in a soft state return to an OK state, either naturally or by stopping Nagios and removing retention.dat, they should no longer get stuck.
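The retention.dat step mentioned above could look roughly like this. It is a cautious sketch, not the official procedure: it moves the file aside instead of deleting it, the helper name `set_aside_retention` is hypothetical, and it assumes Nagios has already been stopped before it is called.

```shell
# Move a Nagios retention file aside so Nagios starts without retained state.
# Assumption: Nagios is already stopped when this is called.
# Prints the path of the backup copy on success.
set_aside_retention() {
    retention_file="$1"
    backup_file="${retention_file}.bak.$(date +%Y%m%d%H%M%S)"
    if [ ! -f "$retention_file" ]; then
        echo "no retention file at $retention_file" >&2
        return 1
    fi
    mv "$retention_file" "$backup_file" && echo "$backup_file"
}
```

A possible sequence would be `service nagios stop`, then `set_aside_retention /usr/local/nagios/var/retention.dat` (the default path on an XI install, so check yours), then `service nagios start`. Keep in mind that starting without the retention file also drops other retained state, so waiting for the services to recover naturally may be the gentler option.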