Global Event Handler Continuing to alert on service down
Since upgrading to NagiosXI 5.5.2, we are seeing some strange things.
We use the Global Event Handlers to send HOST and SERVICE changes to a script that interfaces with an in-house alerting system. This has been working fine for ages.
Since upgrading, when a node goes down we receive one alert for the host going down, and then an alert for the Ping service being down every minute, even after the maximum of 5 checks has failed (max_checks is set to 5).
I have attached our profile.
Regards
Kev
Last edited by kpascoe on Fri Aug 03, 2018 2:06 am, edited 1 time in total.
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Global Event Handler Continuing to alert on service down
I believe this could be caused by a bug in Core that is being worked on
https://github.com/NagiosEnterprises/na ... issues/557
I added this thread to the issue so you can be notified once the issue is resolved.
-
scottwilkerson
Re: Global Event Handler Continuing to alert on service down
I believe I found the cause in Core, and it is fixed in the maint branch on GitHub:
https://github.com/NagiosEnterprises/na ... ee/maint
Code:
wget https://github.com/NagiosEnterprises/nagioscore/archive/maint.tar.gz
tar xzf maint.tar.gz
cd nagioscore-maint
configureflags="--with-command-group=nagcmd"
# Use the sysv init type when systemctl is missing or a sysv init script exists
if ! command -v systemctl >/dev/null 2>&1 || [ -f /etc/init.d/nagios ]; then
    configureflags="--with-init-type=sysv $configureflags"
fi
# Left unquoted on purpose so each flag reaches configure as a separate argument
./configure $configureflags
make -j 2 all
make install
service nagios restart
After this, once the services stuck in a soft state return to OK (either naturally, or by stopping Nagios and removing retention.dat), they should no longer get stuck.
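The retention reset mentioned above could look roughly like the following. This is only a sketch: the retention file path is an assumption based on a default source install (check `state_retention_file` in nagios.cfg), and `reset_retention` and `run` are hypothetical helper names. It prints the commands unless `DRY_RUN=0` is set, since removing retention.dat discards whatever state Nagios has retained.

```shell
#!/bin/sh
# Hypothetical helper: stop Nagios, remove the retention file, start Nagios.
# RETENTION_FILE is an assumed default path, not taken from this thread.
RETENTION_FILE="${RETENTION_FILE:-/usr/local/nagios/var/retention.dat}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reset_retention() {
    run service nagios stop
    run rm -f "$RETENTION_FILE"
    run service nagios start
}

reset_retention
```

Running it as-is only lists the three steps; setting `DRY_RUN=0` would actually perform them.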
Re: Global Event Handler Continuing to alert on service down
I have compiled and installed the maint branch with no joy. I think the issue I am seeing is different. The service is not failing to stop notifying when it returns to an OK state. The issue is that it keeps notifying despite reaching the max checks of 5.
Our global event handler calls this for a service change:
Code:
/usr/local/nagios_scripts/service_change_handler.sh "%host%" "%hoststate%" "%service%" "%servicestate%" "%serviceoutput%" "%currentattempt%" "%maxattempts%"
As you can see from our script's log below, each time I press the "force immediate check" link in Nagios, a service change is detected, even when the max attempts of 5 has been reached. This didn't use to happen; we would stop getting service change alerts after the 5th attempt.
Is this something I will now have to code around, or is there a bug?
Code:
Fri Aug 3 08:26:07 BST 2018
Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "UP". Attempt No. 1 of max of 5
Fri Aug 3 08:26:07 BST 2018
Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL - Socket timeout" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 2 of max of 5
Fri Aug 3 08:26:28 BST 2018
Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 3 of max of 5
Fri Aug 3 08:26:43 BST 2018
Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 4 of max of 5
Fri Aug 3 08:27:08 BST 2018
Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 5 of max of 5
Service not ok after 5 attempts. Notifiying Tec!
/usr/bin/postemsg -S tec -m "Service Check NTP Time on host nodedowner.westernpower.co.uk is CRITICAL (NagiosXI)" event_nodename=nodedowner.westernpower.co.uk event_service="Check NTP Time" event_severity=CRITICAL service_output="CRITICAL: No response from NTP server" nagiosxi_service logfile
Error Deleting /etc/weccache
Fri Aug 3 08:27:32 BST 2018
Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL - Socket timeout" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 5 of max of 5
Service not ok after 5 attempts. Notifiying Tec!
/usr/bin/postemsg -S tec -m "Service Check NTP Time on host nodedowner.westernpower.co.uk is CRITICAL (NagiosXI)" event_nodename=nodedowner.westernpower.co.uk event_service="Check NTP Time" event_severity=CRITICAL service_output="CRITICAL - Socket timeout" nagiosxi_service logfile
Error Deleting /etc/weccache
Fri Aug 3 08:27:53 BST 2018
Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL - Socket timeout" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 5 of max of 5
Service not ok after 5 attempts. Notifiying Tec!
/usr/bin/postemsg -S tec -m "Service Check NTP Time on host nodedowner.westernpower.co.uk is CRITICAL (NagiosXI)" event_nodename=nodedowner.westernpower.co.uk event_service="Check NTP Time" event_severity=CRITICAL service_output="CRITICAL - Socket timeout" nagiosxi_service logfile
Error Deleting /etc/weccache
-
scottwilkerson
Re: Global Event Handler Continuing to alert on service down
There was one more commit to the maint branch last night that I believe fixes this issue.
I set up a test, and once a service goes HARD it doesn't keep firing the global event handler.
It is normal for it to execute the global event handler for each retry while in a SOFT state, and then once more for the HARD state.
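Given that behaviour, a handler that should alert only once could ignore SOFT states and act only on HARD ones. The sketch below assumes the state type is passed to the script as an extra argument. Nagios Core exposes it as the `$SERVICESTATETYPE$` macro; whether and how the XI global event handler passes it along is an assumption to verify. `should_notify` is a hypothetical function name.

```shell
#!/bin/sh
# should_notify STATE STATE_TYPE
# Succeeds (exit status 0) only for a non-OK service state that has reached HARD.
should_notify() {
    state="$1"
    state_type="$2"
    [ "$state" != "OK" ] && [ "$state_type" = "HARD" ]
}

# Example call with values as they might arrive from the handler command line.
if should_notify "CRITICAL" "HARD"; then
    echo "notify"
else
    echo "skip"
fi
```

With this check in place, the SOFT retries would still invoke the script but would fall through to the "skip" branch.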
Re: Global Event Handler Continuing to alert on service down
I've tried the latest maint patch again. Still no joy.
I think I can see the issue though. After 5 of 5 checks, the state is still set to soft; it doesn't seem to get set to hard.
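If the state type cannot be relied on, as described here, one possible workaround is for the handler itself to remember that it has already alerted. The sketch below is an illustration under stated assumptions, not the script used in this thread: `notify_once` and the `STATE_DIR` marker directory are hypothetical, and the arguments mirror the host, service, state, current attempt and max attempts values already passed to the handler.

```shell
#!/bin/sh
# Hypothetical de-duplication for an event handler: alert once when the attempt
# counter reaches the maximum, then stay quiet until the service is OK again.
STATE_DIR="${STATE_DIR:-/tmp/handler_state}"

# notify_once HOST SERVICE STATE ATTEMPT MAX_ATTEMPTS
# Prints "notify" the first time a non-OK service reaches MAX_ATTEMPTS and
# "skip" otherwise. An OK state clears the marker so the next outage alerts again.
notify_once() {
    host="$1"; service="$2"; state="$3"; attempt="$4"; max="$5"
    mkdir -p "$STATE_DIR" || return 1
    # Build a marker file name, replacing characters that are awkward in paths.
    marker="$STATE_DIR/$(printf '%s__%s' "$host" "$service" | tr -c 'A-Za-z0-9._-' '_')"
    if [ "$state" = "OK" ]; then
        rm -f "$marker"
        echo "skip"
    elif [ "$attempt" -ge "$max" ] && [ ! -e "$marker" ]; then
        : > "$marker"
        echo "notify"
    else
        echo "skip"
    fi
}
```

A handler could call `notify_once "$1" "$3" "$4" "$6" "$7"` with the argument order shown earlier in the thread and only post the event when the result is "notify".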
-
scottwilkerson
Re: Global Event Handler Continuing to alert on service down
kpascoe wrote:
> I've tried the latest maint patch again. Still no joy.
> I think I can see the issue though. After 5 of 5 checks, the state is still set to soft; it doesn't seem to get set to hard.
What are you running to install the latest maint branch?
Did you restart Nagios?
I should also mention that the services will have to go into an OK state for the changes in the branch to take effect: back to OK first, and then they should be marked HARD after hitting 5 of 5.
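One way to check whether a service really stays SOFT after using all of its attempts is to read the status file directly. The sketch below assumes the usual `servicestatus { ... }` block layout with `state_type`, `current_state`, `current_attempt` and `max_attempts` fields, where `state_type=0` means SOFT. The function name `list_stuck_soft` is hypothetical, and the status file location has to be supplied by the caller.

```shell
#!/bin/sh
# list_stuck_soft FILE
# Prints "host;service" for each servicestatus block that is non-OK, has used
# all of its attempts, and is still SOFT (state_type=0).
list_stuck_soft() {
    awk '
        /^[ \t]*servicestatus[ \t]*[{]/ { in_block = 1; split("", f); next }
        in_block && /^[ \t]*[}]/ {
            if (f["current_state"] + 0 != 0 && f["state_type"] + 0 == 0 &&
                f["current_attempt"] + 0 >= f["max_attempts"] + 0)
                print f["host_name"] ";" f["service_description"]
            in_block = 0
            next
        }
        in_block {
            # Split each "key=value" line at the first "=" only.
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq > 0) f[substr(line, 1, eq - 1)] = substr(line, eq + 1)
        }
    ' "$1"
}
```

On a default source install the file is often `/usr/local/nagios/var/status.dat`, but that path is an assumption; `status_file` in nagios.cfg gives the actual location.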
Re: Global Event Handler Continuing to alert on service down
scottwilkerson wrote:
> I believe I found the cause in Core, and it is fixed in the maint branch on GitHub:
> https://github.com/NagiosEnterprises/na ... ee/maint
> Code:
> wget https://github.com/NagiosEnterprises/nagioscore/archive/maint.tar.gz
> tar xzf maint.tar.gz
> cd nagioscore-maint
> configureflags="--with-command-group=nagcmd"
> if ! command -v systemctl >/dev/null 2>&1 || [ -f /etc/init.d/nagios ]; then
>     configureflags="--with-init-type=sysv $configureflags"
> fi
> ./configure $configureflags
> make -j 2 all
> make install
> service nagios restart
> After this, once the services stuck in a soft state return to OK (either naturally, or by stopping Nagios and removing retention.dat), they should no longer get stuck.
Sorry for the delay in getting back.
As the server in question doesn't have internet access, I'm downloading the maint branch from GitHub as a zip, unzipping it on the server and then running the commands above from the ./configure step.
Nagios was restarted (in fact, I restarted the entire server) and the service was in an OK state before I made it fail again.
-
scottwilkerson
Re: Global Event Handler Continuing to alert on service down
kpascoe wrote:
> Sorry for the delay in getting back.
> As the server in question doesn't have internet access, I'm downloading the maint branch from GitHub as a zip, unzipping it on the server and then running the commands above from the ./configure step.
This should be fine.