Global Event Handler Continuing to alert on service down

kpascoe · Post by **kpascoe** » Tue Jul 31, 2018 8:00 am

Since upgrading to NagiosXI 5.5.2, we are seeing some strange things.

We use the Global Event Handlers to send HOST and SERVICE changes to a script that interfaces with an in-house alerting system. This has been working fine for ages.

Since upgrading, when a node goes down, we receive one alert for the host going down, and then an alert for the Ping service being down every minute, even after the maximum of 5 checks has failed (max_checks is set to 5).

I have attached our profile.

Regards

Kev

scottwilkerson · Post by **scottwilkerson** » Tue Jul 31, 2018 11:40 am

I believe this could be caused by a bug in Core that is being worked on:
https://github.com/NagiosEnterprises/na ... issues/557

I added this thread to the issue so you can be notified once the issue is resolved.

scottwilkerson · Post by **scottwilkerson** » Thu Aug 02, 2018 7:19 am

I believe I found the cause in Core, and it is fixed in the maint branch on GitHub:
https://github.com/NagiosEnterprises/na ... ee/maint

Code:

wget https://github.com/NagiosEnterprises/nagioscore/archive/maint.tar.gz
tar xzf maint.tar.gz
cd nagioscore-maint
# Build the configure flags; force SysV init when systemd is absent or an init script exists
configureflags="--with-command-group=nagcmd"
if ! command -v systemctl >/dev/null 2>&1 || [ -f /etc/init.d/nagios ]; then
    configureflags="--with-init-type=sysv $configureflags"
fi
# Left unquoted on purpose so that each flag is passed as a separate argument
./configure $configureflags
make -j 2 all
make install

service nagios restart

After this, once the services stuck in a SOFT state return to an OK state, either naturally or by stopping Nagios and removing retention.dat, they should no longer get stuck.
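
For the retention.dat route, a minimal sketch might look like the following. The path /usr/local/nagios/var/retention.dat is an assumption based on a default source install (the actual location is whatever state_retention_file points to in nagios.cfg), and move_retention_aside is a hypothetical helper name. Moving the file aside rather than deleting it keeps a copy, since the retention file also holds things like comments and scheduled downtime.

```shell
# Hypothetical helper: move a retention file aside instead of deleting it.
# Prints the backup path on success; returns 1 if the file does not exist.
move_retention_aside() {
    retention_file="$1"
    backup_file="${retention_file}.bak"
    if [ ! -f "$retention_file" ]; then
        echo "no retention file at $retention_file" >&2
        return 1
    fi
    mv "$retention_file" "$backup_file" && printf '%s\n' "$backup_file"
}

# Intended sequence (as root, with Nagios stopped while the file is moved):
#   service nagios stop
#   move_retention_aside /usr/local/nagios/var/retention.dat   # assumed default path
#   service nagios start
```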

kpascoe · Post by **kpascoe** » Fri Aug 03, 2018 2:33 am

I have compiled and installed the maint branch, with no joy. I think the issue I am seeing is different. The problem is not that the service fails to stop notifying when it returns to an OK state; it is that it keeps notifying despite reaching the max checks of 5.

Our global event handler calls this for a service change:

Code:

/usr/local/nagios_scripts/service_change_handler.sh "%host%" "%hoststate%" "%service%" "%servicestate%" "%serviceoutput%" "%currentattempt%" "%maxattempts%"

As you can see from our script's logs, each time I press the force immediate check link in Nagios, a service change is detected, even when the max attempts of 5 has been reached. This didn't use to happen; we would stop getting service change alerts after the 5th attempt.

Code:

Fri Aug  3 08:26:07 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "UP". Attempt No. 1 of max of 5

Fri Aug  3 08:26:07 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL - Socket timeout" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 2 of max of 5

Fri Aug  3 08:26:28 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 3 of max of 5

Fri Aug  3 08:26:43 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 4 of max of 5

Fri Aug  3 08:27:08 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 5 of max of 5

Service not ok after 5 attempts. Notifiying Tec!

/usr/bin/postemsg -S tec -m "Service Check NTP Time on host nodedowner.westernpower.co.uk is CRITICAL (NagiosXI)" event_nodename=nodedowner.westernpower.co.uk event_service="Check NTP Time" event_severity=CRITICAL service_output="CRITICAL: No response from NTP server" nagiosxi_service logfile

Error Deleting /etc/weccache

Fri Aug  3 08:27:32 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL - Socket timeout" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 5 of max of 5

Service not ok after 5 attempts. Notifiying Tec!

/usr/bin/postemsg -S tec -m "Service Check NTP Time on host nodedowner.westernpower.co.uk is CRITICAL (NagiosXI)" event_nodename=nodedowner.westernpower.co.uk event_service="Check NTP Time" event_severity=CRITICAL service_output="CRITICAL - Socket timeout" nagiosxi_service logfile

Error Deleting /etc/weccache

Fri Aug  3 08:27:53 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL - Socket timeout" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 5 of max of 5

Service not ok after 5 attempts. Notifiying Tec!

/usr/bin/postemsg -S tec -m "Service Check NTP Time on host nodedowner.westernpower.co.uk is CRITICAL (NagiosXI)" event_nodename=nodedowner.westernpower.co.uk event_service="Check NTP Time" event_severity=CRITICAL service_output="CRITICAL - Socket timeout" nagiosxi_service logfile

Error Deleting /etc/weccache

Is this something I will now have to code around, or is there a bug?
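
If the handler did have to be coded around this, one possible approach is to remember the last state it notified for and stay quiet on repeats. The sketch below is only an illustration under assumptions: should_notify and STATE_DIR are hypothetical names, the marker-file location is arbitrary, and the arguments mirror the host, service, state, current attempt and max attempts values already passed to the handler above. A recovery only clears the marker; whether to send a recovery event is left to the calling script.

```shell
# Hypothetical guard for a handler script: succeed (exit status 0) only the
# first time a service reaches its final attempt in a given problem state.
# STATE_DIR is an assumed location for small marker files.
STATE_DIR="${STATE_DIR:-/tmp/service_change_handler}"

should_notify() {
    host="$1"; service="$2"; state="$3"; attempt="$4"; max="$5"
    mkdir -p "$STATE_DIR" || return 1
    # One marker file per host/service pair; unsafe characters become "_"
    key=$(printf '%s__%s' "$host" "$service" | tr -c 'A-Za-z0-9._-' '_')
    marker="$STATE_DIR/$key"

    if [ "$state" = "OK" ]; then
        rm -f "$marker"                    # recovery: forget the last notification
        return 1
    fi
    [ "$attempt" -ge "$max" ] || return 1  # still retrying, nothing to send yet
    # Already notified for this exact state? Then stay quiet.
    if [ -f "$marker" ] && [ "$(cat "$marker")" = "$state" ]; then
        return 1
    fi
    printf '%s' "$state" > "$marker"
    return 0
}

# Example use inside a handler (arguments in the order host, service, state,
# current attempt, max attempts):
#   if should_notify "$1" "$3" "$4" "$6" "$7"; then ...send the alert...; fi
```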

scottwilkerson · Post by **scottwilkerson** » Fri Aug 03, 2018 12:22 pm

There was one more commit to the maint branch last night that I believe fixes this issue.

I set up a test, and once a service goes HARD it no longer keeps firing the global event handler.

It is normal for it to execute the global event handler for each retry while in a SOFT state, and then again when it reaches the HARD state.
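
Because the handler runs on every SOFT retry as well as on the HARD state change, a script that should only act once can branch on the state type. The sketch below assumes the state type is passed to the script as an extra argument. Nagios Core exposes it as the $SERVICESTATETYPE$ macro, but the matching placeholder name in the XI global event handler settings is not shown in this thread and would need to be verified. handle_state_type is a hypothetical name.

```shell
# Hypothetical dispatcher: decide what to do from the state type.
# "$1" is the state type (SOFT or HARD), "$2" is the service state.
handle_state_type() {
    state_type="$1"; state="$2"
    case "$state_type" in
        HARD)
            # A HARD OK is a recovery; any other HARD state is a confirmed problem
            if [ "$state" = "OK" ]; then echo "recovery"; else echo "alert"; fi
            ;;
        SOFT)
            echo "ignore"       # still retrying; expected once per retry
            ;;
        *)
            echo "unknown"
            return 1
            ;;
    esac
}
```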

kpascoe · Post by **kpascoe** » Mon Aug 06, 2018 5:39 am

I've tried the latest maint patch again. Still no joy.

I think I can see the issue though. After 5 of 5 checks, the state is still set to SOFT; it doesn't seem to get set to HARD.
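
One way to confirm what Core itself has recorded is to read the service's entry in the status file. The sketch below is a rough illustration: service_state_type is a hypothetical helper, and the path /usr/local/nagios/var/status.dat in the usage comment is an assumption based on a default install (the real location is the status_file setting in nagios.cfg). It relies on the servicestatus blocks in that file carrying host_name, service_description, state_type, current_attempt and max_attempts fields, with state_type=0 meaning SOFT and state_type=1 meaning HARD.

```shell
# Hypothetical helper: print the state type and attempt counters for one
# service from a Nagios status file.
service_state_type() {
    status_file="$1"; host="$2"; service="$3"
    awk -v host="$host" -v svc="$service" '
        # Start of a service block: reset the collected fields
        /^servicestatus[ \t]*\{/ { in_block = 1; split("", f); next }
        # End of the block: report it if it is the service we want
        in_block && /^[ \t]*\}/ {
            if (f["host_name"] == host && f["service_description"] == svc) {
                kind = (f["state_type"] == "1") ? "HARD" : "SOFT"
                printf "%s attempt %s of %s\n", kind, f["current_attempt"], f["max_attempts"]
            }
            in_block = 0; next
        }
        # Inside a block: store each key=value line, splitting on the first "="
        in_block {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq > 0) f[substr(line, 1, eq - 1)] = substr(line, eq + 1)
        }
    ' "$status_file"
}

# Example (assumed default path):
#   service_state_type /usr/local/nagios/var/status.dat \
#       nodedowner.westernpower.co.uk "Check NTP Time"
```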

scottwilkerson · Post by **scottwilkerson** » Mon Aug 06, 2018 7:08 am

kpascoe wrote:I've tried the latest maint patch again. Still no joy.

I think I can see the issue though. After 5 of 5 checks, the state is still set to SOFT; it doesn't seem to get set to HARD.

What are you running to install the latest maint branch?

Did you restart Nagios?

I should also mention that the services will have to go into an OK state for the changes in the branch to take effect. So once a service is back to OK, it should be marked HARD after hitting 5 of 5.

kpascoe · Post by **kpascoe** » Fri Aug 10, 2018 3:55 am

Sorry for the delay in getting back.

As the server in question doesn't have internet access, I'm downloading the maint branch from GitHub as a zip, unzipping it on the server, and then running the commands above from the ./configure step.

Nagios was restarted (in fact, I restarted the entire server), and the service was in an OK state before I made it fail again.

scottwilkerson · Post by **scottwilkerson** » Fri Aug 10, 2018 7:05 am

kpascoe wrote: Sorry for the delay in getting back

As the server in question doesn't have internet access, I'm downloading the maint branch from github as a zip, unzipping it on the server and then running the commands above from the ./configure step.

This should be fine.

Nagios Support Forum
