Global Event Handler Continuing to alert on service down

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
kpascoe
Posts: 10
Joined: Tue Nov 27, 2012 5:40 am
Contact:

Global Event Handler Continuing to alert on service down

Post by kpascoe »

Since upgrading to Nagios XI 5.5.2, we have been seeing some strange behaviour.

We use the Global Event Handlers to send HOST and SERVICE changes to a script that interfaces with an in-house alerting system. This has been working fine for ages.

Since upgrading, when a node goes down we receive one alert for the host going down, and then an alert for the Ping service being down every minute, even after the maximum of 5 checks has failed (max_check_attempts is set to 5).

I have attached our profile.

Regards

Kev
Last edited by kpascoe on Fri Aug 03, 2018 2:06 am, edited 1 time in total.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Global Event Handler Continuing to alert on service down

Post by scottwilkerson »

I believe this could be caused by a bug in Core that is being worked on:
https://github.com/NagiosEnterprises/na ... issues/557

I added this thread to the issue so you can be notified once the issue is resolved.
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Global Event Handler Continuing to alert on service down

Post by scottwilkerson »

I believe I found the cause in Core, and it is fixed in the maint branch on GitHub:
https://github.com/NagiosEnterprises/na ... ee/maint

Code:

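# Download and unpack the maint branch of Nagios Core
wget https://github.com/NagiosEnterprises/nagioscore/archive/maint.tar.gz
tar xzf maint.tar.gz
cd nagioscore-maint
# Use sysv init when systemctl is missing or a sysv init script already exists
configureflags="--with-command-group=nagcmd"
if ! command -v systemctl >/dev/null 2>&1 || [ -f /etc/init.d/nagios ]; then
    configureflags="--with-init-type=sysv $configureflags"
fi
# Left unquoted on purpose so that each flag is passed as a separate argument
./configure $configureflags
make -j 2 all
make install

service nagios restart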

After this, once the services stuck in a SOFT state return to an OK state, either naturally or by stopping Nagios and removing retention.dat, they should no longer get stuck.
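
For the retention.dat route, moving the file aside is a little safer than deleting it, since it can be restored if needed. Keep in mind that retention.dat also holds other retained data such as comments, acknowledgements and scheduled downtime, so those are reset as well. A minimal sketch follows; `backup_retention` is a hypothetical helper name, and the path in the usage comment is the usual default for a source install, which is an assumption about this system:

```shell
# backup_retention FILE
# Moves FILE aside with a timestamp suffix instead of deleting it.
# (Hypothetical helper, not part of Nagios.)
backup_retention() {
    local file=$1
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    mv -- "$file" "$file.$(date +%Y%m%d%H%M%S).bak"
}

# Usage, with Nagios stopped first (path is an assumed default):
#   service nagios stop
#   backup_retention /usr/local/nagios/var/retention.dat
#   service nagios start
```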
kpascoe
Posts: 10
Joined: Tue Nov 27, 2012 5:40 am
Contact:

Re: Global Event Handler Continuing to alert on service down

Post by kpascoe »

I have compiled and installed the maint branch with no joy. I think the issue I am seeing is different. The problem is not that the service keeps notifying after it returns to an OK state; it is that it keeps notifying after reaching the max check attempts of 5.

Our global event handler calls this on a service change:

Code:

/usr/local/nagios_scripts/service_change_handler.sh "%host%" "%hoststate%" "%service%" "%servicestate%" "%serviceoutput%" "%currentattempt%" "%maxattempts%"
As you can see from our script's logs, each time I press the force immediate check link in Nagios, a service change is detected, even after the max attempts of 5 has been reached. This didn't use to happen; we would stop getting service change alerts after the 5th attempt.

Code:

Fri Aug  3 08:26:07 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "UP". Attempt No. 1 of max of 5

Fri Aug  3 08:26:07 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL - Socket timeout" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 2 of max of 5

Fri Aug  3 08:26:28 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 3 of max of 5

Fri Aug  3 08:26:43 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 4 of max of 5

Fri Aug  3 08:27:08 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL: No response from NTP server" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 5 of max of 5

Service not ok after 5 attempts. Notifiying Tec!

/usr/bin/postemsg -S tec -m "Service Check NTP Time on host nodedowner.westernpower.co.uk is CRITICAL (NagiosXI)" event_nodename=nodedowner.westernpower.co.uk event_service="Check NTP Time" event_severity=CRITICAL service_output="CRITICAL: No response from NTP server" nagiosxi_service logfile

Error Deleting /etc/weccache

Fri Aug  3 08:27:32 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL - Socket timeout" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 5 of max of 5

Service not ok after 5 attempts. Notifiying Tec!

/usr/bin/postemsg -S tec -m "Service Check NTP Time on host nodedowner.westernpower.co.uk is CRITICAL (NagiosXI)" event_nodename=nodedowner.westernpower.co.uk event_service="Check NTP Time" event_severity=CRITICAL service_output="CRITICAL - Socket timeout" nagiosxi_service logfile

Error Deleting /etc/weccache

Fri Aug  3 08:27:53 BST 2018

Recieved Service State "CRITICAL" from Service "Check NTP Time" with Service Output of "CRITICAL - Socket timeout" on Host "nodedowner.westernpower.co.uk" with Host State of "DOWN". Attempt No. 5 of max of 5

Service not ok after 5 attempts. Notifiying Tec!

/usr/bin/postemsg -S tec -m "Service Check NTP Time on host nodedowner.westernpower.co.uk is CRITICAL (NagiosXI)" event_nodename=nodedowner.westernpower.co.uk event_service="Check NTP Time" event_severity=CRITICAL service_output="CRITICAL - Socket timeout" nagiosxi_service logfile

Error Deleting /etc/weccache
Is this something I will now have to code around, or is there a bug?
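
If it does come to coding around it, one option is to have the handler remember what it last reported for each host/service pair and skip repeats. The following is only a sketch under assumptions, not the script used in this thread: `should_notify` and `STATE_DIR` are made-up names, and the marker-file approach is one of several ways to do it.

```shell
# Hypothetical de-duplication helper for a service event handler.
# STATE_DIR and should_notify are illustrative names, not part of Nagios.
STATE_DIR="${STATE_DIR:-/tmp/service_handler_state}"

# should_notify HOST SERVICE STATE ATTEMPT MAX
# Returns 0 (notify) only the first time a non-OK state is seen on the
# final attempt; returns 1 for retries, repeats and recoveries.
should_notify() {
    local host=$1 service=$2 state=$3 attempt=$4 max=$5
    local key marker
    # Build a filesystem-safe key from the host and service names.
    key=$(printf '%s__%s' "$host" "$service" | tr -c 'A-Za-z0-9_.-' '_')
    marker="$STATE_DIR/$key"
    mkdir -p "$STATE_DIR"

    if [ "$state" = "OK" ]; then
        rm -f "$marker"    # recovery: forget the problem so the next one alerts
        return 1
    fi
    if [ "$attempt" -lt "$max" ]; then
        return 1           # still retrying
    fi
    if [ -f "$marker" ] && [ "$(cat "$marker")" = "$state" ]; then
        return 1           # this state has already been reported
    fi
    printf '%s' "$state" > "$marker"
    return 0
}
```

With the argument order of the handler command shown earlier, the call inside the script would be `should_notify "$1" "$3" "$4" "$6" "$7"`, wrapping the existing alert command in an `if`. A change from one non-OK state to another (for example WARNING to CRITICAL) is treated as new and notifies again.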
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Global Event Handler Continuing to alert on service down

Post by scottwilkerson »

There was one more commit to the maint branch last night that I believe fixes this issue.

I set up a test, and once a service goes HARD it no longer keeps firing the global event handler.

It is normal for it to execute the global event handler for each retry while in a SOFT state, and then again when it reaches the HARD state.
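
As a rough illustration of that rule, here is a simplified model for a service that stays non-OK. It is not how Nagios Core implements it, and `state_type` is a made-up helper name: the state is SOFT while the attempt counter is below the maximum and HARD once it reaches it.

```shell
# state_type ATTEMPT MAX
# Simplified model for a service that stays non-OK: prints SOFT while
# retries remain and HARD once the attempt counter reaches the maximum.
state_type() {
    local attempt=$1 max=$2
    if [ "$attempt" -lt "$max" ]; then
        echo SOFT
    else
        echo HARD
    fi
}

# Walk through a max of 5 attempts, as configured in this thread.
for attempt in 1 2 3 4 5; do
    printf 'attempt %s of 5: %s\n' "$attempt" "$(state_type "$attempt" 5)"
done
```

Under this model attempts 1 to 4 print SOFT and attempt 5 prints HARD, so the handler would be expected to run five times in total for one outage.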
kpascoe
Posts: 10
Joined: Tue Nov 27, 2012 5:40 am
Contact:

Re: Global Event Handler Continuing to alert on service down

Post by kpascoe »

I've tried the latest maint patch again. Still no joy.

I think I can see the issue though. After 5 of 5 checks, the state is still set to SOFT; it doesn't seem to get set to HARD.
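
One way to confirm this outside the web interface is to read the state type straight from Nagios's status file. The sketch below assumes the usual status.dat layout (`servicestatus { ... }` blocks with `key=value` lines, where `state_type=0` means SOFT and `1` means HARD); `service_state_type` is a hypothetical helper name, and the file location has to be supplied by the caller.

```shell
# service_state_type STATUS_FILE HOST SERVICE
# Prints "SOFT attempt N/M" or "HARD attempt N/M" for the matching service.
# (Hypothetical helper; relies on the status.dat layout described above.)
service_state_type() {
    awk -v host="$2" -v svc="$3" '
        $1 == "servicestatus" && $2 == "{" {
            in_block = 1; h = ""; s = ""; t = ""; a = ""; m = ""
            next
        }
        in_block && $1 == "}" {
            if (h == host && s == svc)
                printf "%s attempt %s/%s\n", (t == "1" ? "HARD" : "SOFT"), a, m
            in_block = 0
            next
        }
        in_block {
            line = $0
            sub(/^[ \t]+/, "", line)          # drop leading indentation
            eq = index(line, "=")
            if (eq == 0) next
            key = substr(line, 1, eq - 1)
            val = substr(line, eq + 1)
            if (key == "host_name") h = val
            else if (key == "service_description") s = val
            else if (key == "state_type") t = val
            else if (key == "current_attempt") a = val
            else if (key == "max_attempts") m = val
        }
    ' "$1"
}
```

For example, `service_state_type /usr/local/nagios/var/status.dat nodedowner.westernpower.co.uk "Check NTP Time"`, where the path is the common default for a source install and may differ on a given system.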
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Global Event Handler Continuing to alert on service down

Post by scottwilkerson »

kpascoe wrote:I've tried the latest maint patch again. Still no joy.

I think I can see the issue though. After 5 of 5 checks, the state is still set to SOFT; it doesn't seem to get set to HARD.
What are you running to install the latest maint branch?

Did you restart Nagios?

I should also mention that the services have to go into an OK state for the changes in the branch to take effect. So they need to go back to OK first, and should then be marked HARD after hitting 5 of 5.
kpascoe
Posts: 10
Joined: Tue Nov 27, 2012 5:40 am
Contact:

Re: Global Event Handler Continuing to alert on service down

Post by kpascoe »

scottwilkerson wrote:I believe I found the cause in Core, and it is fixed in the maint branch on GitHub:
https://github.com/NagiosEnterprises/na ... ee/maint

Code:

# Download and unpack the maint branch of Nagios Core
wget https://github.com/NagiosEnterprises/nagioscore/archive/maint.tar.gz
tar xzf maint.tar.gz
cd nagioscore-maint
# Use sysv init when systemctl is missing or a sysv init script already exists
configureflags="--with-command-group=nagcmd"
if ! command -v systemctl >/dev/null 2>&1 || [ -f /etc/init.d/nagios ]; then
    configureflags="--with-init-type=sysv $configureflags"
fi
# Left unquoted on purpose so that each flag is passed as a separate argument
./configure $configureflags
make -j 2 all
make install

service nagios restart

After this, once the services stuck in a SOFT state return to an OK state, either naturally or by stopping Nagios and removing retention.dat, they should no longer get stuck.
Sorry for the delay in getting back.

As the server in question doesn't have internet access, I'm downloading the maint branch from GitHub as a zip, unzipping it on the server, and then running the commands above, starting from the ./configure step.

Nagios was restarted (in fact, I restarted the entire server), and the service was in an OK state before I made it fail again.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Global Event Handler Continuing to alert on service down

Post by scottwilkerson »

kpascoe wrote: Sorry for the delay in getting back.

As the server in question doesn't have internet access, I'm downloading the maint branch from GitHub as a zip, unzipping it on the server, and then running the commands above, starting from the ./configure step.
This should be fine.