Global event handler - Executing multiple times
Posted: Fri Mar 15, 2019 3:54 am
Hi,
I have developed a set of PHP scripts to log critical hard alerts into our ITSM system and update the ticket when the service recovers or is acknowledged.
99% of the time everything works as it should. However, I'm having a sporadic issue with the global event handler executing multiple times, which causes multiple tickets to be logged at the same time.
Looking through my scripts' logs from the most recent occurrence, it seems that when the first hard event handler was triggered to log the call, the script did not get a response from our ITSM system in a reasonable time, so it did not complete and hung for a while. Over the next 5 minutes, Nagios then appeared to re-trigger the global event handler for the same issue every 30 seconds.
21:09:29: HARD alert from Nagios XI for service A critical
Tries to log a call
21:10:30: HARD alert from Nagios XI for service A critical
Tries to log a call
21:11:31: HARD alert from Nagios XI for service A critical
Tries to log a call
21:12:32: HARD alert from Nagios XI for service A critical
Tries to log a call
21:13:34: HARD alert from Nagios XI for service A critical
Tries to log a call
21:14:35: HARD alert from Nagios XI for service A critical
Tries to log a call
21:16:34: Over the next 20 seconds, all the above alerts log a call
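Whatever the cause of the repeated invocations, a per-service guard inside the handler might stop the duplicate tickets in the meantime. The following is only a minimal sketch, written in Python for brevity rather than the PHP the real scripts use; the function names, the lock directory, and the host/service values are all hypothetical. It relies on atomic exclusive file creation so that only the first invocation for a given host/service pair goes on to log a call.

```python
import hashlib
import os
import tempfile


def lock_path(lock_dir, host, service):
    # Hash the pair so arbitrary host/service names yield a safe file name.
    key = hashlib.sha256(f"{host}\n{service}".encode("utf-8")).hexdigest()
    return os.path.join(lock_dir, key + ".lock")


def try_claim(lock_dir, host, service):
    """Return True only for the first caller; later callers get False."""
    try:
        # O_CREAT | O_EXCL fails atomically if the marker already exists.
        fd = os.open(lock_path(lock_dir, host, service),
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release(lock_dir, host, service):
    """Remove the marker, e.g. when the service recovers."""
    try:
        os.remove(lock_path(lock_dir, host, service))
    except FileNotFoundError:
        pass


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # hypothetical lock directory
    if try_claim(demo_dir, "hostA", "service A"):
        print("first invocation: log the ticket")
    if not try_claim(demo_dir, "hostA", "service A"):
        print("repeat invocation: skip, a ticket is already being logged")
```

The marker would need to be released on recovery (and probably expired after some age) so that a later, genuinely new outage can still raise a ticket.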
Just to help narrow this down so I can put a fix in, can you confirm whether Nagios will keep re-executing a global event handler whose script does not complete, until it does?
Thanks.
Edit:
It looks like Nagios has an option (event_handler_timeout) that kills scripts which have been running for longer than 30 seconds (the default). However, I don't believe this is working, as the script is still carrying out various actions 5 minutes after it was initially invoked.
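Independently of event_handler_timeout, the handler could enforce its own deadline on the slow ITSM step so that it never outlives a single invocation. This is a hedged sketch in Python, not the actual PHP scripts: `run_with_deadline` is a hypothetical helper, and the command it runs stands in for whatever program talks to the ITSM system.

```python
import subprocess
import sys


def run_with_deadline(cmd, seconds):
    """Run cmd; return its exit code, or None if it exceeded the deadline."""
    try:
        # On timeout, subprocess.run kills the child and raises TimeoutExpired.
        done = subprocess.run(cmd, timeout=seconds)
    except subprocess.TimeoutExpired:
        return None
    return done.returncode


if __name__ == "__main__":
    # Stand-in for a hanging ITSM call: a child process that sleeps too long.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    if run_with_deadline(slow, 1) is None:
        print("ITSM call timed out; giving up instead of hanging")
```

Note that only the direct child is killed this way; anything that child spawned itself could keep running, which may be relevant if the handler launches further processes.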