[SOLVED] Can I limit the execution time of my event handler

dennisg
Posts: 14
Joined: Wed May 31, 2017 7:28 am

Post by dennisg »

Hi swarm intelligence, I need your help:

Setup:
I have an important Web-Service that is supposed to run 24x7 and is therefore monitored 24x7.

Code:

check_interval                  2
retry_interval                  1
max_check_attempts              3
notification_interval           30
notification_period             24x7

An event handler has been installed and restarts the App-Server (Tomcat) as expected. Everything is running fine so far.

Situation:
From time to time, long-running tasks are performed, especially at night (backups, scheduled import jobs, cleanup jobs, etc.).
These tasks sometimes make the App-Server respond "slowly" (i.e. not within the time configured in Nagios), which leads to a CRITICAL state. The event handler then (correctly) reacts by restarting the App-Server, which breaks any running jobs...

My first idea was to create a new timeperiod (called "scheduled-tasks", Mon. - Sun. from 00:00 - 05:00 hrs) and to enhance the event handler to take it into account by means of the on-demand macro $ISVALIDTIME:$.
My script, which is based on the default script from Nagios (and has already been successfully enhanced to take care of scheduled downtimes, etc.), correctly honours the timeperiod: instead of restarting Tomcat, it just logs that it has detected an issue with the service. All fine again, BUT:
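
For readers following along, the timeperiod check described above could look roughly like the sketch below. It is not the poster's script (which was never shared). It assumes the command definition passes `$SERVICESTATE$`, `$SERVICESTATETYPE$`, `$SERVICEATTEMPT$` and `$ISVALIDTIME:scheduled-tasks$` as arguments 1 to 4; the function name `decide_action` and the log text are made up for illustration.

```shell
#!/bin/sh
# Sketch of an event handler. Assumed (hypothetical) command_line arguments:
#   $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $ISVALIDTIME:scheduled-tasks$

# decide_action STATE STATETYPE IN_WINDOW
# Prints "restart", "log-only" or "none".
decide_action() {
    state=$1
    state_type=$2
    in_window=$3   # 1 if "now" is inside the timeperiod, 0 otherwise

    if [ "$state" != "CRITICAL" ] || [ "$state_type" != "HARD" ]; then
        echo "none"
    elif [ "$in_window" = "1" ]; then
        echo "log-only"
    else
        echo "restart"
    fi
}

# Act only when called with all four arguments, as Nagios would call it.
if [ "$#" -ge 4 ]; then
    case "$(decide_action "$1" "$2" "$4")" in
        restart)  service tomcat restart ;;
        log-only) echo "Service CRITICAL during scheduled-tasks, restart skipped" >&2 ;;
    esac
fi
```

Keeping the decision in a small function makes it easy to test the handler from a shell without touching Tomcat, e.g. `decide_action CRITICAL HARD 1`.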

Issue:
Since the service is in a CRITICAL HARD state and sometimes never really leaves this state, no further state change occurs that would trigger the event handler. As a result, the service is not restarted when the "scheduled-tasks" timeperiod ends (i.e. at 05:00 hrs) and remains faulty until a manual restart.

I'm looking for a smart way to work around this issue, and this is where you can join in :)
How can I keep checking the service 24x7, ignore a faulty state during the specified off-hours, and still get an automatic restart (through the event handler) after this timeperiod without manual interaction?

Approach:
My current thought is to inject an external command, such as PROCESS_SERVICE_CHECK_RESULT, at the point in the script that just logs an error instead of restarting the service, and thereby reset the state (back to "0" = OK). But I'm not sure whether there is a better / smarter way to handle the situation. My attempt looks a bit "hackish" to me... :-?
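
A rough sketch of what such an injection could look like, for illustration only. The command file path is the default of a source install and is an assumption, as are the function names and the host and service names in the usage line. The service also has to accept passive check results for this to have any effect.

```shell
#!/bin/sh
# Assumed location of the Nagios external command file; adjust to your install.
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# passive_result_cmd NOW HOST SERVICE RETURN_CODE OUTPUT
# Prints one PROCESS_SERVICE_CHECK_RESULT line in external command syntax.
passive_result_cmd() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# submit_ok HOST SERVICE
# Writes an OK (return code 0) result to the command pipe, if it exists.
submit_ok() {
    [ -p "$CMD_FILE" ] || return 1
    passive_result_cmd "$(date +%s)" "$1" "$2" 0 \
        "State reset by event handler during scheduled-tasks" > "$CMD_FILE"
}
```

Usage from the "log only" branch would be something like `submit_ok "web01" "Web-Service"` (both names hypothetical). Note that the next active check can put the service straight back into a problem state, which is part of why this feels hackish.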

I hope, I made myself (somewhat) clear. If you need any more information pls. don't hesitate to let me know.
Many thanks in advance for brainstorming with me on this issue :)

cheers,
Dennis
Last edited by dennisg on Mon Feb 12, 2018 9:14 am, edited 2 times in total.
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Can I limit the execution time of my event handler

Post by npolovenko »

Hello, @dennisg.
Can you share the event handler script with us? What if you replace the "service tomcat restart" command with:

Code:

service tomcat stop
service tomcat start

Essentially the same thing, but it will be able to start Tomcat even when it is completely stopped. Also, what command are you using to check the Tomcat service? Can you upload it? You might be able to just increase the timeout value.
dennisg
Posts: 14
Joined: Wed May 31, 2017 7:28 am

Re: Can I limit the execution time of my event handler

Post by dennisg »

Hi @npolovenko,

thanks for your reply.
Quote: "Can you share the event handler script with us"
Not that easily, as it contains "internal information" which I would first have to remove. Again: it's basically the script from Nagios that I had linked previously, enhanced to send mails and to take care of scheduled downtimes.

The thing is: during this period of scheduled tasks I don't want Tomcat to be restarted at all, so changing from "service tomcat restart" to separate stop and start commands isn't the solution I've been looking for.
Quote: "Also, what command are you using to check the tomcat service, can you upload it? You might be able to just increase the timeout value."
It's basically a check_http. Increasing the timeout is also not an option, as this would affect the behaviour around the clock (and make an automatic restart happen too rarely at other times, e.g. during office hours).

Another solution came to my mind, which I will try out on Monday: injecting the external command SCHEDULE_SVC_DOWNTIME via cron to define a scheduled downtime for the service each night between 00:00 and 05:30.
Advantages:
* No need for a specific timeperiod
* No need to change the event handler
* No impact on other services (being checked with the same command)
* somewhat documented, as I can include a service comment as well.
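
A minimal sketch of such a cron-driven script, under stated assumptions: the command file path, the function names, and the host, service and author names are placeholders. The script is meant to be started by cron at 00:00, so the downtime begins "now" and lasts 19800 seconds (5.5 hours).

```shell
#!/bin/sh
# Hypothetical crontab entry for the nagios user:
#   0 0 * * * /usr/local/bin/nightly-downtime.sh
# Assumed location of the Nagios external command file; adjust to your install.
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# downtime_cmd NOW HOST SERVICE START END AUTHOR COMMENT
# Prints a fixed (fixed=1, trigger_id=0) SCHEDULE_SVC_DOWNTIME line.
downtime_cmd() {
    duration=$(( $5 - $4 ))
    printf '[%s] SCHEDULE_SVC_DOWNTIME;%s;%s;%s;%s;1;0;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$duration" "$6" "$7"
}

# schedule_nightly_downtime HOST SERVICE
# Schedules 00:00 - 05:30, assuming it is called at midnight.
schedule_nightly_downtime() {
    [ -p "$CMD_FILE" ] || return 1
    start=$(date +%s)
    end=$(( start + 19800 ))
    downtime_cmd "$start" "$1" "$2" "$start" "$end" \
        "cron" "Nightly scheduled tasks" > "$CMD_FILE"
}
```

The script would end with a call such as `schedule_nightly_downtime "web01" "Web-Service"`. This relies on the event handler skipping restarts while the service is in downtime, which the script in this thread already does.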

What do you think?

Best regards,
Dennis
dennisg
Posts: 14
Joined: Wed May 31, 2017 7:28 am

Re: Can I limit the execution time of my event handler

Post by dennisg »

The "automatic scheduled downtime" seems to provide the expected result, so I'm gonna mark this thread as "solved".
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: [SOLVED] Can I limit the execution time of my event handler

Post by tmcdonald »

Did you have any further (related) follow-up questions or are we good to lock this up?
Former Nagios employee
dennisg
Posts: 14
Joined: Wed May 31, 2017 7:28 am

Re: [SOLVED] Can I limit the execution time of my event handler

Post by dennisg »

No, thanks. Pls. go ahead and lock it up.