Setup:
I have an important Web-Service that is supposed to run 24x7 and is therefore monitored 24x7, with these settings:
check_interval 2
retry_interval 1
max_check_attempts 3
notification_interval 30
notification_period 24x7
Situation:
From time to time, long-running tasks are performed, especially at night (backups, scheduled import jobs, cleanup jobs, etc.).
These tasks sometimes make the App-Server respond slowly (i.e. not within the time configured in Nagios), which leads to a CRITICAL state. The event handler then (correctly) deals with it by kicking the App-Server, which breaks any running jobs...
My first idea was to create a new timeperiod (called "scheduled-tasks", Mon. - Sun. from 00:00 to 05:00 hrs) and to enhance the event handler to take it into account by means of the on-demand macro $ISVALIDTIME:scheduled-tasks$.
My script, which is based on the default script from Nagios (and has already been successfully enhanced to take scheduled downtimes etc. into account), handles the timeperiod correctly: instead of restarting Tomcat, it just logs that it has detected an issue with the service. All fine again, BUT:
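For illustration, here is a minimal sketch of this kind of handler logic. It is not the actual script. The argument order, the logger tag and the restart command (/etc/init.d/tomcat) are assumptions, and it presumes the command definition passes $SERVICESTATE$, $SERVICESTATETYPE$ and "$ISVALIDTIME:scheduled-tasks$" as arguments:

```shell
#!/bin/sh
# Hypothetical event handler sketch. Assumed Nagios command_line:
#   $USER1$/eventhandlers/restart-tomcat.sh \
#       $SERVICESTATE$ $SERVICESTATETYPE$ "$ISVALIDTIME:scheduled-tasks$"

# Decide what to do for a given state.
# Prints one of: restart, log-only, ignore.
decide_action() {
    state="$1"; state_type="$2"; in_quiet_period="$3"

    # Only act on confirmed (HARD) CRITICAL states.
    if [ "$state" != "CRITICAL" ] || [ "$state_type" != "HARD" ]; then
        echo "ignore"
        return 0
    fi

    # $ISVALIDTIME:...$ expands to 1 inside the timeperiod, 0 outside it.
    if [ "$in_quiet_period" = "1" ]; then
        echo "log-only"
    else
        echo "restart"
    fi
}

main() {
    action=$(decide_action "$1" "$2" "$3")
    case "$action" in
        restart)
            logger -t tomcat-handler "service CRITICAL, restarting Tomcat"
            # Placeholder: substitute the real restart command here.
            /etc/init.d/tomcat restart
            ;;
        log-only)
            logger -t tomcat-handler "service CRITICAL during scheduled-tasks, not restarting"
            ;;
    esac
}

# Run only when arguments are passed, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

The weak spot is the "log-only" branch: an event handler only runs on state changes, so if the service stays CRITICAL HARD, nothing calls the script again once the timeperiod has ended.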
Issue:
Since the service is in a CRITICAL HARD state and sometimes never really leaves this state, the service is also not restarted when the specified "scheduled-tasks" timeperiod ends (i.e. at 05:00 hrs), and it remains faulty until a manual restart.
I'm looking for a smart way to work around this issue, and this is where you can join in.
How can I keep checking the service 24x7, ignore a faulty state during the specified off-hours, and still get an automatic restart (through the event handler) after this timeperiod without manual interaction?
Approach:
My current idea is to inject an external command, such as PROCESS_SERVICE_CHECK_RESULT, at the point in the script that just logs an error instead of restarting the service, thereby resetting the state back to "0" (OK). I'm not sure whether there is a better / smarter way to handle the situation, though. My attempt looks a bit "hackish" to me...
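As a rough sketch of that idea, the snippet below builds a PROCESS_SERVICE_CHECK_RESULT line in the external command format and writes it to the Nagios command file. The function names, the plugin output text and the command file path in the example call are assumptions. The real path is whatever command_file is set to in nagios.cfg, and the service has to accept passive check results:

```shell
#!/bin/sh
# Build one external command line in the format:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
build_passive_result() {
    now="$1"; host="$2"; service="$3"; code="$4"; output="$5"
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$now" "$host" "$service" "$code" "$output"
}

# Write an OK (return code 0) result for host/service to the command file.
submit_passive_ok() {
    cmd_file="$1"; host="$2"; service="$3"
    build_passive_result "$(date +%s)" "$host" "$service" 0 \
        "OK - reset by event handler during scheduled-tasks" > "$cmd_file"
}

# Example call from the "log only" branch; the path is a placeholder and
# host/service would come from $HOSTNAME$ / $SERVICEDESC$ handler arguments:
# submit_passive_ok /usr/local/nagios/var/rw/nagios.cmd "$host" "$service"
```

If I understand the state logic correctly, the next failing active check would then cause a fresh state change and call the event handler again, so the restart could happen once the timeperiod is over. The price would be repeated OK/CRITICAL transitions (and possibly notifications) during the off-hours.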
I hope I made myself (somewhat) clear. If you need any more information, please don't hesitate to let me know.
Many thanks in advance for brainstorming with me on this issue.
cheers,
Dennis