So we are using the service restart event handler from the wonderful guide Nagios provided. Most of the time it works great, but sometimes the services are in such a bad state that the restart doesn't fix the issue, and we have to go in and restart them manually.
Is there an easy way to get the event handler to trigger automatically a second time if the service doesn't come back after the first try, or after a certain amount of time?
Thanks!
service restart event handler repetition.
slansing
Re: service restart event handler repetition.
Man... it's been a while since I wrote that guide; I'm glad people still find it useful! You could probably trigger it a second time by submitting passive check results to your service from its Advanced tab: send it an OK, then a CRITICAL again, and that should do it. The scripts on either end do not account for running a second time, since global event handlers are generally only triggered on state changes.
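For reference, passive results like the ones described above map onto the Nagios external command PROCESS_SERVICE_CHECK_RESULT. The following is only a sketch of how such a command line is formatted; the function name, host name, and service description are hypothetical, and something would still have to decide when to send the OK and the CRITICAL.

```shell
#!/bin/sh
# Sketch: build a PROCESS_SERVICE_CHECK_RESULT external command line.
# passive_result is a hypothetical helper name, not part of Nagios.
#   $1 = timestamp in epoch seconds
#   $2 = host name
#   $3 = service description
#   $4 = return code (0 = OK, 2 = CRITICAL)
#   $5 = plugin output text
passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example use (assumes the default source-install command file location):
#   passive_result "$(date +%s)" web01 "My Service" 0 "reset" >> /usr/local/nagios/var/rw/nagios.cmd
#   passive_result "$(date +%s)" web01 "My Service" 2 "forced" >> /usr/local/nagios/var/rw/nagios.cmd
```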
Re: service restart event handler repetition.
Haha for sure, it's wonderful!
So, is there a way to automate that? This way would also require manual intervention which we are trying to avoid.
Re: service restart event handler repetition.
Set the service to volatile (is_volatile). Nagios will then treat every check as a state change. Next, add some logic to your event handler script to check whether the state is hard critical (or whatever state and type you want the event to fire on). Now every check that is run will fire off the event handler. If all is OK, the extra logic should bail before running the actual/original event (restarting the service). If all is not well, the original event commands will fire. On the next iteration of the check, if the service is still not OK, the event will fire again, and so on until the service recovers.
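The "extra logic" could look roughly like the sketch below. The function name handler_action is hypothetical, and it assumes the handler receives the service state and the state type as its two arguments (for example from the $SERVICESTATE$ and $SERVICESTATETYPE$ macros); the real restart command is left out.

```shell
#!/bin/sh
# Sketch of the guard logic for a volatile service's event handler.
# handler_action is a hypothetical helper:
#   $1 = service state (OK, WARNING, UNKNOWN, CRITICAL)
#   $2 = state type (SOFT or HARD)
# Prints "restart" only for a hard CRITICAL state, "skip" for everything else.
handler_action() {
    if [ "$1" = "CRITICAL" ] && [ "$2" = "HARD" ]; then
        echo "restart"
    else
        echo "skip"
    fi
}

# In the real handler, the restart command would run only when
# handler_action prints "restart"; every other check bails out early.
```

With a volatile service, the handler is invoked again on each non-OK check, so the guard is what keeps it from acting on anything other than the state you care about.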
Does that make sense?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: service restart event handler repetition.
OK, so...
The script on the monitored server restarts the service with a 3-minute wait between stop and start.
The script on the nagios server only triggers event handlers if the service enters a hard state:
Code:
#!/bin/sh
# Event handler for restarting Linux/BSD/Windows services.
# Assumes it is called as:
#   $USER1$/servicerestart.sh $SERVICESTATE$ $HOSTADDRESS$ $_SERVICESERVICE$ $SERVICESTATETYPE$
# (the fourth argument is needed because the script tests "$4" for HARD)
case "$1" in
    OK)
        ;;
    WARNING)
        ;;
    UNKNOWN)
        ;;
    CRITICAL)
        # Only restart on a HARD state; "=" is the portable comparison under /bin/sh.
        if [ "$4" = "HARD" ]; then
            /usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c runcmd -a "$3"
        fi
        ;;
esac
exit 0

The service is set with a retry interval of 4 minutes and a max check attempts of 2. It is ALSO now set to volatile.
So, if I'm understanding you correctly: if the service is still down after the second check, the event handler triggers. That is the normal part. But now that it's set to volatile, if the restart failed and the service is still down when it is rechecked 4 minutes later, it will basically send a new hard state and re-trigger the event handler, repeating this every 4 minutes until the service returns an OK?
Re: service restart event handler repetition.
That sounds about right. Keeping a one-minute buffer between the service start and the next check was a good idea too.
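The buffer is just the retry interval minus the restart wait. A tiny sketch of that arithmetic, using the numbers from this thread (the function name buffer_min is hypothetical):

```shell
#!/bin/sh
# Sketch: minutes left between the end of the restart wait and the next check.
#   $1 = retry interval in minutes
#   $2 = wait between stop and start in minutes
buffer_min() {
    echo $(( $1 - $2 ))
}

# Values from the thread: 4-minute retry interval, 3-minute stop/start wait.
# buffer_min 4 3
```

If that number ever drops to zero or below, the next check could run before the service has finished starting, and the handler would fire again needlessly.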
Former Nagios employee