
restart a service by SSH.

Posted: Thu Aug 30, 2012 8:47 am
by lraymond
OK, I had a nice long thread which sadly never reached a working solution: a Java server runs out of RAM and I can trip a memory alert, but using NRPE I just can't kill and restart the service. I had some great help and ideas, but never got it to kill/restart.

So, still having the Java issues, I am wondering whether I can kick off a local process instead. I'm sure I can, but I'm wondering what/how. I can set up some SSH keys and some port forwarding on my load balancer, so that when a critical hits via check_nrpe!check_mem, Nagios simply runs /usr/lib/nagios/plugins/restartremoteservice.sh

That would be a local bash script that would SSH in (using keys, so no password), kill the java PID and restart the app!
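A rough sketch of what such a script might look like. The user, host, port and the remote start command are all placeholders (none of them come from this thread), and it assumes key-based login already works for the user Nagios runs as:

```shell
#!/bin/sh
# Sketch of restartremoteservice.sh. REMOTE_USER, REMOTE_HOST, REMOTE_PORT and
# the remote start command are hypothetical placeholders, not real values.
REMOTE_USER="${REMOTE_USER:-nagios}"
REMOTE_HOST="${REMOTE_HOST:-lb.example.com}"
REMOTE_PORT="${REMOTE_PORT:-2222}"   # port forwarded on the load balancer
SSH_BIN="${SSH_BIN:-ssh}"

# Command line executed on the remote box: kill java, pause, start the app.
# "/etc/init.d/myapp start" stands in for however the app is really started.
remote_command() {
    printf '%s' 'pkill -9 -x java; sleep 2; /etc/init.d/myapp start'
}

restart_remote() {
    # BatchMode makes ssh fail instead of prompting when key auth is broken,
    # which matters because Nagios runs this without a terminal.
    "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=10 \
        -p "$REMOTE_PORT" "$REMOTE_USER@$REMOTE_HOST" "$(remote_command)"
}

# Only act when called with "restart", so a bare run does nothing.
if [ "${1:-}" = "restart" ]; then
    restart_remote
fi
```

Running it by hand as the Nagios user first (`restartremoteservice.sh restart`) is probably the quickest way to confirm the keys and the forwarded port behave before wiring it into Nagios.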

Thanks to everyone who tried restart version 1, so now gonna try restart version 2 :)

Re: restart a service by SSH.

Posted: Thu Aug 30, 2012 9:41 am
by yancy
lraymond,

It sounds like you already have a working script which will SSH to a machine and perform some actions.

You can set up event handling to execute your script upon a particular event (such as a critical check_nrpe result):
http://nagios.sourceforge.net/docs/3_0/ ... dlers.html
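As a minimal sketch of the handler side: this assumes the event handler command is defined to pass $SERVICESTATE$ and $SERVICESTATETYPE$ as the first two arguments, and that RESTART_SCRIPT points at your own restart script (the default path below is only taken from your post, the variable name is made up):

```shell
#!/bin/sh
# Sketch of an event handler wrapper. Assumes the command definition passes
# $SERVICESTATE$ and $SERVICESTATETYPE$ as arguments 1 and 2.
RESTART_SCRIPT="${RESTART_SCRIPT:-/usr/lib/nagios/plugins/restartremoteservice.sh}"

# Succeeds only for a HARD CRITICAL state; everything else is ignored.
should_restart() {
    case "$1" in
        CRITICAL)
            [ "$2" = "HARD" ]
            ;;
        *)
            return 1
            ;;
    esac
}

# Do nothing unless Nagios supplied both the state and the state type.
if [ "$#" -ge 2 ] && should_restart "$1" "$2"; then
    "$RESTART_SCRIPT"
fi
```

With max_check_attempts set to 1 the first CRITICAL result is already a HARD state, so a wrapper like this would act on the first failure.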

Regards,

-Yancy

Re: restart a service by SSH.

Posted: Fri Aug 31, 2012 11:58 am
by lraymond
Cool, got things going. The only issue seems to be that when the event fires, it does find/kill java, writes an entry in the log, and then fires again about 2 minutes later. So I would like the config to say check every minute or two, but if something happens, fire the event handler, then wait 5 minutes (something like that). The service looks like this:


define service{
          use                   generic-service
          host_name             GFS3
          service_description   Memory + Restart
          check_command         check_nrpe_lb!check_mem
          event_handler         restart_gfs3
          max_check_attempts    1
          check_interval        2   
          retry_interval        2
}
I looked at http://nagios.sourceforge.net/docs/3_0/ ... val_length trying to use interval_length, but Nagios complained about it on restart. So I changed the check interval to 4, but that still doesn't seem to be enough time, as the script comes back 4 minutes later and restarts it again. The second restart is enough and on the 3rd pass all is green, but I would like to say:

Check every minute. If critical, don't wait for a 2nd attempt, just fire the event handler, then go to sleep and wait 5 minutes (something like that).

Thanks
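One possible way to get the "fire, then wait 5 minutes" behaviour is to enforce the pause inside the handler script rather than in the service definition. The sketch below is only an illustration under that assumption: STAMP_FILE and COOLDOWN are made-up names, the stamp file location is arbitrary, and it expects the service state as the first argument.

```shell
#!/bin/sh
# Sketch of a cooldown guard for the event handler. STAMP_FILE and COOLDOWN
# are hypothetical names; 300 seconds matches the "wait 5 minutes" idea.
STAMP_FILE="${STAMP_FILE:-/tmp/restart_gfs3.stamp}"
COOLDOWN="${COOLDOWN:-300}"

# Succeeds if no restart was recorded within the last COOLDOWN seconds.
cooldown_expired() {
    now=$(date +%s)
    last=0
    [ -f "$STAMP_FILE" ] && last=$(cat "$STAMP_FILE")
    last=${last:-0}               # treat an empty stamp file as "never"
    [ $((now - last)) -ge "$COOLDOWN" ]
}

# Record the time of the restart that is about to run.
mark_restart() {
    date +%s > "$STAMP_FILE"
}

# $1 is expected to be the service state passed in by Nagios.
if [ "${1:-}" = "CRITICAL" ]; then
    if cooldown_expired; then
        mark_restart
        /usr/lib/nagios/plugins/restartremoteservice.sh
    fi
fi
```

With a guard like this the check can keep running every minute: a second CRITICAL inside the 5 minute window would reach the handler but be skipped, giving the app time to come back up.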