Nagios Support Forum

Posted: **Mon Aug 20, 2012 1:09 pm**

This board is so active that it has already helped me get a great monitoring system in place, so thanks.

Next question: I now have some thresholds in place. When one of our Java apps has a problem, it eventually stops responding (the load balancer does take it out), but I know a few minutes ahead of time thanks to a Nagios email on critical RAM usage. I have a bash script that finds java, kills it, waits a second, writes a text entry, and then starts it up again. What's the best practice for running it? The options I see are:

1. These Java servers are behind a load balancer, but NRPE reaches them on a custom port, so can I run that script via NRPE? (I would need some help on where to start with that.)
- or -
2. Run the shell script over SSH. I would have to allow SSH in on a custom port because of the LB, and I can set up SSH keys and go that route, but since #1 is already working well I'm hoping to just do it that way.

Again, thanks to all, great active board leads to a great project indeed!

Posted: **Mon Aug 20, 2012 1:47 pm**

Add this to the top of the bash script you already wrote (assuming it doesn't take any arguments already):

Code:

case "$1" in
    CRITICAL )
        : do nothing
        ;;
    * )
        exit
esac

Then make it the event handler for that service: set up a command definition for the script that takes $SERVICESTATE$ as its first argument, and add 'event_handler <command_name>' to the service definition. It will then run automatically when Nagios detects critical RAM usage for that service.
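A slightly stricter variant, offered only as a sketch: Nagios runs event handlers on soft state changes as well as hard ones, so a guard that looks only at the state can fire more than once per incident. Nagios also provides a $SERVICESTATETYPE$ macro (SOFT or HARD) that the command definition could pass as a second argument. The function name `should_restart` below is made up for illustration, and the restart step is only a placeholder.

```shell
#!/bin/bash
# should_restart STATE STATETYPE
# Succeeds (exit status 0) only for a HARD CRITICAL state.
# Assumes the command definition passes $SERVICESTATE$ and
# $SERVICESTATETYPE$ as the first and second arguments.
should_restart() {
    [ "$1" = "CRITICAL" ] && [ "$2" = "HARD" ]
}

if should_restart "$1" "$2"; then
    echo "restart would run here"
else
    echo "no action for state '${1:-none}' type '${2:-none}'"
fi
```

Whether to act on the first soft CRITICAL or wait for the hard state is a judgment call; waiting trades a few minutes of delay for fewer unnecessary restarts.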

Posted: **Mon Aug 20, 2012 3:08 pm**

OK, part 1 I found fine (I assume it goes at the top, after my #!/bin/bash declaration).

As for the next part, I need a map (or some URL to read up on). I understand that the command definition is created on the Nagios box and then referenced in the service, but here is the problem, along with some definitions:

define service{
use generic-service,srv-pnp
hostgroup_name Glassfish-Servers
service_description Memory Usage
check_command check_nrpe_lb!check_mem
}

the command is;

define command{
command_name check_nrpe_lb
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -p $_HOSTTCPORT$
}

So, currently the bash script is local to the remote Java server. As I asked in the first post: does the script have to be local to the Nagios server and be run from there over SSH with keys (which means more port work), OR can the command be a remote one, something like check_nrpe_lb_restart_java, where restart_java is defined in the Java server's local nrpe.cfg to be that shell script?

Sorry if this is written poorly or overcomplicated, just not sure how to get the script to fire and how to accomplish.

Tnx

Posted: **Mon Aug 20, 2012 3:46 pm**

The command would be run by the Nagios server, and sent to the remote one. So here is the mapping:

Service - Nagios Server
Service runs command - Nagios Server
command is sent to remote server

An NRPE command can return data, or it can trigger another script on the remote server, for example to restart a service or run a program.

What type of documentation are you looking for? Help on setting up Services? Commands? Event Handlers?

Posted: **Mon Aug 20, 2012 3:58 pm**

Sorry, I really should have included more information in my last post. So, if the script is local to your Java server and you already have NRPE set up, it would be easiest to use NRPE to run it. So, instead of setting up a new Nagios command definition, you're going to set up an NRPE command definition that looks like this:

Code:

command[restart_java_app]=/path/to/script $ARG1$

That goes in nrpe.cfg on the same machine as the Java app. Also, enable dont_blame_nrpe in that same file if you haven't already. That'll enable command line arguments. Then set up the event handler in your service definition:

Code:

define service{
use generic-service,srv-pnp
hostgroup_name Glassfish-Servers
service_description Memory Usage
check_command check_nrpe_lb!check_mem
event_handler check_nrpe_lb!restart_java_app!-a $SERVICESTATE$
}

I'm not sure exactly what your definition of check_nrpe_lb looks like, so I took a guess. It might have to be tweaked a little bit. Hope that helps, let me know if you get stuck!
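Before wiring up the event handler, it can help to confirm that the remote nrpe.cfg really does allow arguments and that the command line references $ARG1$. The following is a rough sketch, not an official NRPE tool: the function names are invented here, and the path to nrpe.cfg on your system is something you would have to supply. NRPE generally also has to be built with command-argument support for dont_blame_nrpe to have any effect.

```shell
#!/bin/bash
# Succeeds if the given nrpe.cfg enables command arguments.
nrpe_args_enabled() {
    grep -Eq '^[[:space:]]*dont_blame_nrpe[[:space:]]*=[[:space:]]*1' "$1"
}

# Succeeds if command[NAME] in the given nrpe.cfg references $ARG1$.
# $1 = config file, $2 = command name
nrpe_command_takes_arg() {
    grep -E "^[[:space:]]*command\[$2\][[:space:]]*=" "$1" | grep -Fq '$ARG1$'
}

# Demo against a throwaway file shaped like the nrpe.cfg lines above.
cfg=$(mktemp)
printf '%s\n' 'dont_blame_nrpe=1' \
    'command[restart_java_app]=/path/to/script $ARG1$' > "$cfg"
nrpe_args_enabled "$cfg" && echo "arguments enabled"
nrpe_command_takes_arg "$cfg" restart_java_app && echo "restart_java_app takes an argument"
rm -f "$cfg"
```

On the Java server you would point both functions at the real nrpe.cfg instead of the temporary file.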

Posted: **Tue Aug 21, 2012 9:55 pm**

OK, I deleted my few earlier posts, but I have reached my stopping point. I have tweaked the files and permissions, and it works perfectly locally but not at all remotely. I will revisit permissions, but currently the nagios user has sudo with NOPASSWD. The bash script looks like this:

Code:

#!/bin/bash
case "$1" in
    CRITICAL )
        # %M is minutes (%m would print the month)
        ts=`date "+%F -  %k:%M"`
        # awk emits the kill and sleep commands; bash then runs them
        pidof java | awk '{print "kill -9 "$1"; sleep 30"}' | bash
        echo restart:  $ts >> /var/log/glassfish_restart.log
        /var/lib/glassfish/bin/asadmin start-domain --user admin --passwordfile /var/lib/glassfish/docs/passwd domain1
        ;;
    * )
        exit
esac

Now, running that as the nagios user, the date/time gets written to the log, java is killed, and things restart perfectly. When I run the command from the Nagios server at the command line I get:
./check_nrpe -H IP -p 5669 -c restart_gf $CRITICAL$
NRPE: Unable to read output

at the same time at the client side I see;
Aug 21 22:24:42 gfs3 nrpe[17436]: Host is asking for command 'restart_gf' to be run...
Aug 21 22:24:42 gfs3 nrpe[17436]: Running command: /usr/lib/nagios/plugins/restart_gf.sh
-- note: should it show the CRITICAL in the above line? --
Aug 21 22:24:42 gfs3 nrpe[17436]: Command completed with return code 0 and output:
Aug 21 22:24:42 gfs3 nrpe[17436]: Return Code: 0, Output: NRPE: Unable to read output

I've tried every variation, from changing the bash script to run each line prefixed with sudo, to prefixing the command in nrpe.cfg with sudo. The output changes to:
Aug 21 22:39:43 gfs3 nrpe[17883]: Running command: sudo /usr/lib/nagios/plugins/restart_gf.sh
but still nothing.

I've tried adding command_prefix sudo in the NRPE config, and I think I have tried everything else permission-wise, so I'm completely stuck now *sigh*.

Anyway, thanks for all reads/suggestions!

Posted: **Wed Aug 22, 2012 9:54 am**

Iraymond wrote:
Code:
./check_nrpe -H IP -p 5669 -c restart_gf $CRITICAL$

If you're just testing from the command line, this should be CRITICAL, not $CRITICAL$. In your config files it should be $SERVICESTATE$ instead.

It sounds like you're unsure of whether your script needs root permissions or not. To test it, try running it as the nagios user:

Code:

su nagios -s /bin/bash -c "./check_nrpe -H IP -p 5669 -c restart_gf CRITICAL"

If your command needs root permissions, you need to make sure that requiretty is off for the nagios user. It has to be explicitly enabled in the sudoers file, so you should be able to clearly see whether it's enabled or not. Otherwise root permissions will be denied without an actual terminal to connect to (so it will work locally but not remotely).
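One more thing worth ruling out, as a hedged guess rather than a confirmed diagnosis: the syslog excerpt earlier in this thread shows restart_gf.sh being run with no argument, and check_nrpe normally forwards arguments only when they follow -a. With no argument, the script's case statement takes the * branch, exits 0, and prints nothing, which would fit a return code of 0 together with "NRPE: Unable to read output". The function below is a hypothetical stand-in for the top of the script (the name `handle_state` is made up) that prints one line on every path, so NRPE always has something to return.

```shell
#!/bin/bash
# Hypothetical stand-in for the case statement in restart_gf.sh.
# Same branching, but every branch prints a line of output.
handle_state() {
    case "$1" in
        CRITICAL )
            echo "restart requested (state: $1)"
            ;;
        * )
            echo "no action (state: ${1:-none})"
            ;;
    esac
}

handle_state CRITICAL   # what the script sees when the argument arrives
handle_state ""         # what it sees when no argument is passed
```

If that is what is happening, a command-line test would pass the state as `-c restart_gf -a CRITICAL`, and the nrpe.cfg command line for restart_gf would need to end in $ARG1$ with dont_blame_nrpe enabled.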

Posted: **Wed Aug 22, 2012 11:40 am**

OK, I'm here ALL day now to beat the heck out of this. So, to recap:

client has in /etc/passwd:
nagios:x:107:113::/var/lib/nagios:/bin/bash (shell enabled just out of curiosity, but /bin/false was the default)

/etc/sudoers has:
Defaults:nagios !requiretty

Still on the GF server;
(as nagios user)

Code:

nagios@gfs3:/usr/lib/nagios/plugins$ ./restart_gf.sh CRITICAL

bash: line 1: kill: (18638) - Operation not permitted
No write permission: /var/lib/glassfish/domains
CLI156 Could not start the domain domain1.
(then running sudo);

Code:

nagios@gfs3:/usr/lib/nagios/plugins$ sudo ./restart_gf.sh CRITICAL

Starting Domain domain1, please wait.
Default Log location is /var/lib/glassfish/domains/domain1/logs/server.log.
Redirecting output to /var/log/glassfish_server.log

So locally, via sudo, the nagios user has the access to start/stop/kill, all the good stuff needed for the script.

Now on the nagios server, as root I issue the command;

Code:

#su nagios -s /bin/bash -c "./check_nrpe -H 38.101.125.169 -p 5669 -c restart_gf CRITICAL"

NRPE: Unable to read output

with the same output in the remote/syslog;

Aug 22 12:36:02 gfs3 nrpe[32187]: Host is asking for command 'restart_gf' to be run...
Aug 22 12:36:02 gfs3 nrpe[32187]: Running command: /usr/bin/sudo /usr/lib/nagios/plugins/restart_gf.sh
Aug 22 12:36:02 gfs3 nrpe[32187]: Command completed with return code 0 and output:
Aug 22 12:36:02 gfs3 nrpe[32187]: Return Code: 0, Output: NRPE: Unable to read output

I'm convinced it's permissions, just not sure what/where! Glad you're back working today.

Posted: **Wed Aug 22, 2012 1:26 pm**

Does sudo prompt you for a password when you run it as nagios?

Posted: **Wed Aug 22, 2012 2:58 pm**

Nope, it did until I added the nagios ALL = NOPASSWD: ALL option in the sudoers file. As in the exact output I copy/pasted from the client when I ran the restart_gf script: no password, nothin' but smooth sailin'.

best way to run remote shell on critical
