
service checks failing from scheduler, not cli

Posted: Tue Jun 24, 2014 9:49 am
by kendallchenoweth
I have three checks that work when run from the command line (taken from the service test command) but fail when run by the Nagios scheduler. The same check command works for other services, but not against these three hosts.

/usr/local/nagios/libexec/mw_check_mathworks_app.pl -H services-integ1.mathworks.com -a activationws -p 80
OK: All tests passed

The same check, when run by the scheduler, returns CRITICAL 404 Not Found.

What is the best debug mode for Nagios to see what's happening when the scheduler runs this check?

Thanks in advance.

Re: service checks failing from scheduler, not cli

Posted: Tue Jun 24, 2014 1:50 pm
by lmiltchev
Can you show us the service and command definitions for this check? Can you run the same command in the CLI as nagios user and get the correct output?

Re: service checks failing from scheduler, not cli

Posted: Tue Jun 24, 2014 2:05 pm
by slansing
Can you show us the command definition you are using for the service check, as well as the service's own configuration so we can take a look at what you've got? Nagios is going to run the command you give it, so it is important you pair the arguments correctly, etc. That is what we want to take a look at.

Re: service checks failing from scheduler, not cli

Posted: Tue Jun 24, 2014 2:15 pm
by kendallchenoweth
/usr/local/nagios/libexec/mw_check_mathworks_app.pl -H services-integ1.mathworks.com -a activationws -p 80

Code: Select all

define service {
	service_description		HTTP /activationws 80
	use				webvip-24x7-service_notify
	hostgroup_name			services-integ
	display_name			HTTP /activationws 80
	check_command			check-mw-app!activationws!-p 80!!!!!!
	obsess_over_service		1
	register			1
	}	

Code: Select all

define command {
       command_name                             check-mw-app
       command_line                             $USER1$/mw_check_mathworks_app.pl -H $HOSTNAME$ -a $ARG1$ $ARG2$
}
mw_check_mathworks_app.pl

Code: Select all

#!/usr/bin/perl -w
####################### check_apachestatus.pl #######################
# Version : 1.1
# Date : 27 Jul 2007
# Author  : De Bodt Lieven (Lieven.DeBodt at gmail.com)
# Licence : GPL - http://www.fsf.org/licenses/gpl.txt
#############################################################
#
# help : ./check_passenger_app.pl -h

use strict;
use Getopt::Long;
use LWP::UserAgent;
use Time::HiRes qw(gettimeofday tv_interval);

# Nagios specific

#use lib "/usr/lib64/nagios/plugins";
use lib "/usr/local/nagios/libexec";
use mw_utils qw(%ERRORS $TIMEOUT);
my %ERRORS=('OK'=>0,'WARNING'=>1,'CRITICAL'=>2,'UNKNOWN'=>3,'DEPENDENT'=>4);

# Globals

my $Version='1.0';
my $Name=$0;

my $o_host =		undef; 		# hostname
my $o_help=		undef; 		# want some help ?
my $o_port = 		undef; 		# port
my $o_version= 		undef;  	# print version
my $o_app= 		undef;  	# app name
my $o_warn_level=	"DEGRADED";  	# Status string that maps to WARNING
my $o_crit_level=	"FAILURE";  	# Status string that maps to CRITICAL
my $o_timeout=  	60;            	# Default timeout in seconds
my $o_vhost=	  	undef;         	# Virtual Host
my $o_ua=               "MathWorks Nagios App Checker";          # Useragent

# functions

sub show_versioninfo { print "$Name version : $Version\n"; }

sub print_usage {
  print "Usage: $Name -H <host> -a <app> [-p <port>] [-t <timeout>] [-v <vhost>] [-u <useragent>] [-V]\n";
}

# Get the alarm signal
$SIG{'ALRM'} = sub {
  print ("ERROR: Alarm signal (Nagios time-out)\n");
  exit $ERRORS{"CRITICAL"};
};

sub help {
  print "Passenger App Monitor for Nagios version ",$Version,"\n";
  print_usage();
  print <<EOT;
-h, --help
   print this help message
-H, --hostname=HOST
   name or IP address of host to check
-p, --port=PORT
   Http port
-a, --app=APP
   Passenger application name
-u, --useragent=USER-AGENT
   User agent to send the request with
-t, --timeout=INTEGER
   timeout in seconds (Default: $o_timeout)
-v, --vhost=VHOST
   Name based virtual host
-V, --version
   prints version number
Note :
  The script will return
        OK       if the status check returns all OK,
        WARNING  if there is one or more DEGRADED checks,
        CRITICAL if there is one or more FAILURE checks, or the check doesn't respond,
        UNKNOWN

EOT
}

sub check_options {
  Getopt::Long::Configure ("bundling");
  GetOptions(
      'h'     => \$o_help,        'help'          => \$o_help,
      'H:s'   => \$o_host,        'hostname:s'	  => \$o_host,
      'p:i'   => \$o_port,        'port:i'	  => \$o_port,
      'a:s'   => \$o_app,         'app:s'	  => \$o_app,
      'V'     => \$o_version,     'version'       => \$o_version,
      't:i'   => \$o_timeout,     'timeout:i'     => \$o_timeout,
      'v:s'   => \$o_vhost,       'vhost:s'       => \$o_vhost,
      'u:s'   => \$o_ua,          'useragent:s'   => \$o_ua,

  );

  if (defined ($o_help)) { help(); exit $ERRORS{"UNKNOWN"}};
  if (defined($o_version)) { show_versioninfo(); exit $ERRORS{"UNKNOWN"}};
  # Check compulsory attributes
  if (!defined($o_host)) { print_usage(); exit $ERRORS{"UNKNOWN"}};
  if (!defined($o_app)) { print_usage(); exit $ERRORS{"UNKNOWN"}};
  if (!defined($o_vhost)) { $o_vhost=$o_host };
}

########## MAIN ##########

check_options();

my $ua = LWP::UserAgent->new( protocols_allowed => ['http'], timeout => $o_timeout, agent => $o_ua
);
$ua->default_header(Host => $o_vhost);
my $timing0 = [gettimeofday];
my $response = undef;
if (!defined($o_port)) {
  $response = $ua->get('http://' . $o_host . '/' . $o_app . '/admin/status');
} else {
  $response = $ua->get('http://' . $o_host . ':' . $o_port . '/' . $o_app .  '/admin/status');
}
my $timeelapsed = tv_interval ($timing0, [gettimeofday]);
my $status = undef;
my $critical = undef;

my $webcontent = undef;
if ($response->code == 200 || $response->code == 500) {
#####if ($response->code eq 200) {
  $status = "OK" }
else {
  $status = "RESPONSE_ERROR" }

if ($status eq "OK") {
  $webcontent=$response->content;
  my @webcontentarr = split("</tr>", $webcontent);
  my $i = 0;
  # Get overall status
  while ($i < @webcontentarr)  {
  #print $webcontentarr[$i] . "\n\n\n";
    for ($webcontentarr[$i]) {
      if    (/<title>(\D+)<\/title>/i) { $status = $1; }
      elsif (/<td.*?>(.*?)<\/td><td.*?>(.*?)<\/td><td.*?>(<strong>)?(.*?)(<\/strong>)?<\/td>/) {
        my $test = $1;
        $critical = $2;
        my $teststat = $4;
        if ( $teststat !~ m/ok/i ) {
          print $teststat . ": " . $test . " test failed" . "\n"; }
      }
    }
    $i++;
  }
}

for ($status) {
  if    ( /OK/i ) {
    print "OK: All tests passed\n";
    exit $ERRORS{"OK"} }
  elsif ( /DEGRADED/i ) { exit $ERRORS{"WARNING"} }
  elsif ( /FAILURE/i ) { exit $ERRORS{"CRITICAL"} }
  else {
    printf("CRITICAL %s\n", $response->status_line);
    exit $ERRORS{"CRITICAL"} }
}
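
Reading the service definition, the command definition and the plugin together, here is a rough sketch of what the scheduler ends up running. $HOSTNAME$ expands to the host_name of the host object, which is not necessarily the FQDN typed on the command line. The value of USER1 is assumed from the path used on the CLI, and the short host_name below is purely hypothetical, since the host definitions are not shown in this thread.

```shell
#!/bin/sh
# Sketch: expand the Nagios command_line by hand to see what the scheduler would run.
# USER1 and the host_name values passed in below are assumptions, not values
# taken from the poster's configuration.

USER1="/usr/local/nagios/libexec"

# expand_check HOSTNAME ARG1 ARG2
# Mirrors: $USER1$/mw_check_mathworks_app.pl -H $HOSTNAME$ -a $ARG1$ $ARG2$
expand_check() {
    printf '%s/mw_check_mathworks_app.pl -H %s -a %s %s\n' "$USER1" "$1" "$2" "$3"
}

# plugin_request HOST APP PORT
# Mirrors the plugin: with no -v/--vhost, both the URL host and the
# Host header are taken from -H.
plugin_request() {
    printf 'GET http://%s:%s/%s/admin/status (Host: %s)\n' "$1" "$3" "$2" "$1"
}

# check_command  check-mw-app!activationws!-p 80
expand_check "services-integ1.mathworks.com" "activationws" "-p 80"  # what was typed on the CLI
expand_check "services-integ1" "activationws" "-p 80"                # hypothetical short host_name
plugin_request "services-integ1" "activationws" "80"
```

If the two expanded command lines differ in the -H value, the plugin sends a different URL and Host header in each case, and a load balancer that routes on the Host header could plausibly answer one with 200 and the other with 404.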

Re: service checks failing from scheduler, not cli

Posted: Tue Jun 24, 2014 4:25 pm
by sreinhardt
What happens if you run the command as the nagios user? Also what are the permissions on your plugin?

Code: Select all

su - nagios -c '/usr/local/nagios/libexec/mw_check_mathworks_app.pl -H services-integ1.mathworks.com -a activationws -p 80'
ls -lart /usr/local/nagios/libexec/mw_check_mathworks_app.pl
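
A related thing to try: "su - nagios" gives you a full login environment, while the scheduler starts plugins from the daemon's own, much smaller environment. The wrapper below is only a sketch of how to approximate that by hand. The run_clean name and the PATH value are made up for this example, and it does not reproduce the scheduler's environment exactly.

```shell
#!/bin/sh
# run_clean CMD [ARGS...]
# Runs CMD with an almost empty environment (only a basic PATH), which is
# closer to what a daemon hands to a plugin than a login shell is.
run_clean() {
    env -i PATH=/usr/bin:/bin "$@"
}

# Small demonstration: a variable exported in this shell is not visible
# to a command started through run_clean.
DEMO_VAR="set-in-login-shell"
export DEMO_VAR
run_clean sh -c 'echo "DEMO_VAR=${DEMO_VAR:-unset}"'

# Intended use, as the nagios user (not executed here):
# run_clean /usr/local/nagios/libexec/mw_check_mathworks_app.pl -H services-integ1.mathworks.com -a activationws -p 80
```

If the plugin gives a different result under run_clean than in the login shell, something in the environment is involved.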

Re: service checks failing from scheduler, not cli

Posted: Wed Jun 25, 2014 9:01 am
by kendallchenoweth

Code: Select all

[nagios@nagiosxinonprod-00-ls libexec]$ whoami
nagios
[nagios@nagiosxinonprod-00-ls libexec]$ /usr/local/nagios/libexec/mw_check_mathworks_app.pl -H services-integ1.mathworks.com -a activationws -p 80
OK: All tests passed
[nagios@nagiosxinonprod-00-ls libexec]$ ls -lart /usr/local/nagios/libexec/mw_check_mathworks_app.pl
-rwxr-xr-x 1 nagios users 4842 May 20 14:29 /usr/local/nagios/libexec/mw_check_mathworks_app.pl
This same check works in Core 3.4.1 from both the CLI and the scheduler. I was thinking it had something to do with the DNS redirect through a load balancer.

Code: Select all

nslookup
> services-integ1
Server:		X.X.X.X.
Address:	X.X.X.X#53

services-integ1.mathworks.com	canonical name = lb.mathworks.com.
Name:	lb.mathworks.com
Address: Y.Y.Y.Y
> services-integ2

services-integ2.mathworks.com	canonical name = lb.mathworks.com.
Name:	lb.mathworks.com
Address: Y.Y.Y.Y
> services-integ3

services-integ3.mathworks.com	canonical name = lb.mathworks.com.
Name:	lb.mathworks.com
Address: Y.Y.Y.Y
Because of this wrinkle, I was hoping to find the optimal flags to put into the debug options so I can see how the scheduler is executing the check vs the command line...
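
On the debug flags: debug_level in nagios.cfg is a bitmask, so the categories of interest are added together. The sketch below uses the values listed in the comments of the stock Nagios Core nagios.cfg (16 = host/service checks, 256 = commands, 2048 = macros). The debug_file path is an assumption based on the /usr/local/nagios prefix used in this thread, so check it against your own install.

```shell
#!/bin/sh
# Sketch: build a debug_level bitmask for nagios.cfg from the categories
# that matter when comparing scheduler and CLI runs.
CHECKS=16       # host/service checks
COMMANDS=256    # commands
MACROS=2048     # macro expansion

# Adds the selected categories together into a single debug_level value.
debug_level() {
    echo $(( CHECKS + COMMANDS + MACROS ))
}

# Prints the lines to place in nagios.cfg.
# debug_file is an assumed location, adjust it to your installation.
print_debug_settings() {
    printf 'debug_level=%s\n' "$(debug_level)"
    printf 'debug_verbosity=2\n'
    printf 'debug_file=/usr/local/nagios/var/nagios.debug\n'
}

print_debug_settings
```

This prints debug_level=2320. After a restart, searching the debug file for the plugin name (for example with grep mw_check_mathworks_app) should show the expanded macros and the command line the scheduler built. The debug file grows quickly at verbosity 2, so it is worth setting debug_level back to 0 when done.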

Re: service checks failing from scheduler, not cli

Posted: Wed Jun 25, 2014 10:51 am
by kendallchenoweth
I found a difference in the host resolution and am pursuing the problem there. You can close this ticket. Thanks!