check_live for Nagios 4

disconnex
Posts: 2
Joined: Mon Sep 19, 2016 9:41 am

check_live for Nagios 4

Post by disconnex »

I recently replaced my production Nagios server due to a hardware issue. In doing so, I upgraded from Nagios 3 on CentOS 6.5 to Nagios 4 on CentOS 7. While porting over our configs and checks, the one plugin I had an issue with was check_live. I came across a bug report for RHEL that said Nagios 4 does not support Livestatus, which check_live uses. Does anyone know of a workaround, or an alternative to check_live on Nagios 4? Or has anyone else encountered this and is willing to share the steps they took to correct it?
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: check_live for Nagios 4

Post by mcapra »

Can you share the full contents of the check_live script or the source you acquired this plugin from?
Former Nagios employee
https://www.mcapra.com/
disconnex
Posts: 2
Joined: Mon Sep 19, 2016 9:41 am

Re: check_live for Nagios 4

Post by disconnex »

Thank you for the reply. Here is the code from check_live.py:

Code: Select all

#!/usr/bin/python

import socket
import sys
import argparse

parser = argparse.ArgumentParser(description='check service status for a hostgroup')
parser.add_argument('-H', '--hostgroup', help='Host group to check')
parser.add_argument('-S', '--service',   required=True, help='Service to check')
parser.add_argument('-W', '--warn',      required=True, help='Number of WARN services to alert on')
parser.add_argument('-C', '--critical',  required=True, help='Number of CRIT services to alert on')
parser.add_argument('--socket', default='/var/spool/nagios/rw/live', help='livestatus socket location, defaults to /var/spool/nagios/rw/live')
args = parser.parse_args()

if '%' in args.warn or '%' in args.critical:
  percent = True
else:
  percent = False

socket_path = args.socket

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect(socket_path)

filter = ''
if args.hostgroup is not None:
  filter += 'Filter: host_groups >= %s\n' % (args.hostgroup)
filter += 'Filter: description = %s\n' % (args.service)

if percent:
  query = '''GET services
Stats: last_hard_state = 0
Stats: last_hard_state = 1
Stats: last_hard_state = 2
Stats: last_hard_state = 3
Stats: state >= 0
%s
'''% (filter)
else:
  query = '''GET services
Stats: last_hard_state = 0
Stats: last_hard_state = 1
Stats: last_hard_state = 2
Stats: last_hard_state = 3
%s
'''% (filter)

s.send(query)

s.shutdown(socket.SHUT_WR)

answer = s.recv(100000000).strip().split(';')

print "%s OK %s WARN %s CRIT %s on %s" % (answer[0], answer[1], answer[2], args.service, args.hostgroup)
if percent:
  if int(answer[2]) >= int(float(args.critical.strip('%'))/100*int(answer[4])):
    sys.exit(2)
  elif int(answer[1]) >= int(float(args.warn.strip('%'))/100*int(answer[4])):
    sys.exit(1)
  else:
    sys.exit(0)
else:
  if int(answer[2]) >= int(args.critical):
    sys.exit(2)
  elif int(answer[1]) >= int(args.warn):
    sys.exit(1)
  else:
    sys.exit(0)
This check queries Livestatus, which from what I have read is not supported on Nagios 4. I came across an op5 workaround; however, that GitHub page is no longer available.
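For reference, the Livestatus request the script sends can be sketched in Python 3 as below. This is only a sketch of the same query, not a replacement plugin: the function names `build_query` and `run_query` are made up for illustration, and it assumes a Livestatus socket that answers with one semicolon-separated line of counts, as the script above expects.

```python
import socket


def build_query(service, hostgroup=None, with_total=False):
    """Build the Livestatus query text that check_live sends (hypothetical helper)."""
    lines = ["GET services"]
    # One Stats line per hard state: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
    lines += ["Stats: last_hard_state = %d" % state for state in range(4)]
    if with_total:
        # Matches every service, so this column is the total count.
        lines.append("Stats: state >= 0")
    if hostgroup is not None:
        lines.append("Filter: host_groups >= %s" % hostgroup)
    lines.append("Filter: description = %s" % service)
    # A blank line terminates the request.
    return "\n".join(lines) + "\n\n"


def run_query(socket_path, query):
    """Send the query over the UNIX socket and return the counts as ints."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendall(query.encode("utf-8"))  # Python 3 sockets take bytes, not str
        s.shutdown(socket.SHUT_WR)
        chunks = []
        while True:  # read until the server closes the connection
            chunk = s.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    answer = b"".join(chunks).decode("utf-8").strip()
    return [int(field) for field in answer.split(";")]
```

Something like `run_query('/var/spool/nagios/rw/live', build_query('HTTP - Datanode', 'cdh-cluster', with_total=True))` would then return the OK, WARN, CRIT, UNKNOWN and total counts, provided the socket exists at that path.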

An example of how we use this check:

Code: Select all

define service{
        use                     pager-service
        host_name               bumble
        service_description     CDH Tasktrackers
        check_command           check_live!-H cdh-cluster -S "HTTP - Tasktracker" -W 7% -C 7%
        }

define service{
        use                     pager-service
        host_name               bumble
        service_description     CDH Datanodes
        check_command           check_live!-H cdh-cluster -S "HTTP - Datanode" -W 7% -C 7%
        }
It's used in our clusters such as Hadoop, where we have hundreds of nodes and only need to be alerted when a certain percentage are down. We also use it to alert when a percentage of nodes fail their snapshot age and overflow bytes checks:

Code: Select all

define service{
        use                     pager-service
        host_name               bumble
        service_description     IAD TS Snapshot Age
        check_command           check_live!-H ts-ext -S "JMX - SnapshotFileLastModified" -W 50% -C 50%
        }
If there is a fix for Livestatus in Nagios 4, that would be awesome; however, if there is another plugin that would yield the same results, that would be great too.
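To make the requirement concrete for anyone suggesting an alternative, the threshold logic of check_live can be sketched on its own, independent of Livestatus. This is a minimal sketch that mirrors the comparisons in the script; `parse_threshold` and `evaluate` are illustrative names, not part of any existing plugin.

```python
def parse_threshold(value, total):
    """Turn a threshold such as '7%' or '7' into an absolute service count."""
    if value.endswith("%"):
        # int() truncates, as in check_live.py: 7% of 150 services -> 10.
        return int(float(value.rstrip("%")) / 100 * total)
    return int(value)


def evaluate(n_warn, n_crit, total, warn, crit):
    """Return a Nagios exit code: 2 = CRITICAL, 1 = WARNING, 0 = OK."""
    if n_crit >= parse_threshold(crit, total):
        return 2
    if n_warn >= parse_threshold(warn, total):
        return 1
    return 0


# 11 of 150 services CRITICAL with -W 7% -C 7% gives exit code 2.
print(evaluate(n_warn=0, n_crit=11, total=150, warn="7%", crit="7%"))
```

One thing this makes visible: because the percentage is truncated to an integer, a small group (for example 7% of 10 services) yields a threshold of 0, and the `>=` comparison is then always true, so the check reports CRITICAL even when nothing is down.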
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: check_live for Nagios 4

Post by mcapra »

You should be able to configure Livestatus for Nagios 4 as of version 1.2.6:
https://mathias-kettner.de/checkmk_livestatus.html

Refer to the "Livestatus now supports Nagios 4" section of this document. 1.2.8 has dependencies that fail on my CentOS 7 machine for some reason, but I was able to get it working with 1.2.6 following the instructions on that page. The paths in the Python script may need to be updated for the latest spool file locations; not entirely sure on that one.
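On the path question, one way to find out where the socket actually lives is to probe a few candidate locations and pass the result to the script's `--socket` option. The sketch below is only an illustration: the helper name is made up, and every candidate path other than the script's own default is an assumption that should be replaced with whatever path your Livestatus broker module is configured to use.

```python
import os
import stat

# Candidate locations. Only the first comes from check_live.py's default;
# the second is an assumed source-install location and may not apply to you.
CANDIDATE_SOCKETS = [
    "/var/spool/nagios/rw/live",
    "/usr/local/nagios/var/rw/live",
]


def find_livestatus_socket(candidates=CANDIDATE_SOCKETS):
    """Return the first candidate path that exists and is a UNIX socket, else None."""
    for path in candidates:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            continue  # missing path or no permission: try the next one
        if stat.S_ISSOCK(mode):
            return path
    return None


if __name__ == "__main__":
    print(find_livestatus_socket() or "no Livestatus socket found")
```

If this prints a path other than /var/spool/nagios/rw/live, adding `--socket <that path>` to the check_live command definition should be enough, with no change to the script itself.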
Former Nagios employee
https://www.mcapra.com/