Alarm for % (or count) of Hosts down?

gregbeyer · Post by **gregbeyer** » Mon Sep 09, 2024 11:23 am

Is it possible to create a monitor to watch for a percentage of nodes going down in a hostgroup? In high-performance grid computing, with all its redundancy, we don't care that one or two nodes, or even a dozen in a large cluster, go down; that happens on a daily basis. But if a significant percentage of a cluster goes down (we have clusters in HGs), that indicates trouble, and we want to know.

I've explored BPI some, but our clusters are hundreds or thousands of nodes. 1) It's not practical to add each node to a cluster watch, and 2) each node is not critical anyway, so I think BPI doesn't work for this scenario.

I think what I want is a synthetic monitor; that's what Grafana calls this. Thanks for any ideas.

DoubleDoubleA · Post by **DoubleDoubleA** » Mon Sep 09, 2024 4:18 pm

Hi @gregbeyer,

I think you might find BPI is exactly what you need. https://assets.nagios.com/downloads/nag ... BPI_v2.pdf

You should be able to add from the Hostgroup tab. You can have a BPI check that is just your cluster, and you can set it to some threshold of health, like we're ok as long as 70% of hosts are up.

Maybe take another look and reference the doc there, and check back if it still doesn't work like you'd hoped.

Aaron

bbahn · Post by **bbahn** » Mon Sep 09, 2024 4:28 pm

Hello @gregbeyer,

If you add your hostgroups/servicegroups to BPI, you can hit edit on the groups in question and then set a threshold for each hostgroup/servicegroup.

If you have too many hostgroups/servicegroups and don't want to do this by hand, sync your hostgroups/servicegroups with BPI and then click the settings icon in BPI (the cog). You will see a BPI configuration file (default /usr/local/nagiosxi/etc/components/bpi.conf). You can use a sed command or whatever script you prefer to mass-edit this file to change the config items from the defaults like the following:

Code: Select all

define hg_linux-servers {
        title=HG: linux-servers
        desc=
        primary=1
        info=
        members=localhost;NULL;&, 
        warning_threshold=0
        critical_threshold=0 
        priority=0
        type=hostgroup
        auth_users=
}

to whatever warning and critical thresholds you would like. (warning_threshold=50 means 50%)

Note that these are health thresholds and the critical value should be lower than the warning value.
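As a rough sketch of the mass-edit idea above, here is one way it might look in Python instead of sed. It assumes every group in bpi.conf follows the `define ... { ... }` layout shown in the sample block; the function name `set_bpi_thresholds` and the 80/50 values are made up for illustration and are not part of BPI.

```python
import re

def set_bpi_thresholds(conf_text, warning, critical, group_type="hostgroup"):
    """Return conf_text with thresholds rewritten in every block of group_type.

    Hypothetical helper: assumes the block layout from the sample above.
    """
    # These are health thresholds, so critical must sit below warning.
    if not 0 <= critical < warning <= 100:
        raise ValueError("need 0 <= critical < warning <= 100")

    def rewrite_block(match):
        block = match.group(0)
        # Leave blocks of any other type (e.g. servicegroup) untouched.
        type_line = r"^[ \t]*type=%s[ \t]*$" % re.escape(group_type)
        if not re.search(type_line, block, flags=re.M):
            return block
        block = re.sub(r"^([ \t]*warning_threshold=).*$",
                       r"\g<1>%d" % warning, block, flags=re.M)
        block = re.sub(r"^([ \t]*critical_threshold=).*$",
                       r"\g<1>%d" % critical, block, flags=re.M)
        return block

    # A block runs from "define <name> {" to the next "}" at the start of a line.
    return re.sub(r"^define\s+\S+\s*\{.*?^\}", rewrite_block,
                  conf_text, flags=re.M | re.S)

sample = """define hg_linux-servers {
        title=HG: linux-servers
        warning_threshold=0
        critical_threshold=0
        type=hostgroup
}
"""
print(set_bpi_thresholds(sample, warning=80, critical=50))
```

To apply it for real you would read the bpi.conf path shown in the BPI settings page, pass its contents through the function, and write the result back. Keep a backup copy of the file first.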

kg2857 · Post by **kg2857** » Tue Sep 10, 2024 12:03 am

If BPI isn't what you want, you can write a plugin (script) that runs a few SQL queries, does a bit of math, and creates an alert.
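The "bit of math" part might look something like the sketch below. It deliberately leaves out the SQL queries and takes the host counts as plain inputs. The function name `evaluate_group` and the threshold values are assumptions for illustration; the return codes are the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
def evaluate_group(total_hosts, down_hosts, critical_pct, warning_pct=None):
    """Return (exit_code, message) for a group, based on percent of hosts down.

    Hypothetical helper: the counts would come from whatever query you run.
    """
    if total_hosts <= 0:
        # Nothing to divide by, so report UNKNOWN instead of guessing.
        return 3, "UNKNOWN - hostgroup has no members"

    pct_down = down_hosts / total_hosts * 100.0

    # Check the worse state first; warning is optional.
    if pct_down >= critical_pct:
        code, label = 2, "CRITICAL"
    elif warning_pct is not None and pct_down >= warning_pct:
        code, label = 1, "WARNING"
    else:
        code, label = 0, "OK"

    message = "%s - %d of %d hosts down (%.1f%%)" % (
        label, down_hosts, total_hosts, pct_down)
    return code, message

# A dozen nodes down in a 1400-node cluster stays OK with 10/20 thresholds.
print(evaluate_group(1400, 12, critical_pct=20, warning_pct=10))
```

A real plugin would print the message and call `sys.exit()` with the code.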

gregbeyer · Post by **gregbeyer** » Tue Sep 10, 2024 11:04 am

After taking another look at BPI, and with @DoubleDoubleA's suggestion on how to use the HG, I tried it with one of my smaller clusters. Adding the HG actually added each node in the HG, from which I could also select High Priority hosts. I set a percent for warn and critical. You're right, BPI works.

So I did the same with a larger cluster with 1400 nodes in the HG. Unfortunately, BPI seems to be hung. It's been clocking since yesterday, see shot. None of the previously created BPI groups, nodes, HGs, or SGs will list. Same clocking regardless of which tab I select. I attempted to resolve it by cycling first my browser, then nagios, mysql, and httpd. When I re-open XI and go to BPI, I find the spinner still going. All other functions of XI seem fine.

What hosed BPI? Surely it can handle a 1400 member HG? How do I kill the clocking?

bbahn · Post by **bbahn** » Tue Sep 10, 2024 11:25 am

Hello @gregbeyer,

Can you check your /usr/local/nagiosxi/var/cmdsubsys.log? That should tell you what's going on.

gregbeyer · Post by **gregbeyer** » Tue Sep 10, 2024 11:39 am

grep'd the file for "bpi":

Mon, 09 Sep 2024 12:43:36 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:09:20 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:10:06 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:10:52 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:11:30 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 15:18:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 17:04:01 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 17:04:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Tue, 10 Sep 2024 01:16:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Tue, 10 Sep 2024 11:53:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 262144 bytes) in /usr/local/nagiosxi/html/includes/components/nagiosbpi/classes/BpGroup_class.php on line 0

jmichaelson · Post by **jmichaelson** » Tue Sep 10, 2024 4:06 pm

Hi Greg. For now I'd try increasing the PHP memory limit past the default 1GB: edit the memory_limit line in the php.ini file (its location is distribution-dependent), then restart php-fpm on EL distros, or apache2 on Debian/Ubuntu distros.
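To confirm which limit the failing process actually hit, the byte count in the fatal error line can be converted to GiB. This is only a small sketch; the helper name `php_memory_limit_bytes` is made up, and the regex assumes the message wording from the log excerpt above.

```python
import re

def php_memory_limit_bytes(log_line):
    """Extract the exhausted memory limit (in bytes) from a PHP fatal error line."""
    match = re.search(r"Allowed memory size of (\d+) bytes exhausted", log_line)
    return int(match.group(1)) if match else None

line = ("PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted "
        "(tried to allocate 262144 bytes)")
limit = php_memory_limit_bytes(line)
print(limit / 1024 ** 3)  # prints 1.0, i.e. the limit in effect was 1 GiB
```

If the number printed after raising memory_limit and restarting is still the old value, the edit probably went into a php.ini that the running PHP does not load.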

snapier3 · Post by **snapier3** » Wed Sep 11, 2024 2:08 pm

Reading this post I thought a check of this sort would be a good idea.

I put a basic Python plugin together to gather the hostgroup membership info from the API and check for down hosts in the group.
The plugin gives you a percentage of hosts down for the group and evaluates it against the thresholds you provide to alert.

check_pctgroup.py

Code: Select all

import requests, sys, argparse, os, json, yaml

#DEAL WITH THE SELF SIGNED NAGIOS SSL
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

#NAGIOSXI PLUGIN TO ALERT WHEN X PERCENT OF A HOSTGROUP ARE IN A DOWN STATE
#SNAPIER

#SCRIPT DEFINITION
cname = "check_pctgroup"
cversion = "0.0.1"
cpath = os.path.dirname(os.path.realpath(__file__))

##NAGIOSXI DIRECT API CALL
def nagiosxiGenericAPI(resource,endpoint,modifier,method,myurl,mykey):
    
    #URL FOR APICALL TO NAGIOSXI
    url = ("https://{turl}/nagiosxi/api/v1/{resource}/{endpoint}?{modifier}&apikey={akey}".format(turl=myurl,akey=mykey,resource=resource,endpoint=endpoint,modifier=modifier)) 

    #ONLY ALLOW FOR USE OF GET IN THIS INSTANCE
    if method == "get":
        try:
            r = requests.get(url=url,verify=False)
        except Exception as e:
            print("ERROR: %s" % e)
            r = False
    else:
        r = False
    return r


##CREDENTIALS USED TO GATHER DATA VIA THE NAGIOSXI API
#PRO TIP: A UNIFIED YML CAN BE USED BY MULTIPLE PLUGINS
def nagiosxiAPICreds(meta):
    env = meta.nenv
    with open(cpath+"/check_pctgroup.yaml", "r") as yamlfile:
        try:
            data = yaml.safe_load(yamlfile)
            r = {"url":data[0]["nagios"][env]["url"],"apikey":data[0]["nagios"][env]["apikey"]}
        except Exception as e:
            print("ERROR: %s" % e)
            r = False
        finally:
            return r

#STATE FROM STATEID
def checkStateFromCode(i):
    switcher = {
        0: "OK",
        1: "WARNING",
        2: "CRITICAL",
        3: "UNKNOWN"
    }

    #GIVE THE STATE BACK
    return switcher.get(i)

#NAGIOS EXIT
def nagExit(stateid,msg):
    #ENRICH IF NEEDED
    print(msg)
    #EXIT WITH THE STATEID
    sys.exit(stateid)


if __name__ == "__main__" :
    
    #INPUT FROM NAGIOS
    args = argparse.ArgumentParser(prog=cname+"v:"+cversion, formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    #NAGIOSXI TARGET
    args.add_argument(
        "-e","--nenv",
        required=True,
        default=None,
        help="String(nagiosenvironment): NagiosXI Instance definition stored in the yml.(dev,prd)"
    )
    #HOSTGROUP
    #SINGLE GROUP
    args.add_argument(
        "--hostgroup",
        required=True,
        default=None,
        help="String(hostgroup): NagiosXI hostgroup to evaluate."
    )
    args.add_argument(
        "-w", "--warning",
        required=False,
        default=None,
        help="String(warning): NagiosXI Warning Value"
    )
    args.add_argument(
        "-c","--critical",
        required=True,
        default=None,
        help="String(critical): NagiosXI Critical Value"
    )
    args.add_argument(
        "-t","--timeout",
        required=False,
        default='30',
        help="int(timeout): NagiosXI check timeout value."
    )
    args.add_argument(
        "-p", "--perfdata",
        required = False,
        action = "store_true",
        help="boolean(perfdata): Include NagiosXI perfdata in check output msg if enabled."
    )

    #PARSE ARGS
    meta = args.parse_args()

    #THE CHECK BODY
    try:
        #COLLECT THE DATA
        ##NAGIOS API CREDS
        auth = nagiosxiAPICreds(meta)

        ##GET THE HOSTGROUPMEMBERS FOR THE TARGET GROUP
        modhg = "&hostgroup_name={}".format(meta.hostgroup)
        hostgm = nagiosxiGenericAPI("objects","hostgroupmembers",modhg,"get",auth["url"],auth["apikey"])
        hd = hostgm.json()

        ##BUILD THE LIST 
        memlst = list()
        totalhost = 0
        members = hd["hostgroup"][0]["members"]['host']
        for i in members:
            memlst.append(i["host_name"])
            totalhost += 1
        
        ##GET STATUS OF the LIST OF HOSTGROUP MEMBERS
        nhl = ','.join(memlst)
        modhgm = "&host_name=in:{}&current_state=1".format(nhl)
        hoststats = nagiosxiGenericAPI("objects","hoststatus",modhgm,"get",auth["url"],auth["apikey"])
        stats = hoststats.json()

        ##GET THE PERCENTAGE OF DOWN HOSTS
        dwn = (float(stats["recordcount"]) / totalhost * 100)
        
        ##EVALUATE THE RETURNED DATA
        ###FIRST IS WORSE
        if(int(dwn) >= int(meta.critical)):
            stateid = 2
            state = checkStateFromCode(stateid)
            msg = ('{} - Hostgroup {} has {}% members down.'.format(state,meta.hostgroup,dwn))
            
        ###WARNING SHOULD BE OPTIONAL SO HERE WE ONLY PROCESS FOR WARNING IF PRESENT
        elif meta.warning and ((int(dwn) < int(meta.critical)) and (int(dwn) >= int(meta.warning))):
            stateid = 1
            state = checkStateFromCode(stateid)
            msg = ('{} - Hostgroup {} has {}% members down.'.format(state,meta.hostgroup,dwn))

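        ###NOT WARNING NOT CRITICAL IT'S OK
        else:
            stateid = 0
            state = checkStateFromCode(stateid)
            #ONLY CLAIM ALL UP WHEN NOTHING IS DOWN
            if int(stats["recordcount"]) == 0:
                msg = ('{} - All {} members of {} are UP.'.format(state,totalhost,meta.hostgroup))
            else:
                msg = ('{} - Hostgroup {} has {}% members down.'.format(state,meta.hostgroup,dwn))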
        
        ###NOT EVERYONE WANTS PERFDATA (WHY?)
        if meta.perfdata:
            if meta.warning is not None:
                wrn = meta.warning
            else:
                wrn = ""
            perfdata = (' | group-down-percent={}%;{};{}; group-total-count={}; group-down-count={};'.format(dwn,wrn,meta.critical,totalhost,stats["recordcount"]))
            msg = msg + perfdata
    
    #UNKNOWNS SERVE A PURPOSE (USE THEM WISELY)
    except Exception as e:
        stateid = 3
        state = checkStateFromCode(stateid)
        msg = "{} - {}".format(state, e)
    
    #IT'S ALL ABOUT THE EXIT
    finally:            
        nagExit(stateid,msg)

Create the check_pctgroup.yaml file and put the file in the same directory as the script.

Code: Select all

- nagios:
    dev:
      apikey: <your-api-key>
      url: <fqdn/ip>
    prd:
      apikey: <your-api-key>
      url: <fqdn/ip>

Command line

Code: Select all

python3 check_pctgroup.py -e dev --hostgroup "<hostgroupname>" -c (int|required) -w (int|optional)  -p (optional)

Check Results

Code: Select all

OK - All 1 members of dev-linux-web are UP. | group-down-percent=0.0%;;10; group-total-count=1; group-down-count=0;
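For anyone consuming this output in another script, the part after the `|` is perfdata in the usual `label=value[UOM];warn;crit` layout. The following is a simplified sketch for pulling it apart; `parse_perfdata` is a made-up helper, and it assumes labels contain no spaces or quotes, which holds for the labels this plugin emits.

```python
import re

def parse_perfdata(plugin_output):
    """Return {label: (value, uom, warn, crit)} from a plugin output line.

    Simplified: labels must not contain spaces or quotes.
    """
    if "|" not in plugin_output:
        return {}
    perf = plugin_output.split("|", 1)[1]
    metrics = {}
    for item in perf.split():
        label, _, rest = item.partition("=")
        fields = rest.split(";")
        # First field is the number with an optional unit, e.g. "0.0%".
        m = re.match(r"^(-?\d+(?:\.\d+)?)(.*)$", fields[0])
        if not m:
            continue
        warn = fields[1] if len(fields) > 1 else ""
        crit = fields[2] if len(fields) > 2 else ""
        metrics[label] = (float(m.group(1)), m.group(2), warn, crit)
    return metrics

output = ("OK - All 1 members of dev-linux-web are UP. | "
          "group-down-percent=0.0%;;10; group-total-count=1; group-down-count=0;")
for label, fields in parse_perfdata(output).items():
    print(label, fields)
```

The empty warn field for group-down-percent reflects that `-w` was not passed in that run.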

I'll throw it up on github later this week.

Happy Monitoring!
--SN

snapier3 · Post by **snapier3** » Thu Sep 12, 2024 11:22 am

I put the plugin up on GitHub.
(I did change the exit message a bit)
https://github.com/SNapier/check_pctgroup

Nagios Service Config

nagios-service-config.PNG

NagiosXI Service - OK

nagios-service-exit.PNG

NagiosXI Service - CRITICAL

nagios-service-exit-crit.PNG

NagiosXI Perfdata

nagios-service-perfdata.PNG
