Alarm for % (or count) of Hosts down?
Posted: Mon Sep 09, 2024 11:23 am
by gregbeyer
Is it possible to create a monitor that watches for a percentage of nodes going down in a hostgroup? In high-performance grid computing, with all its redundancy, we don't care that one or two nodes, or even a dozen in a large cluster, go down; that happens on a daily basis. But if a significant percentage of a cluster goes down (we have clusters in HGs), that indicates trouble, and we want to know.
I've explored BPI some, but our clusters are hundreds or thousands of nodes. 1) It's not practical to add each node to a cluster watch, and 2) no single node is critical anyway, so I don't think BPI works for this scenario.
I think what I want is a synthetic monitor; that's what Grafana calls this. Thanks for any ideas.
Re: Alarm for % (or count) of Hosts down?
Posted: Mon Sep 09, 2024 4:18 pm
by DoubleDoubleA
Hi @gregbeyer,
I think you might find BPI is exactly what you need.
https://assets.nagios.com/downloads/nag ... BPI_v2.pdf
You should be able to add from the Hostgroup tab. You can have a BPI check that is just your cluster, and you can set it to some threshold of health, like we're ok as long as 70% of hosts are up.
Maybe take another look and reference the doc there, and check back if it still doesn't work like you'd hoped.
Aaron
Re: Alarm for % (or count) of Hosts down?
Posted: Mon Sep 09, 2024 4:28 pm
by bbahn
Hello @gregbeyer,
If you add your hostgroups/servicegroups to BPI, you can hit edit on the groups in question and then set a threshold for each hostgroup/servicegroup.
If you have too many hostgroups/servicegroups and don't want to do this by hand, sync your hostgroups/servicegroups with BPI and then click the settings icon in BPI (the cog). You will see a BPI configuration file (default /usr/local/nagiosxi/etc/components/bpi.conf). You can use a sed command or whatever script you prefer to mass-edit this file, changing the config items from defaults like the following:
Code: Select all
define hg_linux-servers {
title=HG: linux-servers
desc=
primary=1
info=
members=localhost;NULL;&,
warning_threshold=0
critical_threshold=0
priority=0
type=hostgroup
auth_users=
}
to whatever warning and critical thresholds you would like (warning_threshold=50 means 50%).
Note that these are health thresholds and the critical value should be lower than the warning value.
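A minimal sketch of the mass-edit described above, in Python rather than sed. It assumes the file consists of `define <name> { ... }` blocks shaped like the sample in this post; the function name `set_bpi_thresholds` and the threshold values are made up for illustration, and the critical-below-warning check simply mirrors the note above.

```python
import re

def set_bpi_thresholds(conf_text, warning, critical, group_type="hostgroup"):
    """Return conf_text with thresholds rewritten in blocks of the given type."""
    # These are health thresholds, so critical is expected to be the lower value.
    if critical >= warning:
        raise ValueError("critical must be lower than warning")

    def rewrite(match):
        block = match.group(0)
        # Leave blocks of any other type (e.g. servicegroup) untouched.
        if not re.search(r"^\s*type=%s\s*$" % re.escape(group_type), block, re.M):
            return block
        block = re.sub(r"^(\s*warning_threshold=).*$",
                       r"\g<1>%d" % warning, block, flags=re.M)
        block = re.sub(r"^(\s*critical_threshold=).*$",
                       r"\g<1>%d" % critical, block, flags=re.M)
        return block

    # A block runs from "define <name> {" to the next line that starts with "}".
    return re.sub(r"^define\s+\S+\s*\{.*?^\}", rewrite, conf_text,
                  flags=re.M | re.S)

# Sample block in the same shape as the one shown in this post.
SAMPLE = """define hg_linux-servers {
title=HG: linux-servers
members=localhost;NULL;&,
warning_threshold=0
critical_threshold=0
type=hostgroup
}
"""

print(set_bpi_thresholds(SAMPLE, warning=90, critical=70))
```

To use it on a real system you would read the bpi.conf contents into a string, pass them through the function, and write the result back; keeping a backup copy of the original file first is a sensible precaution.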
Re: Alarm for % (or count) of Hosts down?
Posted: Tue Sep 10, 2024 12:03 am
by kg2857
If BPI isn't what you want, you can write a plugin (script) that runs a few SQL queries, does a bit of math, and raises an alert.
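A rough sketch of the "SQL queries plus a bit of math" idea. The table `host_status` and its columns are hypothetical and exist only in an in-memory SQLite database for this example; this is not the Nagios XI database layout, so a real plugin would need queries written against the actual schema. State 1 is treated as DOWN.

```python
import sqlite3

# HYPOTHETICAL schema for illustration only; not the Nagios XI database layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE host_status (hostgroup TEXT, host_name TEXT, current_state INTEGER)"
)
# 20 fake nodes in one group; the first three are marked DOWN (state 1).
rows = [("cluster-a", "node%04d" % i, 1 if i < 3 else 0) for i in range(20)]
conn.executemany("INSERT INTO host_status VALUES (?, ?, ?)", rows)

def percent_down(conn, hostgroup):
    """Return (down, total, percent_down) for one hostgroup."""
    total, down = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(current_state = 1), 0) "
        "FROM host_status WHERE hostgroup = ?",
        (hostgroup,),
    ).fetchone()
    if total == 0:
        raise ValueError("hostgroup %r has no members" % hostgroup)
    return down, total, 100.0 * down / total

down, total, pct = percent_down(conn, "cluster-a")
print("%d of %d hosts down (%.1f%%)" % (down, total, pct))
```

The percentage can then be compared against warning and critical values and turned into a plugin exit code.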
Re: Alarm for % (or count) of Hosts down?
Posted: Tue Sep 10, 2024 11:04 am
by gregbeyer
After taking another look at BPI, and with @DoubleDoubleA's suggestion on how to use the HG, I tried it with one of my smaller clusters. Adding the HG actually added each node in the HG, from which I could also select High Priority hosts. I set a percent for warn and critical. You're right, BPI works.
So I did the same with a larger cluster with 1400 nodes in the HG. Unfortunately, BPI seems to be hung. It's been clocking since yesterday (see screenshot). None of the previously created BPI groups, nodes, HGs, or SGs will list, and I get the same spinner regardless of which tab I select. I attempted to resolve it by cycling first my browser, then nagios, mysql, and httpd. When I re-open XI and go to BPI, the spinner is still going. All other functions of XI seem fine.
What hosed BPI? Surely it can handle a 1400 member HG? How do I kill the clocking?
Re: Alarm for % (or count) of Hosts down?
Posted: Tue Sep 10, 2024 11:25 am
by bbahn
Hello @gregbeyer,
Can you check your /usr/local/nagiosxi/var/cmdsubsys.log? That should tell you what's going on.
Re: Alarm for % (or count) of Hosts down?
Posted: Tue Sep 10, 2024 11:39 am
by gregbeyer
grep'd the file for "bpi":
Mon, 09 Sep 2024 12:43:36 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:09:20 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:10:06 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:10:52 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:11:30 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 15:18:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 17:04:01 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 17:04:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Tue, 10 Sep 2024 01:16:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Tue, 10 Sep 2024 11:53:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 262144 bytes) in /usr/local/nagiosxi/html/includes/components/nagiosbpi/classes/BpGroup_class.php on line 0
Re: Alarm for % (or count) of Hosts down?
Posted: Tue Sep 10, 2024 4:06 pm
by jmichaelson
Hi Greg. Right now I'd try increasing the PHP memory limit past the default 1GB. Edit the memory_limit line in the php.ini file (its location is distribution-dependent), then restart php-fpm on EL distros or apache2 on Debian/Ubuntu distros.
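As a quick sanity check on the numbers, here is a small sketch that pulls the byte count out of a fatal-error line like the one posted above and converts it to MiB, so you can see what the current limit is before raising it. The function name is made up for this example, and the new memory_limit value is a judgment call for your system.

```python
import re

FATAL_RE = re.compile(r"Allowed memory size of (\d+) bytes exhausted")

def exhausted_limit_mib(log_line):
    """Return the PHP memory limit named in a fatal-error line, in MiB, or None."""
    match = FATAL_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)) / (1024 * 1024)

line = ("PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted "
        "(tried to allocate 262144 bytes)")
print(exhausted_limit_mib(line))  # 1024.0
```

So the limit that was hit is 1024 MiB, and the memory_limit setting would need to be above that for the sync to get further.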
Re: Alarm for % (or count) of Hosts down?
Posted: Wed Sep 11, 2024 2:08 pm
by snapier3
Reading this post I thought a check of this sort would be a good idea.
I put together a basic Python plugin that gathers the hostgroup membership info from the API and checks for down hosts in the group.
The plugin computes the percentage of hosts down in the group and evaluates it against the thresholds you provide to decide whether to alert.
check_pctgroup.py
Code: Select all
import requests, sys, argparse, os, json, yaml
#DEAL WITH THE SELF SIGNED NAGIOS SSL
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

#NAGIOSXI PLUGIN TO ALERT WHEN X PERCENT OF A HOSTGROUP ARE IN A DOWN STATE
#SNAPIER

#SCRIPT DEFINITION
cname = "check_pctgroup"
cversion = "0.0.1"
cpath = os.path.dirname(os.path.realpath(__file__))

##NAGIOSXI DIRECT API CALL
def nagiosxiGenericAPI(resource,endpoint,modifier,method,myurl,mykey):
    #URL FOR APICALL TO NAGIOSXI
    url = ("https://{turl}/nagiosxi/api/v1/{resource}/{endpoint}?{modifier}&apikey={akey}".format(turl=myurl,akey=mykey,resource=resource,endpoint=endpoint,modifier=modifier))
    #ONLY ALLOW FOR USE OF GET IN THIS INSTANCE
    if method == "get":
        try:
            r = requests.get(url=url,verify=False)
        except Exception as e:
            print("ERROR: %s" % e)
            r = False
    else:
        r = False
    return r

##CREDENTIALS USED TO GATHER DATA VIA THE NAGIOSXI API
#PRO TIP: A UNIFIED YML CAN BE USED BY MULTIPLE PLUGINS
def nagiosxiAPICreds(meta):
    env = meta.nenv
    with open(cpath+"/check_pctgroup.yaml", "r") as yamlfile:
        try:
            data = yaml.safe_load(yamlfile)
            r = {"url":data[0]["nagios"][env]["url"],"apikey":data[0]["nagios"][env]["apikey"]}
        except Exception as e:
            print("ERROR: %s" % e)
            r = False
    return r

#STATE FROM STATEID
def checkStateFromCode(i):
    switcher = {
        0: "OK",
        1: "WARNING",
        2: "CRITICAL",
        3: "UNKNOWN"
    }
    #GIVE THE STATE BACK
    return switcher.get(i)

#NAGIOS EXIT
def nagExit(stateid,msg):
    #ENRICH IF NEEDED
    print(msg)
    #EXIT WITH THE STATEID
    sys.exit(stateid)

if __name__ == "__main__" :
    #INPUT FROM NAGIOS
    args = argparse.ArgumentParser(prog=cname+"v:"+cversion, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    #NAGIOSXI TARGET
    args.add_argument(
        "-e","--nenv",
        required=True,
        default=None,
        help="String(nagiosenvironment): NagiosXI Instance definition stored in the yml.(dev,prd)"
    )
    #HOSTGROUP
    #SINGLE GROUP
    args.add_argument(
        "--hostgroup",
        required=True,
        default=None,
        help="String(hostgroup): NagiosXI hostgroup to evaluate."
    )
    args.add_argument(
        "-w", "--warning",
        required=False,
        default=None,
        help="String(warning): NagiosXI Warning Value"
    )
    args.add_argument(
        "-c","--critical",
        required=True,
        default=None,
        help="String(critical): NagiosXI Critical Value"
    )
    args.add_argument(
        "-t","--timeout",
        required=False,
        default='30',
        help="int(timeout): NagiosXI check timeout value."
    )
    args.add_argument(
        "-p", "--perfdata",
        required = False,
        action = "store_true",
        help="boolean(perfdata): Include NagiosXI perfdata in check output msg if enabled."
    )
    #PARSE ARGS
    meta = args.parse_args()

    #THE CHECK BODY
    try:
        #COLLECT THE DATA
        ##NAGIOS API CREDS
        auth = nagiosxiAPICreds(meta)
        ##GET THE HOSTGROUPMEMBERS FOR THE TARGET GROUP
        modhg = "&hostgroup_name={}".format(meta.hostgroup)
        hostgm = nagiosxiGenericAPI("objects","hostgroupmembers",modhg,"get",auth["url"],auth["apikey"])
        hd = hostgm.json()
        ##BUILD THE LIST
        memlst = list()
        totalhost = 0
        members = hd["hostgroup"][0]["members"]['host']
        for i in members:
            memlst.append(i["host_name"])
            totalhost += 1
        ##GET STATUS OF THE LIST OF HOSTGROUP MEMBERS (CURRENT_STATE=1 IS DOWN)
        nhl = ','.join(memlst)
        modhgm = "&host_name=in:{}&current_state=1".format(nhl)
        hoststats = nagiosxiGenericAPI("objects","hoststatus",modhgm,"get",auth["url"],auth["apikey"])
        stats = hoststats.json()
        ##GET THE PERCENTAGE OF DOWN HOSTS
        dwn = (float(stats["recordcount"]) / totalhost * 100)
        ##EVALUATE THE RETURNED DATA
        ###FIRST IS WORSE
        if(int(dwn) >= int(meta.critical)):
            stateid = 2
            state = checkStateFromCode(stateid)
            msg = ('{} - Hostgroup {} has {}% members down.'.format(state,meta.hostgroup,dwn))
        ###WARNING SHOULD BE OPTIONAL SO HERE WE ONLY PROCESS FOR WARNING IF PRESENT
        elif meta.warning and ((int(dwn) < int(meta.critical)) and (int(dwn) >= int(meta.warning))):
            stateid = 1
            state = checkStateFromCode(stateid)
            msg = ('{} - Hostgroup {} has {}% members down.'.format(state,meta.hostgroup,dwn))
        ###NOT WARNING NOT CRITICAL IT'S OK
        else:
            stateid = 0
            state = checkStateFromCode(stateid)
            msg = ('{} - All {} members of {} are UP.'.format(state,totalhost,meta.hostgroup))
        ###NOT EVERYONE WANTS PERFDATA (WHY?)
        if meta.perfdata:
            if meta.warning:
                wrn = meta.warning
            else:
                wrn = ""
            perfdata = (' | group-down-percent={}%;{};{}; group-total-count={}; group-down-count={};'.format(dwn,wrn,meta.critical,totalhost,stats["recordcount"]))
            msg = msg + perfdata
    #UNKNOWNS SERVE A PURPOSE (USE THEM WISELY)
    except Exception as e:
        stateid = 3
        state = checkStateFromCode(stateid)
        msg = '{} - {}'.format(state,e)
    #IT'S ALL ABOUT THE EXIT
    finally:
        nagExit(stateid,msg)
Create the check_pctgroup.yaml file and put it in the same directory as the script.
Code: Select all
- nagios:
    dev:
      apikey: <your-api-key>
      url: <fqdn/ip>
    prd:
      apikey: <your-api-key>
      url: <fqdn/ip>
Command line
Code: Select all
python3 check_pctgroup.py -e dev --hostgroup "<hostgroupname>" -c (int|required) -w (int|optional) -p (optional)
Check Results
Code: Select all
OK - All 1 members of dev-linux-web are UP. | group-down-percent=0.0%;;10; group-total-count=1; group-down-count=0;
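If you want to try threshold values before wiring the check into a service, the comparison logic can be pulled out into a pure function that needs no API access. This is only a sketch that follows the same order as the plugin above (critical first, warning only if given, percentage truncated with int() before comparing); the function name `evaluate` and the sample numbers are made up for illustration.

```python
def evaluate(down, total, critical, warning=None):
    """Return (stateid, percent_down) using critical-first comparison."""
    if total <= 0:
        return 3, None  # UNKNOWN: nothing to evaluate in an empty group
    pct = 100.0 * down / total
    # int() truncates, so 9.9% is compared as 9, matching the plugin's behavior.
    if int(pct) >= critical:
        return 2, pct
    if warning is not None and int(pct) >= warning:
        return 1, pct
    return 0, pct

# 140 of 1400 nodes down is 10%: WARNING at -w 10, not yet CRITICAL at -c 20.
print(evaluate(140, 1400, critical=20, warning=10))
# 300 of 1400 nodes down is about 21.4%: CRITICAL.
print(evaluate(300, 1400, critical=20, warning=10))
```

Because the percentage is truncated before comparison, a group sitting just under a threshold stays in the lower state until the whole-number percentage reaches it.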
I'll throw it up on github later this week.
Happy Monitoring!
--SN
Re: Alarm for % (or count) of Hosts down?
Posted: Thu Sep 12, 2024 11:22 am
by snapier3
I put the plugin up on GitHub.
(I did change the exit message a bit)
https://github.com/SNapier/check_pctgroup
Nagios Service Config
nagios-service-config.PNG
NagiosXI Service -OK
nagios-service-exit.PNG
NagiosXI Service - CRITICAL
nagios-service-exit-crit.PNG
NagiosXI Perfdata
nagios-service-perfdata.PNG