Alarm for % (or count) of Hosts down?
Is it possible to create a monitor to watch for a percent of nodes going down in a hostgroup? In high performance grid computing with all its redundancy, we don't care that one or two nodes, or even a dozen in a large cluster, go down; that happens on a daily basis. But if a significant percent of a cluster goes down (we have clusters in HGs), that indicates trouble, and we want to know.
I've explored BPI some, but our clusters are hundreds or thousands of nodes. 1) Not practical to add each node in a cluster watch, and 2) each node is not critical anyway, so I think BPI doesn't work for this scenario.
I think what I want is a synthetic monitor; that's what Grafana calls this. Thanks for any ideas.
- DoubleDoubleA
- Posts: 286
- Joined: Thu Feb 09, 2017 5:07 pm
Re: Alarm for % (or count) of Hosts down?
Hi @gregbeyer,
I think you might find BPI is exactly what you need. https://assets.nagios.com/downloads/nag ... BPI_v2.pdf
You should be able to add from the Hostgroup tab. You can have a BPI check that is just your cluster, and you can set it to some threshold of health, like we're ok as long as 70% of hosts are up.
Maybe take another look and reference the doc there, and check back if it still doesn't work like you'd hoped.
Aaron
Re: Alarm for % (or count) of Hosts down?
Hello @gregbeyer,
If you add your hostgroups/servicegroups to BPI and click edit on the groups in question, you can set a threshold for each hostgroup/servicegroup.
If you have too many hostgroups/servicegroups and don't want to do this by hand, sync your hostgroups/servicegroups with BPI and then click the settings icon in BPI (the cog). You will see a BPI configuration file (default /usr/local/nagiosxi/etc/components/bpi.conf). You can use a sed command or whatever script you prefer to mass-edit this file to change the config items from the defaults like the following:
Code: Select all
define hg_linux-servers {
title=HG: linux-servers
desc=
primary=1
info=
members=localhost;NULL;&,
warning_threshold=0
critical_threshold=0
priority=0
type=hostgroup
auth_users=
}
to whatever warning and critical thresholds you would like. (warning_threshold=50 means 50%)
Note that these are health thresholds and the critical value should be lower than the warning value.
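For anyone who prefers a script over sed, here is a minimal Python sketch of that mass-edit. It assumes the file looks like the block quoted above (one key=value per line inside define ... { } blocks, closing brace on its own line). The function names set_bpi_thresholds and _rewrite_line are made up for this example and are not part of Nagios XI. It works on text only, so you can try it on a copy of bpi.conf before touching the real file.

```python
import re

def _rewrite_line(line, warning, critical):
    # Only the two threshold keys are rewritten; every other line passes through unchanged.
    line = re.sub(r"^(\s*warning_threshold=)\d+",
                  lambda m: m.group(1) + str(warning), line)
    line = re.sub(r"^(\s*critical_threshold=)\d+",
                  lambda m: m.group(1) + str(critical), line)
    return line

def set_bpi_thresholds(conf_text, warning, critical, group_type="hostgroup"):
    """Return conf_text with new thresholds in every define block of group_type.

    Assumption: blocks start with a line beginning 'define ' and end with a
    line containing only '}', as in the snippet quoted in this thread.
    """
    # These are health thresholds, so critical must sit below warning.
    if not 0 <= critical < warning <= 100:
        raise ValueError("expected 0 <= critical < warning <= 100")
    out, block = [], None
    for line in conf_text.splitlines(keepends=True):
        if block is None:
            if line.lstrip().startswith("define "):
                block = [line]          # start collecting a block
            else:
                out.append(line)        # text outside any block
            continue
        block.append(line)
        if line.strip() == "}":
            if any(l.strip() == "type=" + group_type for l in block):
                block = [_rewrite_line(l, warning, critical) for l in block]
            out.extend(block)
            block = None
    if block is not None:
        out.extend(block)               # unterminated block: leave untouched
    return "".join(out)
```

To use it, read the config into a string, call set_bpi_thresholds(text, 70, 50), and write the result to a new file so you can diff it against the original before swapping it in.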
Actively advancing awesome answers with ardent alliteration, aptly addressing all ambiguities. Amplify your acumen and avail our amicable assistance. Eagerly awaiting your astute assessments of our advice.
Re: Alarm for % (or count) of Hosts down?
If BPI isn't what you want, you can write a plugin (script) that runs a few SQL queries, does a bit of math, and raises an alert.
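The "bit of math" part could look something like the sketch below. It only covers the threshold logic and leaves out the queries, since the database schema isn't described in this thread. The function name evaluate_down_percent is hypothetical. The return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
def evaluate_down_percent(down, total, warning, critical):
    """Map a down-host count to a (Nagios exit code, message) pair.

    warning and critical are percentages of hosts DOWN; warning may be None.
    """
    if total <= 0:
        # An empty group can't be evaluated, so report UNKNOWN rather than divide by zero.
        return 3, "UNKNOWN - hostgroup has no members"
    pct = down / total * 100.0
    # Check the worse state first.
    if pct >= critical:
        code, label = 2, "CRITICAL"
    elif warning is not None and pct >= warning:
        code, label = 1, "WARNING"
    else:
        code, label = 0, "OK"
    return code, "{} - {:.1f}% of {} hosts down".format(label, pct, total)
```

With a 1400-node cluster and thresholds of 5% and 10%, a dozen nodes down stays OK, while 200 down goes CRITICAL.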
Re: Alarm for % (or count) of Hosts down?
After taking another look at BPI and with @DoubleDoubleA's suggestion on how to use HG, I tried it with one of my smaller clusters. Adding the HG actually added each node in the HG, from which I could also select High Priority hosts. I set a percent for warn and critical. You're right, BPI works.
So I did the same with a larger cluster with 1400 nodes in the HG. Unfortunately, BPI seems to be hung. It's been clocking since yesterday, see shot. None of the previously created BPI groups, nodes, HGs or SGs will list. Same clocking regardless of which tab I select. I attempted to resolve it by cycling first my browser, then nagios, mysql, and httpd. When I re-open XI and go to BPI, I find the spinner still going. All other functions of XI seem fine.
What hosed BPI? Surely it can handle a 1400 member HG? How do I kill the clocking?
Re: Alarm for % (or count) of Hosts down?
Hello @gregbeyer,
Can you check your /usr/local/nagiosxi/var/cmdsubsys.log? That should tell you what's going on.
Re: Alarm for % (or count) of Hosts down?
grep'd the file for "bpi":
Mon, 09 Sep 2024 12:43:36 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:09:20 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:10:06 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:10:52 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 13:11:30 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 15:18:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 17:04:01 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Mon, 09 Sep 2024 17:04:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Tue, 10 Sep 2024 01:16:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Tue, 10 Sep 2024 11:53:09 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 262144 bytes) in /usr/local/nagiosxi/html/includes/components/nagiosbpi/classes/BpGroup_class.php on line 0
- jmichaelson
- Posts: 383
- Joined: Wed Aug 23, 2023 1:02 pm
Re: Alarm for % (or count) of Hosts down?
Hi Greg. Right now I'd try increasing the PHP memory limit past the default 1GB. Edit the memory_limit line in the php.ini file (its location is distribution-dependent), then restart php-fpm on EL distros, or apache2 on Debian/Ubuntu distros.
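If it helps when picking a new value, here is a small Python sketch that pulls the current limit out of the fatal error line and prints it in php.ini shorthand (K, M and G are powers of 1024). The function names are made up for this example, and how far to raise the limit is your call.

```python
import re

def parse_php_memory_error(line):
    """Return the byte limit from a PHP 'Allowed memory size' error, or None."""
    match = re.search(r"Allowed memory size of (\d+) bytes exhausted", line)
    return int(match.group(1)) if match else None

def to_php_shorthand(num_bytes):
    """Format a byte count the way memory_limit accepts it (e.g. 1G, 1536M)."""
    if num_bytes <= 0:
        return str(num_bytes)
    # Use the largest suffix that divides the value evenly.
    for suffix, size in (("G", 1024 ** 3), ("M", 1024 ** 2), ("K", 1024)):
        if num_bytes % size == 0:
            return "{}{}".format(num_bytes // size, suffix)
    return str(num_bytes)

line = ("PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted "
        "(tried to allocate 262144 bytes)")
current = parse_php_memory_error(line)
print("current limit:", to_php_shorthand(current))
```

Doubling the value is only one arbitrary starting point; to_php_shorthand(current * 2) gives the string to put on the memory_limit line in that case.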
Please let us know if you have any other questions or concerns.
-Jason
Re: Alarm for % (or count) of Hosts down?
Reading this post I thought a check of this sort would be a good idea.
I put a basic python plugin together to gather the hostgroup membership info from the API and check for down hosts in the group.
The plugin gives you a percentage of hosts down for the group and evaluates it against the thresholds you provide to alert.
check_pctgroup.py
Code: Select all
import argparse, os, sys
import requests, urllib3, yaml
#DEAL WITH THE SELF SIGNED NAGIOS SSL
from urllib3.exceptions import InsecureRequestWarning
urllib3.disable_warnings(InsecureRequestWarning)
#NAGIOSXI PLUGIN TO ALERT WHEN X PERCENT OF A HOSTGROUP ARE IN A DOWN STATE
#SNAPIER
#SCRIPT DEFINITION
cname = "check_pctgroup"
cversion = "0.0.1"
cpath = os.path.dirname(os.path.realpath(__file__))

##NAGIOSXI DIRECT API CALL
def nagiosxiGenericAPI(resource,endpoint,modifier,method,myurl,mykey,timeout=30):
    #URL FOR APICALL TO NAGIOSXI
    url = ("https://{turl}/nagiosxi/api/v1/{resource}/{endpoint}?{modifier}&apikey={akey}".format(turl=myurl,akey=mykey,resource=resource,endpoint=endpoint,modifier=modifier))
    #ONLY ALLOW FOR USE OF GET IN THIS INSTANCE
    if method != "get":
        raise ValueError("Unsupported method: {}".format(method))
    #REQUEST ERRORS PROPAGATE TO THE CHECK BODY AND EXIT AS UNKNOWN
    return requests.get(url=url,verify=False,timeout=timeout)

##CREDENTIALS USED TO GATHER DATA VIA THE NAGIOSXI API
#PRO TIP: A UNIFIED YML CAN BE USED FOR MULTIPLE PLUGINS
def nagiosxiAPICreds(meta):
    env = meta.nenv
    #A MISSING FILE OR KEY RAISES AND IS REPORTED AS UNKNOWN BY THE CHECK BODY
    with open(os.path.join(cpath,"check_pctgroup.yaml"), "r") as yamlfile:
        data = yaml.safe_load(yamlfile)
    return {"url":data[0]["nagios"][env]["url"],"apikey":data[0]["nagios"][env]["apikey"]}

#STATE FROM STATEID
def checkStateFromCode(i):
    switcher = {
        0: "OK",
        1: "WARNING",
        2: "CRITICAL",
        3: "UNKNOWN"
    }
    #GIVE THE STATE BACK
    return switcher.get(i)

#NAGIOS EXIT
def nagExit(stateid,msg):
    #ENRICH IF NEEDED
    print(msg)
    #EXIT WITH THE STATEID
    sys.exit(stateid)

if __name__ == "__main__":
    #INPUT FROM NAGIOS
    args = argparse.ArgumentParser(prog=cname+"v:"+cversion, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    #NAGIOSXI TARGET
    args.add_argument(
        "-e","--nenv",
        required=True,
        default=None,
        help="String(nagiosenvironment): NagiosXI Instance definition stored in the yml.(dev,prd)"
    )
    #HOSTGROUP
    #SINGLE GROUP
    args.add_argument(
        "--hostgroup",
        required=True,
        default=None,
        help="String(hostgroup): NagiosXI hostgroup to evaluate."
    )
    args.add_argument(
        "-w", "--warning",
        required=False,
        type=int,
        default=None,
        help="int(warning): NagiosXI Warning Value (percent of hosts down)"
    )
    args.add_argument(
        "-c","--critical",
        required=True,
        type=int,
        default=None,
        help="int(critical): NagiosXI Critical Value (percent of hosts down)"
    )
    args.add_argument(
        "-t","--timeout",
        required=False,
        type=int,
        default=30,
        help="int(timeout): NagiosXI check timeout value."
    )
    args.add_argument(
        "-p", "--perfdata",
        required=False,
        action="store_true",
        help="boolean(perfdata): Include NagiosXI perfdata in check output msg if enabled."
    )
    #PARSE ARGS
    meta = args.parse_args()
    #THE CHECK BODY
    try:
        #COLLECT THE DATA
        ##NAGIOS API CREDS
        auth = nagiosxiAPICreds(meta)
        ##GET THE HOSTGROUPMEMBERS FOR THE TARGET GROUP
        modhg = "hostgroup_name={}".format(meta.hostgroup)
        hostgm = nagiosxiGenericAPI("objects","hostgroupmembers",modhg,"get",auth["url"],auth["apikey"],meta.timeout)
        hd = hostgm.json()
        ##BUILD THE LIST
        members = hd["hostgroup"][0]["members"]["host"]
        memlst = [i["host_name"] for i in members]
        totalhost = len(memlst)
        ##GET STATUS OF THE LIST OF HOSTGROUP MEMBERS (current_state=1 IS DOWN)
        nhl = ','.join(memlst)
        modhgm = "host_name=in:{}&current_state=1".format(nhl)
        hoststats = nagiosxiGenericAPI("objects","hoststatus",modhgm,"get",auth["url"],auth["apikey"],meta.timeout)
        stats = hoststats.json()
        ##GET THE PERCENTAGE OF DOWN HOSTS
        downcount = int(stats["recordcount"])
        dwn = downcount / totalhost * 100
        ##EVALUATE THE RETURNED DATA
        ###FIRST IS WORSE
        if dwn >= meta.critical:
            stateid = 2
        ###WARNING SHOULD BE OPTIONAL SO HERE WE ONLY PROCESS FOR WARNING IF PRESENT
        elif meta.warning is not None and dwn >= meta.warning:
            stateid = 1
        ###NOT WARNING NOT CRITICAL IT'S OK
        else:
            stateid = 0
        state = checkStateFromCode(stateid)
        ###ONLY CLAIM ALL UP WHEN NOTHING IS DOWN
        if downcount == 0:
            msg = ('{} - All {} members of {} are UP.'.format(state,totalhost,meta.hostgroup))
        else:
            msg = ('{} - Hostgroup {} has {}% members down.'.format(state,meta.hostgroup,dwn))
        ###NOT EVERYONE WANTS PERFDATA (WHY?)
        if meta.perfdata:
            wrn = meta.warning if meta.warning is not None else ""
            perfdata = (' | group-down-percent={}%;{};{}; group-total-count={}; group-down-count={};'.format(dwn,wrn,meta.critical,totalhost,downcount))
            msg = msg + perfdata
    #UNKNOWNS SERVE A PURPOSE (USE THEM WISELY)
    except Exception as e:
        stateid = 3
        state = checkStateFromCode(stateid)
        msg = "{} - {}".format(state,e)
    #IT'S ALL ABOUT THE EXIT
    finally:
        nagExit(stateid,msg)
Create the check_pctgroup.yaml file and put it in the same directory as the script.
Code: Select all
- nagios:
    dev:
      apikey: <your-api-key>
      url: <fqdn/ip>
    prd:
      apikey: <your-api-key>
      url: <fqdn/ip>
Command line
Code: Select all
python3 check_pctgroup.py -e dev --hostgroup "<hostgroupname>" -c (int|required) -w (int|optional) -p (optional)
Check Results
Code: Select all
OK - All 1 members of dev-linux-web are UP. | group-down-percent=0.0%;;10; group-total-count=1; group-down-count=0;
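If you want to post-process that output somewhere else, the part after the pipe is standard Nagios perfdata (label=value[UOM];warn;crit). Below is a rough sketch of a parser for the line shown above. The function names are hypothetical, and it assumes labels without spaces and plain numeric thresholds, which is what this plugin emits. Range-style thresholds and quoted labels are not handled.

```python
import re

def _num(text):
    # Empty perfdata fields mean "not set".
    return float(text) if text else None

def parse_plugin_output(output):
    """Split plugin output into (message, {label: (value, uom, warn, crit)})."""
    message, _, perf = output.partition("|")
    metrics = {}
    for token in perf.split():
        label, sep, rest = token.partition("=")
        if not sep:
            continue
        # Pad so warn and crit exist even when the plugin omits them.
        fields = rest.split(";") + ["", ""]
        match = re.match(r"(-?\d+(?:\.\d+)?)(.*)", fields[0])
        if not match:
            continue
        metrics[label] = (float(match.group(1)), match.group(2),
                          _num(fields[1]), _num(fields[2]))
    return message.strip(), metrics

sample = ("OK - All 1 members of dev-linux-web are UP. | "
          "group-down-percent=0.0%;;10; group-total-count=1; group-down-count=0;")
message, metrics = parse_plugin_output(sample)
print(message)
print(metrics["group-down-percent"])
```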
I'll throw it up on GitHub later this week.
Happy Monitoring!
--SN
Re: Alarm for % (or count) of Hosts down?
I put the plugin up on GitHub.
(I did change the exit message a bit)
https://github.com/SNapier/check_pctgroup
Attachments: Nagios Service Config, NagiosXI Service - OK, NagiosXI Service - CRITICAL, NagiosXI Perfdata