Alarm for % (or count) of Hosts down?
Re: Alarm for % (or count) of Hosts down?
Hi @jmichaelson. I first upped memory_limit to 2048, no joy, then to 4096, and it's still clocking. I cycled all services, then checked cmdsubsys.log and found this:
Mon, 16 Sep 2024 14:24:01 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Tue, 17 Sep 2024 08:45:01 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
PHP Warning: unlink(/usr/local/nagiosxi/etc/components/bpi/66df2c1e8eb6c.conf): No such file or directory in /usr/local/nagiosxi/html/includes/components/nagiosbpi/functions/manage_config.inc.php on line 315
Tue, 17 Sep 2024 08:56:01 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
PHP Warning: unlink(/usr/local/nagiosxi/etc/components/bpi/66df2c1e9dd1f.conf): No such file or directory in /usr/local/nagiosxi/html/includes/components/nagiosbpi/functions/manage_config.inc.php on line 315
That doesn't correspond to when I changed the memory limit or to when I tried to create the new large BPI today (it's still clocking). But there it is.
BTW, BPI stopped clocking on its own sometime after my post last Tuesday. Maybe it timed out (a very long timeout). I tried, but couldn't do anything to kill it; even cycling php-fpm doesn't kill it. Now, attempting to create a new BPI, I hit the same issue, but with no mention of memory this time.
I haven't tried out your plugin yet, @snapier3. Thanks for that, looks nice!
Re: Alarm for % (or count) of Hosts down?
BPI is still clocking. How the heck do I kill it?? I stopped httpd, mysqld, nagios and php-fpm, closed browser. Started all, opened browser, cleared cache -- still clocking.
grep bpi /usr/local/nagiosxi/var/cmdsubsys.log:
Wed, 18 Sep 2024 14:33:05 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
PHP Fatal error: Allowed memory size of 4294967296 bytes exhausted (tried to allocate 262144 bytes) in /usr/local/nagiosxi/html/includes/components/nagiosbpi/classes/BpGroup_class.php on line 0
/etc/php.ini: memory_limit = 4096M
4096 isn't enough for BPI?
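For reference, the byte count in the fatal error matches the configured limit exactly, which suggests the 4096M setting was in effect for that run and was genuinely exhausted. A quick arithmetic check, as a sketch (the helper name `php_limit_to_bytes` is made up for illustration):

```python
def php_limit_to_bytes(value: str) -> int:
    """Convert a php.ini shorthand size such as '4096M' to bytes."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = value[-1].upper()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)  # no suffix: already a plain byte count


print(php_limit_to_bytes("4096M"))  # 4294967296, the figure in the log
```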
Re: Alarm for % (or count) of Hosts down?
Hi, @snapier3. Thanks very much for this great plugin! If they had a "homage" emoji, I'd use it. Here's a little feedback. I've installed it and configured a command and service for five hostgroups. For three of them it is working great! But two HGs return an error whether I run the command from the CLI or within XI:
$ python3 /usr/local/nagios/libexec/check_pctgroup.py -e prd --hostgroup "phoenix_cluster-hg" -c 5 -w 10 -p
Expecting value: line 1 column 1 (char 0)
$ python3 /usr/local/nagios/libexec/check_pctgroup.py -e prd --hostgroup "hive_cluster-hg" -c 5 -w 10 -p
Expecting value: line 1 column 1 (char 0)
So what's different between the HGs? The ones that work have 49, 122 and 146 members. The two failing are much larger: phoenix has 1,414 members, hive has 488. Just spitballing here -- I'm wondering if the check times out for the larger HGs? I don't have other HGs in between 146 and 488 members, so I don't know what the "break point" is.
I've spot-checked a few HGs smaller than 146 members -- all successful. All of the larger ones I checked, at 488 members and above, fail with the same error. I've got 46 HGs so I haven't checked them all, but I think I see a trend.
I cloned all services from my first successful one, just changing ARG2 for the HG and a managed host in the HG as appropriate, so I know I didn't fat finger something.
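For what it's worth, "Expecting value: line 1 column 1 (char 0)" is the message Python's json module raises when it is handed an empty or non-JSON string, so one possibility is that the plugin received an empty or error response for the larger groups. A minimal sketch of surfacing that more clearly; this is not the plugin's actual code, and `parse_api_body` is a hypothetical helper:

```python
import json


def parse_api_body(body: str):
    """Parse a response body, returning (data, error_message)."""
    try:
        return json.loads(body), None
    except json.JSONDecodeError as exc:
        # Show the start of the body so the real cause (empty reply,
        # HTML error page, etc.) is visible in the plugin output.
        msg = f"UNKNOWN - response was not JSON ({exc}); body starts with {body[:60]!r}"
        return None, msg


data, err = parse_api_body("")  # an empty body reproduces the error text
print(err)
```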
One little niggle with the output that I wonder if it could be tweaked? I think only like one decimal place of % rather than 15 is needed:
OK - Hostgroup FIREBIRD_CLUSTER-HG has 0.684931506849315% of 146 members down. (that's one node down)
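A sketch of how the output could be trimmed to one decimal place using a format specifier. The function and parameter names are illustrative assumptions, not taken from the plugin:

```python
def format_pct(down: int, total: int, places: int = 1) -> str:
    """Return the share of down members as a percentage string."""
    if total == 0:
        return f"{0:.{places}f}%"  # avoid dividing by zero on an empty group
    pct = 100.0 * down / total
    return f"{pct:.{places}f}%"


print(format_pct(1, 146))  # 0.7%
```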
Again, thanks for this plugin -- I'm sure other Nagios admins are going to find this useful. Also, the perf data is very useful, because an admin can easily see the waning/waxing of %available over time, which will be informative of events or issues that need to be examined. Not sure yet that even BPI will give this insight. Kudos!
Re: Alarm for % (or count) of Hosts down?
I wrote that pretty quickly, and to be honest I think the problem is the way I'm doing the status check for the group members. I'm passing them in as a list, so that many hosts is probably what's breaking the show as written.
I will need to make that more efficient in getting hostgroup members and status considering the group sizes.
I have a couple of ideas there, let me noodle on this a little bit and I'll make an update after I finish testing the new NCPA.
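One way the idea of not passing every host in a single list could look is to query in batches. This is only a sketch under assumptions: `fetch_status` stands in for whatever call the plugin makes, is assumed to return a `{hostname: state}` mapping, and a state of 0 is assumed to mean UP as in Nagios Core host states.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_down(
    hosts: List[str],
    fetch_status: Callable[[List[str]], Dict[str, int]],  # hypothetical lookup
    batch_size: int = 100,
) -> int:
    """Count hosts whose state is non-zero, querying in batches."""
    down = 0
    for batch in chunked(hosts, batch_size):
        statuses = fetch_status(batch)
        down += sum(1 for state in statuses.values() if state != 0)
    return down
```

With 1,414 members and a batch size of 100, this would make 15 smaller requests instead of one very large one.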
Re: Alarm for % (or count) of Hosts down?
@Gregbeyer
I made some tweaks to the plugin based on your feedback.
Hopefully this will improve the performance a little bit with the large group member counts.
Also fixed the float
The latest version is up on GitHub.
Re: Alarm for % (or count) of Hosts down?
Good morning @snapier3. Good news and not so good (wouldn't call it "bad").
Now, my two large hostgroups (clusters) of 1400 and 480 are returning results. Service status successfully counts the number of nodes in the HG. Yeah!
Buut, the % down isn't accurate. To wit, phoenix has 2 nodes down, yet the plugin reported it as 0.0% down.
So that HG at present has two of 1416 nodes down, which is a fraction of 0.0014, or about 0.14% down (I know, ee-gads!)
So I thought to adjust the rounding line in your code dwnpct = round(dwn, 2), increase it to 4, then 6, then 8 decimal places. Still reports 0.0% down, even if I go up to 15 decimals.
Same false report of 0.0% down on a smaller HG, too.
Knowing now how you rounded, I tried to inform myself on what could be causing this miscalculation, and found this resource. https://blog.finxter.com/5-best-ways-to ... in-python/ What do you think of these quantize, math or int examples? Or maybe in my ignorance of python, I'm making this harder than it needs to be.
Any-hoo, plugin's looking better but still has a bit of a bug.
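Since raising the rounding precision made no difference, the value is probably already zero before round() is called. A sketch of the calculation, with one guessed way it can go to zero; the names are illustrative and the floor-division line is only a hypothesis, not a reading of the plugin's code:

```python
def down_percent(down: int, total: int, places: int = 2) -> float:
    """Percentage of members down, rounded to `places` decimals."""
    if total == 0:
        return 0.0
    return round(100.0 * down / total, places)


print(down_percent(2, 1416))  # 0.14

# If the ratio is computed with floor division, it is already 0 and no
# amount of extra decimal places in round() will bring it back:
print(round(float(2 // 1416), 15))  # 0.0
```

Plain round() on a correctly computed float should be enough here; quantize or the math module would only matter for exact decimal arithmetic.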
Last edited by gregbeyer on Mon Sep 23, 2024 2:30 pm, edited 1 time in total.
Re: Alarm for % (or count) of Hosts down?
Thanks for the feedback!
I still had some whack in the script, tweaked it a little.