Alarm for % (or count) of Hosts down?

DoubleDoubleA · Post by **DoubleDoubleA** » Thu Sep 12, 2024 1:40 pm

Nice!

gregbeyer · Post by **gregbeyer** » Tue Sep 17, 2024 2:59 pm

Hi @jmichaelson. I upped memory_limit first to 2048 no joy, then 4096, still clocking. Cycled all services. Then checked cmdsubsys.log, found this:

Mon, 16 Sep 2024 14:24:01 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
Tue, 17 Sep 2024 08:45:01 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
PHP Warning: unlink(/usr/local/nagiosxi/etc/components/bpi/66df2c1e8eb6c.conf): No such file or directory in /usr/local/nagiosxi/html/includes/components/nagiosbpi/functions/manage_config.inc.php on line 315
Tue, 17 Sep 2024 08:56:01 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
PHP Warning: unlink(/usr/local/nagiosxi/etc/components/bpi/66df2c1e9dd1f.conf): No such file or directory in /usr/local/nagiosxi/html/includes/components/nagiosbpi/functions/manage_config.inc.php on line 315

That doesn't correspond to when I made the change to memory or when I attempted (it's still clocking) to create the new large BPI today.

But there it is.

BTW, BPI stopped clocking on its own sometime after my post last Tuesday. Maybe it timed out (a very long timeout). I tried, but couldn't do anything to kill it; even cycling php-fpm doesn't kill it. Attempting to create a new BPI gives the same issue, but with no mention of memory this time.

I haven't tried out your plugin yet, @snapier3. Thanks for that, looks nice!

gregbeyer · Post by **gregbeyer** » Thu Sep 19, 2024 9:50 am

BPI is still clocking. How the heck do I kill it?? I stopped httpd, mysqld, nagios and php-fpm, closed browser. Started all, opened browser, cleared cache -- still clocking.

grep bpi /usr/local/nagiosxi/var/cmdsubsys.log:

Wed, 18 Sep 2024 14:33:05 -0400 CMDLINE=php /usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php --cmd=syncall
PHP Fatal error: Allowed memory size of 4294967296 bytes exhausted (tried to allocate 262144 bytes) in /usr/local/nagiosxi/html/includes/components/nagiosbpi/classes/BpGroup_class.php on line 0

/etc/php.ini: memory_limit = 4096M

4096 isn't enough for BPI?
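As a quick sanity check on the numbers above: the 4294967296 bytes in the fatal error is exactly 4096M, so the raised limit does appear to have been in effect when the sync ran out of memory. A small sketch of that arithmetic (the helper name `php_shorthand_to_bytes` is made up for illustration and is not part of Nagios XI):

```python
def php_shorthand_to_bytes(value: str) -> int:
    """Convert a php.ini size such as '4096M' to bytes (K, M, G suffixes)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)  # no suffix means the value is already in bytes


limit_bytes = php_shorthand_to_bytes("4096M")
print(limit_bytes)                 # 4294967296
print(limit_bytes == 4294967296)   # True: matches the size in the fatal error
```

If that reading is right, raising memory_limit further may only postpone the failure rather than fix whatever makes the sync consume that much memory.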

gregbeyer · Post by **gregbeyer** » Thu Sep 19, 2024 4:27 pm

Hi, @snapier3. Thanks very much for this great plugin! If they had a "homage" emoji, I'd use it.

Here's a little feedback. I've installed the plugin and configured the command and a service for five hostgroups. For three of them it is working great! But two HGs return an error when I run the command from the CLI or within XI:

$ python3 /usr/local/nagios/libexec/check_pctgroup.py -e prd --hostgroup "phoenix_cluster-hg" -c 5 -w 10 -p
Expecting value: line 1 column 1 (char 0)

$ python3 /usr/local/nagios/libexec/check_pctgroup.py -e prd --hostgroup "hive_cluster-hg" -c 5 -w 10 -p
Expecting value: line 1 column 1 (char 0)

So what's different between the HGs? The ones that work have 49, 122 and 146 members. The two failing are much larger: phoenix has 1,414 members, hive has 488. Just spitballing here -- I'm wondering if the check times out for the larger HGs? I don't have other HGs in between 146 and 488 members, so I don't know what the "break point" is.

I've spot-checked a few HGs smaller than 146 members -- all successful. All of the larger ones I've checked, at 488 members and above, fail with the same error. I've got 46 HGs so I haven't checked them all, but I think I see a trend.
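For what it's worth, `Expecting value: line 1 column 1 (char 0)` is the message Python's `json` module raises when asked to parse an empty string or something that isn't JSON at all (an HTML error page, for instance). That would fit a request that failed or was cut short for the big hostgroups, though that is a guess without seeing the plugin's source. A minimal sketch of how a plugin could surface the real cause; `parse_api_body` is a hypothetical helper, not part of check_pctgroup.py:

```python
import json


def parse_api_body(body: str):
    """Parse a response body, raising a descriptive error if it isn't JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # Show the start of the body so an HTML error page or empty reply is obvious.
        preview = body[:80] if body else "<empty response>"
        raise ValueError(
            f"API did not return JSON ({exc}); body starts with: {preview!r}"
        ) from exc


# An empty body reproduces the exact message from the failing checks.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```

Printing the first part of the body alongside the error usually makes it clear whether the server sent nothing, an error page, or a truncated reply.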

I cloned all services from my first successful one, just changing ARG2 for the HG and a managed host in the HG as appropriate, so I know I didn't fat finger something.

One little niggle with the output that I wonder if it could be tweaked? I think only one decimal place of % is needed, rather than 15:

OK - Hostgroup FIREBIRD_CLUSTER-HG has 0.684931506849315% of 146 members down. (that's one node down)
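The one-decimal output asked for here can be had with a format specifier rather than by rounding the number itself. A minimal sketch, assuming the plugin has the down count and member count as integers (`format_down_message` is a hypothetical name, and the status is fixed at OK purely for illustration):

```python
def format_down_message(hostgroup: str, down: int, total: int) -> str:
    """Build a status line with the percentage shown to one decimal place."""
    # Guard against an empty hostgroup to avoid dividing by zero.
    pct_down = (down / total) * 100 if total else 0.0
    return f"OK - Hostgroup {hostgroup} has {pct_down:.1f}% of {total} members down."


print(format_down_message("FIREBIRD_CLUSTER-HG", 1, 146))
# OK - Hostgroup FIREBIRD_CLUSTER-HG has 0.7% of 146 members down.
```

Formatting only at print time keeps the full-precision value available for perf data and threshold comparisons.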

Again, thanks for this plugin -- I'm sure other Nagios admins are going to find this useful. Also, the perf data is very useful, because an admin can easily see the waning/waxing of % available over time, which will point to events or issues that need to be examined. Not sure yet that even BPI will give this insight. Kudos!

snapier3 · Post by **snapier3** » Thu Sep 19, 2024 10:23 pm

gregbeyer wrote:
> So what's different between the HGs? The ones that work have 49, 122 and 146 members. The two failing are much larger: phoenix has 1,414 members, hive has 488. Just spitballing here -- I'm wondering if the check times out for the larger HGs? I don't have other HGs in between 146 and 488 members, so I don't know what the "break point" is.

I wrote that pretty quickly, and to be honest I think the problem is the way I'm doing the status check for the group members. I'm passing them in as a list, so that many hosts is probably breaking the show as written.

I will need to make that more efficient in getting hostgroup members and status considering the group sizes.
I have a couple of ideas there, let me noodle on this a little bit and I'll make an update after I finish testing the new NCPA.
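One way to read the "passing it in a list" problem is that a single request naming hundreds of hosts gets too large. Purely as an illustration (the plugin's actual fix may well be different), the host list could be split into batches. Here `fetch_states` is a hypothetical callable that returns a host-to-state mapping for one batch, and the demo uses a fake in place of any real API call:

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_down(
    hosts: List[str],
    fetch_states: Callable[[List[str]], Dict[str, int]],
    batch_size: int = 100,
) -> int:
    """Count hosts that are not UP, querying status one batch at a time."""
    down = 0
    for batch in chunked(hosts, batch_size):
        states = fetch_states(batch)
        # Nagios host state codes: 0 = UP, 1 = DOWN, 2 = UNREACHABLE.
        # This sketch treats anything other than UP as down.
        down += sum(1 for state in states.values() if state != 0)
    return down


# Demo with a fake status source standing in for the real lookup.
hosts = [f"node{i:04d}" for i in range(1414)]
fake_down = {"node0007", "node1200"}


def fake_fetch(batch: List[str]) -> Dict[str, int]:
    return {host: (1 if host in fake_down else 0) for host in batch}


print(count_down(hosts, fake_fetch))  # 2
```

The batch size of 100 is arbitrary; the point is only that no single lookup grows with the size of the hostgroup.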

snapier3 · Post by **snapier3** » Fri Sep 20, 2024 2:52 pm

@Gregbeyer
I made some tweaks to the plugin based on your feedback.
Hopefully this will improve the performance a little bit with the large group member counts.
Also fixed the float

The latest version is up on GitHub.

gregbeyer · Post by **gregbeyer** » Mon Sep 23, 2024 11:18 am

Good morning @snapier3. Good news and not so good (wouldn't call it "bad").

Now, my two large hostgroups (clusters) of 1400 and 480 are returning results. Service status successfully counts the number of nodes in the HG. Yeah!

Buut, the % down isn't accurate, to wit, phoenix has 2 nodes down, yet the plugin reported as 0.0% down:

[Image: pctdown.png]

So that HG at present has two of 1416 nodes down, which is 0.0014 as a fraction, or about 0.14% down (I know, ee-gads!)

So I thought to adjust the rounding line in your code, dwnpct = round(dwn, 2), increasing it to 4, then 6, then 8 decimal places. It still reports 0.0% down, even if I go up to 15 decimals.

Same false report of 0.0% down on a smaller HG, too.

Knowing now how you rounded, I tried to inform myself on what could be causing this miscalculation, and found this resource. https://blog.finxter.com/5-best-ways-to ... in-python/ What do you think of these quantize, math or int examples? Or maybe in my ignorance of python, I'm making this harder than it needs to be.
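On the arithmetic: 2 of 1416 is a fraction of about 0.0014, which is roughly 0.14 percent once multiplied by 100, so rounding to two places should not by itself give 0.0. If raising the digits passed to `round` changes nothing, the value is probably already zero before it is rounded, for example from a floor division or from a down count that came back as 0. That is a guess without seeing the script. The sketch below only shows the difference, plus the `quantize` route mentioned above:

```python
from decimal import Decimal, ROUND_HALF_UP

down, total = 2, 1416

# True division keeps the fraction, so two decimal places are enough.
pct = down / total * 100
print(round(pct, 2))        # 0.14

# Floor division throws the fraction away before rounding ever happens,
# so no number of decimal places can bring it back.
floored = down // total * 100
print(round(floored, 15))   # 0

# Decimal.quantize gives the same two-place result with explicit rounding rules.
exact = (Decimal(down) / Decimal(total) * 100).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP
)
print(exact)                # 0.14
```

For a status line, plain `round` or a format specifier is normally enough; `quantize` mostly matters when the exact rounding mode does.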

Any-hoo, plugin's looking better but still has a bit of a bug.

snapier3 · Post by **snapier3** » Mon Sep 23, 2024 1:55 pm

Thanks for the feedback!
I still had some whack in the script, tweaked it a little.

Nagios Support Forum
