Create alert when two VM hosts are online simultaneously

Post by **rootintootin** » Mon Jun 15, 2020 10:39 am

TLDR: Trying to create some service or command that would alert if check_ping simultaneously returned a critical or OK on two different VMs running from two different hypervisors. Is this doable in Nagios? If so where should I start?

I admin a testbed environment where a single controller VM handles much of the hardware inside. We're about to start testing a new controller VM, but we never want both controllers active simultaneously. I was asked whether my current Nagios setup could send out an alert if both controller VMs are ever online at the same time. Each of these controllers is a Linux KVM VM (Debian 8 and Debian 10), and they run on two separate Linux hypervisors. Both VMs are on a separate LAN (192.168.60) from our Nagios host (192.168.4). We generally SSH in by proxying through a bastion host, and iptables on the controllers is locked down so SSH is only allowed from the hypervisor (this iptables lockdown is not an especially desirable feature in this scenario, but it's done to emulate our real-world environment). I bring this up because I think installing NRPE on the controller VMs is not viable, for both iptables and policy reasons. However, ping (by IP) works for both hypervisors and both controllers from the bastion host.

My main idea, which I think would be the simplest, assumes that I can put some kind of logic behind an NRPE service definition: an NRPE command on our bastion host would run check_ping against both controller VMs, with logic that is effectively the inverse of an exclusive or, so an alert is generated if both VMs return check_ping OK or both VMs return check_ping CRITICAL. Is this doable in Nagios? I've been googling around but I haven't found many hits on raising a critical (or more specifically an alert message) only when two checks return critical (or, ideally, both critical OR both OK).

Another possible route is some kind of virsh NRPE plugin, which I imagine I might have to write myself. The major issue with this is that our hypervisors are in a specific state that matches our real-world environment, and I would absolutely not want to install anything on them unless there was no alternative (and then this whole idea might just get scrapped). And, of course, the NRPE port (5666) is not open on these VMs, and it would be a significant feat to convince management to alter the configuration away from our real-world setup for any reason.

Any advice, recommendations, or further reading would be greatly appreciated. Thanks so much!

Post by **rootintootin** » Tue Jun 16, 2020 8:58 am

As an update, I just wrote a bash script to be executed as an NRPE plugin. It works at the command line but always returns the same value when run from check_nrpe. The script uses an initial if check with ping to assign a binary value for whether each host is online, then uses a second set of nested if statements to return both the exit value for critical/OK/unknown and the status information shown on the Nagios webpage.

This if check works at the command line but if I use check_nrpe -H <NRPEHost> -c "controller_check" both VMs are always set to 0 and, as such, the if statement below always returns "CRITICAL -- Both testbed controllers are offline." Am I not allowed to use ping from check_nrpe? I also tried /usr/bin/ping just in case but it made no difference.

Code: Select all

if ping -c 1 -W 1 $VM1HOSTNAME; then
  VM1=1
else
  VM1=0
fi

if ping -c 1 -W 1 $VM2HOSTNAME; then
  VM2=1
else
  VM2=0
fi

Using just ping (or /usr/bin/ping), the values set by the if check above are always 0:

Code: Select all

./check_nrpe -H <NRPEHost> -c "check_testbed_controller_status"
VM1HOSTNAME = 0 VM2HOSTNAME = 0
CRITICAL - Both testbed controllers currently offline

What I've noticed is that if I use check_ping inside this plugin, the values come back correctly, but the status information that shows up on the Nagios webpage is just the first check_ping output line rather than the final echo statement. Is there a way to either A) use ping via check_nrpe inside this script or B) use check_ping, remove the "PING OK" lines, and only return the final line so the status information is properly updated in Nagios?

Code: Select all

./check_nrpe -H <NRPEHOST> -c "check_testbed_controller_status"
PING OK - Packet loss = 0%, RTA = 0.81 ms|rta=0.811000ms;10.000000;20.000000;0.000000 pl=0%;2;5;0
PING OK - Packet loss = 0%, RTA = 0.79 ms|rta=0.787000ms;10.000000;20.000000;0.000000 pl=0%;2;5;0
CRITICAL - Both testbed controllers currently online.

Post by **rootintootin** » Tue Jun 16, 2020 11:33 am

A final update/solution:

Turns out SELinux was the culprit for /usr/bin/ping not working (shocker). Rather than bother writing an SELinux policy to allow the NRPE daemon to execute /usr/bin/ping, I opted to use check_ping and just redirect its output to /dev/null. I initially tried to pipe the output through sed in the command.cfg definition

Code: Select all

command_line    $USER1$/check_nrpe -2 -H $HOSTADDRESS$ -c "check_testbed_controller_status" | sed \'\/^PING\/d\'

but this didn't seem to work (as I'm writing this I can't remember if I restarted nagios after updating the command definition...). Regardless, redirecting stdout to /dev/null should have been my first idea, and it worked just fine.

Final plugin used this logic:

Code: Select all

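# $NRPEPING holds the path to check_ping; its output is discarded so that
# only the plugin's final status line reaches Nagios.
if "$NRPEPING" -H "$VM1HOSTNAME" -w 10,2% -c 20,5% > /dev/null 2>&1; then
  VM1=1
else
  VM1=0
fi

if "$NRPEPING" -H "$VM2HOSTNAME" -w 10,2% -c 20,5% > /dev/null 2>&1; then
  VM2=1
else
  VM2=0
fi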
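
For anyone who finds this later, here is a rough sketch of how the whole plugin could fit together, including the second stage that turns the two flags into a single status line and exit code. This is not the exact script from the testbed. The default check_ping path, the idea of passing the two hostnames as arguments, the helper names host_up and controller_status, and the wording of the OK message are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only. The default path below is an assumption; adjust for your install.
NRPEPING="${NRPEPING:-/usr/lib64/nagios/plugins/check_ping}"

# Succeeds if check_ping reports the host as reachable. Output is discarded
# so only the final status line is sent back through NRPE.
host_up() {
  "$NRPEPING" -H "$1" -w 10,2% -c 20,5% > /dev/null 2>&1
}

# Takes the two flags (1 = online, 0 = offline), prints one status line and
# returns the Nagios exit code: 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
controller_status() {
  case "$1$2" in
    11) echo "CRITICAL - Both testbed controllers currently online."
        return 2 ;;
    00) echo "CRITICAL - Both testbed controllers currently offline."
        return 2 ;;
    10|01) echo "OK - Exactly one testbed controller online."  # assumed wording
        return 0 ;;
    *)  echo "UNKNOWN - Unexpected controller state."
        return 3 ;;
  esac
}

# Run the checks only when two hostnames are given (hypothetical interface).
if [ "$#" -eq 2 ]; then
  if host_up "$1"; then VM1=1; else VM1=0; fi
  if host_up "$2"; then VM2=1; else VM2=0; fi
  controller_status "$VM1" "$VM2"
  exit $?
fi
```

Keeping the decision in its own function means the alert logic can be exercised by hand (for example controller_status 1 1) without any ping traffic. A real plugin would probably also want to print a usage message and exit 3 when the hostnames are missing.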

Post by **cdienger** » Fri Jun 19, 2020 4:27 pm

Hi @rootintootin and welcome to the forums!

Thanks for sharing your progress. It seems like you have a solution, but I thought I'd point out the check_cluster plugin, since it may fit as well or help someone trying to do something similar:

https://assets.nagios.com/downloads/nag ... sters.html

The thresholds it can take are covered in:

https://www.nagios-plugins.org/doc/guid ... HOLDFORMAT

With something like this, it would be easy to set up one check to see if both hosts are in a non-OK state and another check to make sure that one is OK but the other is not:

Code: Select all

-c 1
-c 1:

The first one would return critical if both hosts are in a non-OK state, and the second would return critical if fewer than one host is in a non-OK state (that is, if both are OK).
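
To make the two thresholds a bit more concrete, here is a small bash sketch that mimics how the plugin threshold format treats "1" and "1:" when the value being tested is the number of non-OK hosts. It is not check_cluster itself and does not call it. The function name threshold_alerts is made up for this example, and it only handles these two threshold forms.

```shell
#!/bin/bash
# Succeeds (alert) when the value falls outside the threshold's range.
# Only two forms are handled in this sketch:
#   "N"   alerts when value < 0 or value > N
#   "N:"  alerts when value < N
threshold_alerts() {
  local value="$1" threshold="$2"
  case "$threshold" in
    *:) [ "$value" -lt "${threshold%:}" ] ;;
    *)  [ "$value" -lt 0 ] || [ "$value" -gt "$threshold" ] ;;
  esac
}

# Walk through every possible count of non-OK controllers (0, 1 or 2).
for non_ok in 0 1 2; do
  for t in "1" "1:"; do
    if threshold_alerts "$non_ok" "$t"; then state=CRITICAL; else state=OK; fi
    echo "non-OK hosts=$non_ok  -c $t  => $state"
  done
done
```

With two controllers, "-c 1" only goes critical at two non-OK hosts (both offline) and "-c 1:" only goes critical at zero non-OK hosts (both online), so the pair of checks covers both cases the original post asked about.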

Nagios Support Forum
