monitoring with multiple monitors and reports to one
Posted: Thu Mar 24, 2016 1:02 pm
We're running into an issue that isn't likely to be fixed soon, or ever, which keeps us from reaching certain hosts within our own subnet. This happens at multiple sites. Basically, it means we have services/hosts reporting down when in reality they aren't, yet I can reach the same host/service from a DIFFERENT site just fine. It'd be nice to have multiple monitors check the services, so that if one says "hey, I see this as down" and another says "hey, I see this as up", the result reported to the single central point is up. Then if BOTH/ALL of them report a service/host as down, it is in fact down.
tl;dr: Site A can't check site A and site B can't check site B, but site B can check site A and vice versa. It would be nice to have multiple monitors report to one single instance that takes the positive result from any of them and, if all are negative, pages out, emails, etc.
Re: monitoring with multiple monitors and reports to one
Posted: Thu Mar 24, 2016 3:39 pm
by rkennedy
I don't think your post requires a TL;DR - not that bad!
To clarify, why are the hosts/services reporting down? Do you have a firewall or ACLs applied? Nagios will need a route directly to each host you'd like to check.
An alternative is to use check_nrpe at a remote site and have your checks run through there, pretty much using the agent as an intermediary.
You can combine this with BPI to gain insight from multiple locations, for example -
Checking the host 1.2.3.4
BPI Group -
Nagios XI - checks, critical
London (NRPE agent) - checks, critical
Dallas (NRPE agent) - checks, OK
You can tell BPI that if 1 of those 3 checks is OK, then the group is OK. See this document for a further explanation:
https://assets.nagios.com/downloads/nag ... BPI_v2.pdf
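The "1 of 3 is OK, so the group is OK" rule above can be sketched in a few lines of Python. This is only an illustration of the rule, not how BPI is implemented; the function name `group_state` and the `min_ok` parameter are made up for the example, and the state values follow the usual Nagios plugin return codes (0 = OK, 2 = CRITICAL).

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def group_state(results, min_ok=1):
    """Return OK if at least `min_ok` vantage points report OK, else CRITICAL.

    `results` maps a vantage point name to the state it reported.
    Both the function and `min_ok` are hypothetical, for illustration only.
    """
    ok_count = sum(1 for state in results.values() if state == OK)
    return OK if ok_count >= min_ok else CRITICAL


# The example from the post: two locations see 1.2.3.4 as critical, one sees it as OK.
results = {
    "Nagios XI": CRITICAL,
    "London (NRPE agent)": CRITICAL,
    "Dallas (NRPE agent)": OK,
}
state = group_state(results)  # OK, because one vantage point can reach the host
```

Raising `min_ok` makes the group stricter, and a group where every location reports critical always comes back CRITICAL, which is the "if ALL report down, it is in fact down" case from the first post.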
Re: monitoring with multiple monitors and reports to one
Posted: Fri Mar 25, 2016 9:44 am
There's a known issue with checks inside our own subnet, particularly against global site selectors. Most host and service checks work (even custom ones we've built), but those are our biggest problem. This is something that won't get fixed. So it's like, OK, fine, we'll move the monitor host to another subnet. Great, now we can monitor everything on the old subnet. Crap, now we can't monitor anything on the new subnet. Balls! There's no ACL or firewall rule blocking this. Hell, there's even an ACL rule saying allow traffic from subnet A to subnet A and quit being a dick about it. However, that still doesn't work. We do use NRPE, and that covers 90% of it, except when trying to monitor whether virtual IPs are reachable by ping, because once again we can't ping within our own subnet. Crap!
Re: monitoring with multiple monitors and reports to one
Posted: Fri Mar 25, 2016 2:09 pm
by ssax
Can the zone that you cannot reach connect out to your XI server (using passive checks)? If that would work, then you can use BPI or the check_cluster command to combine the results:
https://assets.nagios.com/downloads/nag ... sters.html
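For reference, here is a rough Python sketch of what a passive service result looks like when it is submitted through the Nagios Core external command file. The thread doesn't say which passive transport would be used, so treat this as one possible route; the helper name, the host name `vip-site-a`, and the service name `PING` are all made up for the example.

```python
import time


def passive_service_result(host, service, return_code, output, timestamp=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line.

    The helper itself is hypothetical; the line layout is
    [time] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return (
        f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


# Hypothetical host and service names; 0 is the plugin return code for OK.
line = passive_service_result("vip-site-a", "PING", 0, "OK - reachable from site B")
```

The resulting line would be written to the server's command file, whose path depends on the installation. The host and service must already be defined on the XI server with passive checks enabled for the result to be accepted.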
Re: monitoring with multiple monitors and reports to one
Posted: Mon Mar 28, 2016 8:39 am
They're virtual IPs for load balancing. But regardless of that, no, we can't reach anything within our own subnet at all. So anything we can't get NRPE installed on, such as a VIP, we can't run any checks on. And ping checks are all they'd be anyway, since they're just load-balancing IPs.
Re: monitoring with multiple monitors and reports to one
Posted: Mon Mar 28, 2016 3:58 pm
by bwallace
What do these load balancers use for their health checks? Might we just be able to have Nagios monitor those health check targets instead?
Re: monitoring with multiple monitors and reports to one
Posted: Tue Apr 05, 2016 10:13 am
We have an "internal" home-grown monitoring tool that checks the health of the service as a whole, but not an individual node like a global site selector virtual IP. We do actually monitor THAT monitor for other services and page out if it goes down. The trick here is that we want Nagios to let us know before our internal tool does. If the internal tool notifies us, a P1/P2 is opened and we have to fill out a bunch of crap and host a meeting on how/why we're going to avoid this in the future, blah blah blah, tons of hassle. HAH! However, if Nagios notifies us and we fix the issue before anything is actually impacted, then our work is done and we go on happy and healthy.

Re: monitoring with multiple monitors and reports to one
Posted: Tue Apr 05, 2016 4:17 pm
by rkennedy
I think check_by_ssh would work for this. That way, I assume you'll be able to set up specific checks geared towards which servers can reach which networks. It also saves you from configuring NRPE on multiple hosts.
Would this be an option for you?
Re: monitoring with multiple monitors and reports to one
Posted: Tue Apr 05, 2016 4:22 pm
by bwallace
In addition to check_by_ssh, what about check_tcp?
If I understand correctly, ICMP is disabled on their respective subnets, which is why ping checks won't work, right?
Looking around, I see there is a check_tcp plugin available, which might work since it 1) doesn't use ICMP and 2) just tries to establish a TCP connection, and that's it.
If you can find a way to get check_tcp running on/against these VIPs, then you can try implementing what rkennedy proposed earlier, which was to group these VIPs in BPI, where you can then define a 'real-world' status based on whatever check result combinations you specify.
Check out the check_tcp man page:
http://nagios-plugins.org/doc/man/check_tcp.html
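To show the idea behind a TCP-based check (no ICMP, just a connection attempt), here is a minimal Python sketch. It is not the check_tcp plugin itself and skips most of what the real plugin offers; the function name `tcp_check` and the output wording are made up, and only the return codes (0 = OK, 2 = CRITICAL) follow the Nagios plugin convention.

```python
import socket

# Nagios plugin return codes used by this sketch.
OK, CRITICAL = 0, 2


def tcp_check(host, port, timeout=5.0):
    """Try to open a TCP connection and report a Nagios-style result.

    Returns a (return_code, message) tuple. Hypothetical helper, not the
    real check_tcp plugin.
    """
    try:
        # create_connection performs the TCP handshake; no ICMP is involved.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"TCP OK - connected to {host}:{port}"
    except OSError as exc:
        # Covers refused connections, timeouts, and name resolution failures.
        return CRITICAL, f"TCP CRITICAL - {host}:{port}: {exc}"
```

Pointed at a VIP and the port the load balancer listens on, a check like this says whether the VIP accepts connections, which may be a closer match to "is the VIP usable" than a ping would be. It still has to run from a location that can actually reach the VIP.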