Global Distributed Monitoring with Failover

Post by **nagios_321** » Fri Jun 10, 2016 6:55 am

Hi,

I am interested to hear your suggestions for our Nagios XI implementation. I have done a fair amount of reading, but I’d just like to hear people’s opinions and ideas.

*We’ll need to monitor >2500 hosts: a mixture of Windows and Linux hosts, physical and virtual, as well as storage and network devices.
*We have a global datacentre presence, with 1000 hosts in a primary UK datacentre and the remaining hosts spread across 15 other data centres in Europe, the US and Asia.
*We will need to write custom checks, but for the most part standard out-of-the-box checks will apply, as we’ll be focusing very much on infrastructure monitoring (disk, CPU, memory, services etc.).
*We’ll need the ability to internally monitor all services from a central dashboard.
*We’ll need the ability for our customers to view hosts from a central dashboard.
*We will need DR failover.

We are leaning towards the Federated Distributed Monitoring Solution. I am working on the idea of three geographical Nagios XI instances (Europe, US, Asia). Each instance would be either an HA/DRBD active/active cluster or a single high-spec VM running on ESX.

We would elect our primary UK datacentre to act as a master Nagios instance for the purpose of NOC monitoring dashboards etc.

The geographical failover aspect is still a little unclear to me given the global coverage of our data centres. For example, we have a relatively small server footprint in Japan and Singapore (fewer than 100 hosts per site), so Hong Kong would be the nearest primary Asia datacentre for a regional Nagios XI instance. We have a 100MB line between Japan, Singapore and Hong Kong. But in the event of a catastrophic issue in Hong Kong, we'd lose the monitoring for all Japan and Singapore hosts. The same problem applies to our other smaller data centres in the US and Europe.

To summarise, we cannot have a dependency on a single regional datacentre.

So, in the event we lose access to a regional NagiosXI instance, I will need to send my service checks across continents, or double up on the NagiosXI instances in each region (NagiosXI instances in Hong Kong and Japan for example).
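To make the two options above concrete, here is a minimal Python sketch of a per-site failover order. It is only an illustration: the instance names (`xi-hongkong`, `xi-japan`, `xi-uk`), the `FAILOVER_ORDER` table and the `pick_instance` helper are all hypothetical, and how the health status is obtained is left out entirely. It is not a Nagios XI feature.

```python
from typing import Dict, List, Optional

# Hypothetical preference order per site: the regional instance first,
# then a second in-region instance ("doubling up"), then a cross-continent
# fallback to the primary UK datacentre.
FAILOVER_ORDER: Dict[str, List[str]] = {
    "japan": ["xi-hongkong", "xi-japan", "xi-uk"],
    "singapore": ["xi-hongkong", "xi-japan", "xi-uk"],
}


def pick_instance(site: str, healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy instance in the site's preference order.

    `healthy` maps instance name -> True/False; an instance missing from
    the map is treated as down. Returns None if nothing is reachable.
    """
    for instance in FAILOVER_ORDER.get(site, []):
        if healthy.get(instance, False):
            return instance
    return None


if __name__ == "__main__":
    # Hong Kong is down, so Japan's checks move to the in-region backup.
    status = {"xi-hongkong": False, "xi-japan": True, "xi-uk": True}
    print(pick_instance("japan", status))
```

Dropping `xi-japan` from the lists models the "send checks across continents" option; keeping it models the "double up per region" option.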

I am not too concerned with budget; we can afford the right solution. I just don’t want to go in heavy-handed unnecessarily.

Does anyone have any ideas/input on handling site-to-site failover? Either across the globe or across countries?

Cheers!

Post by **tmcdonald** » Fri Jun 10, 2016 1:20 pm

Addressing each point would make for a very long and involved post, and probably would stray into consulting territory. What I will say is that we partner with a company called LinBit who does an excellent job of building and supporting DRBD/HA setups. If you are interested I can pass along your information, or you can reach out to them here: https://www.linbit.com/en/

From a Nagios standpoint, a single XI server should be able to handle up to about 20,000 total hosts+services, assuming a 5-minute check interval. Anything beyond that and the reports tend to run a bit slow due to the sheer amount of data being combed through. You may want to consider passive checks, given the distributed nature of your environment.
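As a rough back-of-the-envelope sketch (Python, with hypothetical helper names), the figures above can be turned into a per-host service budget and an average check rate. The last helper formats a passive result in the Nagios external command format (`PROCESS_SERVICE_CHECK_RESULT`); the host and service names are made up, and how the line reaches the server (command file, NSCA, NRDP, etc.) is not shown.

```python
import time
from typing import Optional

XI_CAPACITY = 20_000        # hosts + services per XI server (figure quoted above)
CHECK_INTERVAL_S = 5 * 60   # 5-minute check interval (also quoted above)


def services_per_host_budget(hosts: int, capacity: int = XI_CAPACITY) -> float:
    """Average number of services per host that still fits the capacity."""
    return (capacity - hosts) / hosts


def checks_per_second(total_checks: int, interval_s: int = CHECK_INTERVAL_S) -> float:
    """Average check rate if every host and service is checked once per interval."""
    return total_checks / interval_s


def passive_result_line(host: str, service: str, state: int, output: str,
                        now: Optional[int] = None) -> str:
    """Format a passive service check result as a Nagios external command.

    state: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
    """
    ts = int(time.time()) if now is None else now
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{output}"


if __name__ == "__main__":
    print(services_per_host_budget(2500))        # service budget for 2500 hosts
    print(round(checks_per_second(XI_CAPACITY), 1))
    print(passive_result_line("hk-web01", "Disk /", 2, "DISK CRITICAL - 3% free"))
```

With 2500 hosts, the budget works out to an average of 7 services per host before a single server reaches the 20,000 figure, which is one reason splitting by region may still be worth it.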

Post by **eloyd** » Mon Jun 13, 2016 9:33 am

One word: mod_gearman. Of course, that's not really enough words to describe how to do what you want to do, but our solution would be to use mod_gearman, maybe a couple of Nagios Core collectors, and some replication back to centralized Nagios XI servers for alerting and reporting.

As @tmcdonald alluded to, this is a consulting question more than a support question.

Post by **hsmith** » Mon Jun 13, 2016 4:28 pm

You may also want to take a look at this video.
