Global Distributed Monitoring with Failover
Posted: Fri Jun 10, 2016 6:55 am
Hi,
I am interested to hear your suggestions for our Nagios XI implementation. I have done a fair amount of reading, but I’d just like to hear people’s opinions and ideas.
*We’ll need to monitor >2500 hosts: a mixture of Windows and Linux hosts, as well as storage and network devices, both physical and virtual.
*We have a global datacentre presence, with 1000 hosts in a primary UK datacentre. The remaining hosts are spread across 15 other data centres in Europe, the US and Asia.
*We will need to write custom checks, but for the most part standard out-of-the-box checks will apply, as we’ll be focusing very much on infrastructure monitoring (disk, CPU, memory, services etc.)
*We’ll need the ability to internally monitor all services from a central dashboard
*We’ll need the ability for our customers to view hosts from a central dashboard
*We will need a DR failover
We are leaning towards the Federated Distributed Monitoring Solution. I am working on the idea of three geographical Nagios XI instances (Europe, US, Asia). Each instance would be either an HA/DRBD active/active cluster or a single high-spec VM running on ESX.
We would elect our primary UK datacentre to act as a master Nagios instance for the purpose of NOC monitoring dashboards etc.
The geographical failover aspect is still a little unclear to me given the global coverage of our data centres. For example, we have a relatively small server footprint in Japan and Singapore (fewer than 100 hosts per site), so Hong Kong would be the nearest primary Asia datacentre for a regional Nagios XI instance. We have a 100MB line between Japan, Singapore and Hong Kong. However, in the event of a catastrophic issue in Hong Kong, we'd lose the monitoring for all Japan and Singapore hosts. The same problem applies to our other smaller data centres in the US and Europe.
To summarise, we cannot have a dependency on a single regional datacentre.
So, in the event we lose access to a regional Nagios XI instance, I will need to either send my service checks across continents or double up on the Nagios XI instances in each region (instances in both Hong Kong and Japan, for example).
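To make the first option concrete, here is a rough Python sketch of the routing decision only. It is not a Nagios XI feature or API: the instance names (`xi-hongkong`, `xi-uk`), the `FAILOVER_ORDER` table and the `pick_instance` helper are all made up for illustration, and the set of healthy instances is assumed to come from some external health check.

```python
# Hypothetical preference list: for each site, the monitoring instances
# that could run its checks, nearest first. Names are illustrative only.
FAILOVER_ORDER = {
    "japan": ["xi-hongkong", "xi-uk"],
    "singapore": ["xi-hongkong", "xi-uk"],
    "hongkong": ["xi-hongkong", "xi-uk"],
}


def pick_instance(site, healthy):
    """Return the first preferred instance for `site` that is in `healthy`.

    `healthy` is a set of instance names currently reachable (assumed to be
    supplied by a separate health check). Returns None if nothing is left.
    """
    for instance in FAILOVER_ORDER.get(site, []):
        if instance in healthy:
            return instance
    return None


if __name__ == "__main__":
    # Normal operation: Japan is checked from Hong Kong.
    print(pick_instance("japan", {"xi-hongkong", "xi-uk"}))
    # Hong Kong lost: Japan's checks fall back across continents to the UK.
    print(pick_instance("japan", {"xi-uk"}))
```

Doubling up on instances within a region would simply mean putting a second in-region name (Japan, say) ahead of the cross-continent fallback in each list.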
I am not too concerned with budget; we can afford the right solution. I just don’t want to go in heavy-handed unnecessarily.
Does anyone have any ideas or input on handling site-to-site failover, either across the globe or across countries?
Cheers!