Distributed Monitoring and active active failover

Post by **stravis007** » Mon Sep 23, 2019 11:36 pm

I have a large client that is looking to install several XI instances in several data centers. The idea would be that each XI instance would monitor hosts and services in its local data center, and feed all data back to a central Fusion instance. The question they keep coming back to is "How do I ensure that ALL hosts and services are being monitored when we're patching?" So if I patch Nagios server A in location A, how do I ensure that the hosts in location A are being monitored by the XI instance in location B, and vice versa?

Fusion will provide the single view of the whole environment, but how do we ensure that our monitoring environment is 100% up and monitoring 100% of all services and hosts at all times, even during a patch cycle (a reboot of a Nagios XI instance)?

My first thought is to use a MariaDB cluster to store all the data and point all Nagios XI instances at that cluster. Is this a viable solution? We would also have to ensure that no duplicate notifications go out.

Are there any better/other options?

Thanks in advance.

Post by **mbellerue** » Tue Sep 24, 2019 4:04 pm

Have you seen this document?
https://assets.nagios.com/downloads/nag ... ios-XI.pdf

Post by **stravis007** » Tue Sep 24, 2019 4:41 pm

Yes, I have. However, that doesn't really address 100% uptime. It allows for VMware failover or monitoring the server from another source, but it doesn't really address the question of how to maintain monitoring during a patching cycle.

I know the document speaks to some open source tools to create "your own cluster". However, I'm looking for a more "Nagios" supported method.

VMware HA failover is a great DR-type solution, but what if I take the server down on purpose for patching or an XI upgrade and I still want to maintain my monitoring?

Currently the thought is that I might create 2 separate instances of XI and, using a combination of database log shipping and Ansible-enabled configuration shipping, maintain an "almost" exact passive copy of XI. We could then use a load balancer to front the "cluster" (not really a cluster). The monitored endpoints would be unaware of the backend IP changes.

Shut one instance down ... start the second ... flip the load balancer ... patch ... very, very close to 0 downtime.

Ansible would be responsible for handling the forced failover to the passive node and back.

In this situation I'm looking at 2 independent, Nagios-supportable instances of XI. Or at least I think I am.
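The planned failover order described above (stop the active node, start the passive one, flip the load balancer, then patch) could be sketched roughly as follows. This is only an illustration of the ordering, not an Ansible playbook or anything Nagios ships: every callable passed in is a hypothetical placeholder for whatever actually stops XI, starts XI, or reconfigures the load balancer, and the health check before the flip is an added assumption rather than part of the plan as stated.

```python
from typing import Callable, List


def planned_failover(
    stop_active: Callable[[], None],
    start_passive: Callable[[], None],
    passive_is_healthy: Callable[[], bool],
    flip_load_balancer: Callable[[], None],
    patch_old_active: Callable[[], None],
) -> List[str]:
    """Run the failover steps in order and return a log of completed steps.

    All five callables are hypothetical hooks supplied by the caller.
    """
    log: List[str] = []

    # Only one instance should be checking and notifying at a time,
    # so the active node is stopped before the passive one is started.
    stop_active()
    log.append("stopped active")

    start_passive()
    log.append("started passive")

    # Assumption: verify the passive node before sending traffic to it.
    if not passive_is_healthy():
        raise RuntimeError("passive node unhealthy; load balancer not flipped")

    flip_load_balancer()
    log.append("flipped load balancer")

    # The old active node is now out of the path and safe to patch.
    patch_old_active()
    log.append("patched old active")
    return log


if __name__ == "__main__":
    # Dry run with stand-in steps that only print what they would do.
    steps = planned_failover(
        stop_active=lambda: print("stop XI on node A"),
        start_passive=lambda: print("start XI on node B"),
        passive_is_healthy=lambda: True,
        flip_load_balancer=lambda: print("point load balancer at node B"),
        patch_old_active=lambda: print("patch node A"),
    )
    print(steps)
```

Keeping the steps as injected callables means the same ordering can be exercised in a dry run before it is wired to real automation.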

Post by **mbellerue** » Wed Sep 25, 2019 10:32 am

stravis007 wrote: Currently the thought is that I might create 2 separate instances of XI and, using a combination of database log shipping and Ansible-enabled configuration shipping, maintain an "almost" exact passive copy of XI. We could then use a load balancer to front the "cluster" (not really a cluster). The monitored endpoints would be unaware of the backend IP changes.

Shut one instance down ... start the second ... flip the load balancer ... patch ... very, very close to 0 downtime.

Ansible would be responsible for handling the forced failover to the passive node and back.

In this situation I'm looking at 2 independent, Nagios-supportable instances of XI. Or at least I think I am.

That sounds like a great solution. Unfortunately, high availability is such a complex topic, with so many different approaches, that we can't really dive deep into any one of the solutions.

Given the solution you laid out here, the only issue I can think of that you'll run into is performance data. Performance data is stored in round robin database (RRD) files on the Nagios XI server in /usr/local/nagios/share/perfdata/. When you fail over from Primary to Secondary, there will be a gap in Primary's performance data for the period when it wasn't receiving data, because Secondary will be recording that data instead. That's only a data reporting issue, though, rather than a stability issue. So depending on who you ask, that can be a pretty low priority.

Post by **stravis007** » Wed Sep 25, 2019 12:00 pm

Thank you.

I think the plan is to keep the secondary node down and sync the data between the nodes, RRD data and all. Because it will all be behind load balancers, we should be able to keep those things in sync as well, with smaller data gaps. This should enable us to flip over to the secondary node during patching and just leave it there until the next patch cycle.
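As a rough idea of what syncing the RRD data to the stopped secondary node might look like, here is a minimal sketch that builds and runs an rsync command for the perfdata directory mentioned earlier in the thread. The standby hostname is a made-up placeholder, and using rsync over SSH with `--delete` to mirror the directory is an assumption, not something confirmed in this thread.

```python
import subprocess
from typing import List

# Perfdata location as given earlier in this thread.
PERFDATA_DIR = "/usr/local/nagios/share/perfdata/"


def build_rsync_command(standby_host: str, dry_run: bool = False) -> List[str]:
    """Build an rsync command that mirrors the perfdata directory.

    standby_host is a hypothetical hostname for the passive XI node.
    """
    # -a preserves permissions, ownership and timestamps;
    # --delete removes files on the standby that no longer exist locally.
    cmd = ["rsync", "-a", "--delete"]
    if dry_run:
        # Show what would change without copying anything.
        cmd.append("--dry-run")
    # The trailing slash on the source copies the directory contents.
    cmd += [PERFDATA_DIR, f"{standby_host}:{PERFDATA_DIR}"]
    return cmd


def sync_perfdata(standby_host: str, dry_run: bool = False) -> None:
    """Run the sync and raise CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(standby_host, dry_run), check=True)


if __name__ == "__main__":
    # Print the command only; "xi-standby.example.com" is a placeholder.
    print(" ".join(build_rsync_command("xi-standby.example.com", dry_run=True)))
```

Running something like this on a schedule, plus once more right before the flip, would keep the gap in the graphs down to roughly the time since the last sync.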

Post by **mbellerue** » Thu Sep 26, 2019 11:14 am

Okay, perfect. Sounds like you've got the RRD files covered, then. Does all of this give you a clear path forward with your project, or are there any more questions I can help with?

Post by **stravis007** » Thu Sep 26, 2019 11:43 am

I think I'm good ... thanks

Nagios Support Forum
