XI installation and failover design

gormank · Post by **gormank** » Wed Apr 01, 2015 2:23 pm

We have 2 sites in separate states, each with ~60 devices to be monitored. Most are Windows or Linux servers, along with storage and other devices. Each site is a mirror of the other, with identical hardware, software and configuration. There is a primary and a failover site.

We are beginning to set up Nagios XI to monitor the systems. We have a total of 3 XI subscriptions, one for each site as well as another for a lab. I know this is excessive.

There is a connection between sites with limited bandwidth, so there's a desire to limit dataflow between sites. There are also very few ports open between sites.

My question is what's the most practical (or best practices) installation? I had planned to set up 2 Nagios servers per site; one primary and one failover. Each primary would monitor its site. In order to allow monitoring of both sites from either installation, I had planned to transfer data from each site to the other.

Since there are a small number of devices, I'm wondering if the above is overkill, and if there will be issues with maintaining users, configuration, etc. between primaries, failovers, as well as sites. For instance, are users replicated between a primary/failover pair, or will I need to maintain 4 sets of users, one for each XI server?

If I were to go with a 2 server setup with a primary and failover, what problems can I expect in the event of a failure of the primary and transfer to the failover?

Thanks

jdalrymple · Post by **jdalrymple** » Wed Apr 01, 2015 3:51 pm

The "obvious" solution there is to run identical XI configs with the exception of inverse hostgroups based upon geography, then use gearman to distribute the checks.
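
A minimal sketch of the routing idea, for illustration only. It assumes Mod-Gearman's convention of placing checks for members of a listed hostgroup on a queue named `hostgroup_<name>`, with everything else going to the default `host` queue; verify this against the Mod-Gearman documentation for your version. The host and hostgroup names are hypothetical, and this models only the queue decision, not Mod-Gearman itself.

```python
def queue_for_host_check(host, hostgroups_by_host, routed_hostgroups):
    """Return the queue a host check would land on in this simplified model.

    routed_hostgroups is the ordered list of hostgroups that get their own
    queue; a worker at the remote site would subscribe to that queue.
    """
    member_of = hostgroups_by_host.get(host, set())
    for group in routed_hostgroups:
        if group in member_of:
            # Assumed naming convention: hostgroup_<name>
            return f"hostgroup_{group}"
    # Anything not in a routed hostgroup stays on the default host queue.
    return "host"


# Hypothetical inventory: one hostgroup per geography.
hostgroups_by_host = {
    "web01.site-a": {"site-a"},
    "web01.site-b": {"site-b"},
}

# The XI server at site A hands site B's checks to a worker at site B.
print(queue_for_host_check("web01.site-b", hostgroups_by_host, ["site-b"]))
print(queue_for_host_check("web01.site-a", hostgroups_by_host, ["site-b"]))
```

The XI server at the other site would use the same host definitions but list the opposite hostgroup, which is one way to read "inverse hostgroups based upon geography".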

You can then also have backup XI servers just sitting there cold, collecting SCP backups from the other site. This could potentially require a decent amount of bandwidth, though (perhaps run it off hours?), because the backups contain all of your perfdata, which can get fairly large.

The biggest thing to look out for when doing your failover server is the ability to come up and assume the identity of the primary server so that passive checks work.

I imagine this post will raise more questions - feel free to reach back out.

mp4783 · Post by **mp4783** » Wed Apr 01, 2015 4:18 pm

Sadly, Nagios does not have a good answer for this. Where I work, I created utilities that back up Nagios on the primary server and ship the backups to the DR server, where they can be restored if necessary. The backups are made every hour, so at worst you lose only an hour's data. Any passive checks would have to be sent to both servers.

This is an interim solution at best. My company cannot afford monitoring downtime. So I'm thinking about the following:

- Realtime replication of the MySQL and PostgreSQL databases, with a tightly coupled rsync job to keep configuration files in sync. The challenge here is that any reference to the hostname or address of the primary would need to be changed on the DR server. On the DR server, you would keep all monitoring disabled via the external command pipeline and turn it on only when the primary has failed. Passive checks would need to be sent to all Nagios XI servers.

- Another idea would be to have the Nagios XI servers communicate with each other and "share" the load across themselves. Same principle as above, where you actively replicate the configuration from a master to slaves, even though they would all be sharing the load. You suppress checks via the external command pipeline (or other mechanisms) that are not being handled by that particular server.

- A third idea would be to have your databases on an external server (both MySQL and PostgreSQL). You would use whatever redundancy mechanisms you like to make sure these servers were available. You would then need a mechanism to rationalize and synchronize the configuration files. Again, you would need to dynamically control the checks that were active on any one server. I suppose an easier way to deal with this would be to simply shut down the Nagios core monitoring process and only bring it online when you fail over. I'm not really sure of the consequences of this, as the service check scheduler might get cranky.
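
The "disable on the DR server, enable on failover" step in these ideas could be sketched roughly as follows. This is only an illustration: it uses standard Nagios Core external commands, but the command file path is an assumed default (check `command_file` in your `nagios.cfg`), and the choice of which commands make up a "standby" role is an assumption, not something prescribed in this thread. Passive check acceptance is deliberately left untouched so the standby can still receive results.

```python
import time

# Assumed default location; confirm against command_file in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"

# Standard Nagios Core external commands (global, no arguments).
STANDBY_COMMANDS = [
    "STOP_EXECUTING_HOST_CHECKS",
    "STOP_EXECUTING_SVC_CHECKS",
    "DISABLE_NOTIFICATIONS",
]
ACTIVE_COMMANDS = [
    "START_EXECUTING_HOST_CHECKS",
    "START_EXECUTING_SVC_CHECKS",
    "ENABLE_NOTIFICATIONS",
]


def format_command(name, now=None):
    """Format one external command line: [epoch_seconds] COMMAND."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] {name}\n"


def build_role_commands(active, now=None):
    """Return the command lines that put a server in the active or standby role."""
    names = ACTIVE_COMMANDS if active else STANDBY_COMMANDS
    return [format_command(name, now) for name in names]


def send_commands(lines, command_file=COMMAND_FILE):
    """Write the command lines to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.writelines(lines)
```

On the DR server you would call `send_commands(build_role_commands(active=False))` after each config sync or restart, and `send_commands(build_role_commands(active=True))` when the primary is declared failed.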

I've not played with the Mod Gearman stuff too much. I have it set up on a lab server and if you read the documentation, it does show various load balancing and failover scenarios, but lacks a configuration synchronization mechanism.

These are very "pie-in-the-sky" right now. I haven't given this in-depth thought.

jdalrymple · Post by **jdalrymple** » Wed Apr 01, 2015 4:35 pm

mp4783 wrote: Sadly, Nagios does not have a good answer for this.

<snip>

These are very "pie-in-the-sky" right now. I haven't given this in-depth thought.

mp4783, I think you're seeking HA. Your solutions are interesting, but I think they miss the point of what OP was seeking, which is DR. Can you confirm, gormank? From what I read, you're looking to have instances running in both locations, but not in an HA configuration, and simply to have a way to recover the primary at either site to the secondary site in the event that the primary site fails.

gormank · Post by **gormank** » Wed Apr 01, 2015 4:44 pm

I'm thinking of both, in a way. I come from an HA background and tend to think in those terms. My understanding is that Nagios has no HA and trying to make it redundant is going to be a pain.
At this point, I'm looking first for an answer on the practicality or impracticality of my 4 server "solution," and/or suggestions on how to design a reliable multi-site installation of Nagios XI. Since every other bit of hardware and software has a twin to fail over to in the other site, I have to have 2 Nagios servers at least. In the event of a failure, there has to be a way to fail over.
I have no desire to maintain 4 (or even 2) sets of identical users or configs.

jdalrymple · Post by **jdalrymple** » Wed Apr 01, 2015 5:01 pm

Is an active instance in both locations necessary? The checks crossing between sites will produce minimal traffic.

Do you have a prescribed RPO/RTO you have to meet?

I think gearman in secondary site, active instance in primary site and cold failover system in the secondary site. It's very simple.

Do you have layer 2 stretched between sites? If not you will have to do some DNS trickery at failover if you have any passive checks. That's the biggest gotcha.

gormank · Post by **gormank** » Wed Apr 01, 2015 5:15 pm

Is an active instance in both locations necessary? -- No, that's just how I pictured it. Each instance monitoring the other.

Do you have a prescribed RPO/RTO you have to meet? -- No, but losing data would be frowned upon.

I think gearman in secondary site, active instance in primary site and cold failover system in the secondary site. It's very simple. -- Sorry, failover in this setup is manual, right? Simple sounds good.

Do you have layer 2 stretched between sites? If not you will have to do some DNS trickery at failover if you have any passive checks. That's the biggest gotcha. -- No.

jdalrymple · Post by **jdalrymple** » Thu Apr 02, 2015 9:19 am

Is an active instance in both locations necessary? -- No, that's just how I pictured it. Each instance monitoring the other.

Monitoring "each other" could just use some active checks such as the builtin check_http or check_nagios. You could extend those checks by having an event handler that would react to an outage and bring your secondary version of the other site's instance online.
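
As a rough sketch of the decision such an event handler might make: Nagios can pass the `$SERVICESTATE$` and `$SERVICESTATETYPE$` macros to an event handler command as arguments, and the handler would normally act only on a confirmed (HARD) CRITICAL rather than on SOFT retries. The function names below are hypothetical, and the action is just a returned string standing in for whatever would actually bring the standby instance online.

```python
def should_fail_over(state, state_type):
    """Act only when the peer check is CRITICAL and the state is confirmed (HARD)."""
    return state == "CRITICAL" and state_type == "HARD"


def handle_peer_event(state, state_type):
    """Return the action the handler would take for one state change.

    state and state_type are expected to carry the values of the
    $SERVICESTATE$ and $SERVICESTATETYPE$ macros.
    """
    if should_fail_over(state, state_type):
        # Placeholder: start the standby instance, re-enable checks, update DNS, etc.
        return "activate-standby"
    return "no-action"


print(handle_peer_event("CRITICAL", "SOFT"))
print(handle_peer_event("CRITICAL", "HARD"))
```

Ignoring SOFT states keeps a single missed check across the limited inter-site link from triggering a failover.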

Do you have a prescribed RPO/RTO you have to meet? -- No, but losing data would be frowned upon.

I'm a bit worried about the backup-and-ship-to-warm-spare method, given your desire to limit bandwidth between sites. You could actually get a low RPO by offloading your perfdata and databases. However, if your warm spare came online at either site, you would then be running "across campus", which may not be possible. If not, you would have to find a way to sync your offloaded perfdata, and you would probably want to implement your own db shipping. The db backups (especially in an environment of only 120 hosts or so) would have a small footprint and should be very mobile; you'd just have to come up with your own backup and ship method, as the builtin one includes the perfdata.

I think gearman in secondary site, active instance in primary site and cold failover system in the secondary site. It's very simple. -- Sorry, failover in this setup is manual, right? Simple sounds good.

Yes, but it could probably be at least partially or fully automated if you wanted to get clever with event handlers as mentioned above. The K.I.S.S. method falls apart more and more the more you try to automate things, but this should be fairly doable.

Do you have layer 2 stretched between sites? If not you will have to do some DNS trickery at failover if you have any passive checks. That's the biggest gotcha. -- No.

DNS trickery is going to be either your best friend or worst enemy... unless you don't do ANY passive checks. Also when configuring all your agents you'll have to keep multiple IPs in mind for access.

mp4783 · Post by **mp4783** » Thu Apr 02, 2015 5:45 pm

The backup, ship, and restore approach is a bandwidth hog. The compressed backup files average 150 MB+. I have the advantage of working at a company that has lots of bandwidth, so this solution might only work for those in similar situations.

jdalrymple · Post by **jdalrymple** » Fri Apr 03, 2015 8:53 am

I agree that, from a bandwidth perspective, backup/ship/restore is not ideal. OP indicated only 120 hosts total, though, which shouldn't produce anything near your 150 MB compressed backup. I'd guess about an order of magnitude less or so.
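
To put rough numbers on this, here is a back-of-the-envelope sketch. It takes the 150 MB hourly backup mentioned earlier in the thread and a guessed 15 MB backup (one order of magnitude less), and assumes a hypothetical 2 Mbit/s inter-site link; the link speed is not from this thread. Units are decimal (1 MB = 8 Mbit).

```python
def average_mbit_per_s(backup_mb, interval_s):
    """Average link rate needed to ship one backup per interval."""
    return backup_mb * 8 / interval_s


def transfer_minutes(backup_mb, link_mbit_per_s):
    """Time to ship one backup if it gets the whole link to itself."""
    return backup_mb * 8 / link_mbit_per_s / 60


HOUR_S = 3600
LINK_MBIT_PER_S = 2  # hypothetical link speed, for illustration only

for size_mb in (150, 15):
    rate = average_mbit_per_s(size_mb, HOUR_S)
    minutes = transfer_minutes(size_mb, LINK_MBIT_PER_S)
    print(f"{size_mb} MB hourly: {rate:.3f} Mbit/s average, {minutes:.1f} min per transfer")
```

Under these assumptions the larger backup needs about a third of a megabit per second on average, or ten minutes of a saturated link every hour, while the smaller one needs a tenth of that.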

Nagios Support Forum
