
High Availability

Posted: Tue Sep 30, 2014 1:31 pm
by BanditBBS
For HA, which does everyone think is better: the failover setup from Mike Weber's talk here: http://www.slideshare.net/nagiosinc/mike-weber-failover

or using VMware HA/FT? We want to do this between two datacenters so if storage and/or host and/or dc itself goes down we can have a working Nagios up and running quickly.

Thanks!

Re: High Availability

Posted: Tue Sep 30, 2014 1:59 pm
by mikew
Much has changed since I did that talk on failover. Many companies now use clustered VMware instances, which reduces the chances of problems. Larger companies are now opting to replicate their master instead of failing over to a running machine, as there is far less maintenance and downtime is minimal.

Advantages of Failover:
Relatively Simple Set Up
The set up described is relatively simple and works well for a mature system that will not change much. The actual time to failover is usually less than 7 minutes. This is a very short amount of time to be blind.

Disadvantages to a Failover:
Passive Checks
One thing that will impact your decision is whether you use passive checks. With passive checks you run into a problem: you do not want to be sending output to two servers. So the failover option that I used in the example does not work well with passive checks.

Constant Changes
If you make significant changes to the Nagios master you will need those replicated on the slave. If you add a new plugin with dependencies, you have to add it to the slave.

Re: High Availability

Posted: Tue Sep 30, 2014 2:20 pm
by abrist
mikew wrote:Disadvantages to a Failover:
Passive Checks
One thing that will impact your decision is whether you use passive checks. With passive checks you run into a problem: you do not want to be sending output to two servers. So the failover option that I used in the example does not work well with passive checks.
This problem can be mitigated with a virtual ip through ucarp, pacemaker, keepalived, etc.
mikew wrote: Constant Changes
If you make significant changes to the Nagios master you will need those replicated on the slave. If you add a new plugin with dependencies, you have to add it to the slave.
This can be reduced with a shared volume. You will still need to install packages on the secondary, but all of the nagios data/plugin locations can be moved to a shared volume (like drbd/nfs/etc).

I would suggest vmware HA if it is an option. Just make sure your san/disk io is *fast*. Otherwise I would suggest looking at the linux HA stack - specifically drbd/pacemaker.
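
To make the virtual IP idea a bit more concrete: with tools like ucarp, keepalived, or pacemaker, the standby only claims the shared address once the primary has stopped responding, so passive check senders always target a single address. Here is a minimal Python sketch of just that decision logic. It is not how any of those tools is actually implemented; the class name, the `record` method, and the threshold of 3 are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Decide when a standby node should hold the virtual IP.

    Hypothetical helper for illustration only. Real VIP managers
    (ucarp, keepalived, pacemaker) use their own protocols and are
    far more robust than this.
    """
    failures_to_promote: int = 3   # assumed threshold, tune for your environment
    consecutive_failures: int = 0
    standby_active: bool = False

    def record(self, primary_ok: bool) -> bool:
        """Record one health-check result for the primary.

        Returns True if the standby should currently hold the VIP.
        """
        if primary_ok:
            # Primary answered: reset the counter and hand the VIP back.
            # (Automatic failback is a simplification made in this sketch.)
            self.consecutive_failures = 0
            self.standby_active = False
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_to_promote:
                self.standby_active = True
        return self.standby_active
```

Requiring several failures in a row keeps a single dropped check from moving the address back and forth between the two servers.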

Re: High Availability

Posted: Tue Sep 30, 2014 2:51 pm
by Box293
BanditBBS wrote:using VMware HA/FT? We want to do this between two datacenters so if storage and/or host and/or dc itself goes down we can have a working Nagios up and running quickly.
This is actually a good use of VMware technology; however, you need some very low latency links between the datacenters for this to work. This is because the hosts at each datacenter need access to the same storage for HA/FT to work. Which kinda makes you wonder what location the storage is in ...

Re: High Availability

Posted: Wed Oct 01, 2014 4:27 pm
by BanditBBS
Yeah, I'm trying to think of ways to do this without the shared storage access issue.

The reason - What if our Chicago DC goes down (cable broken, tornado, volcano, etc.)? That is where our Nagios server and its storage are located. Customers that have servers in our other DCs, or our managed/not hosted customers, will want monitoring to be active until the Chicago DC comes back online. If the San Fran Nagios was using the same storage we wouldn't be able to fail over in that case. I'm starting to lean towards the daily XI backup that is ssh'd to the SF server, and if CHI ever goes down we can spend the few minutes to run the XI restore script on the other server. Sure, we won't have history and stuff, but we'd have active monitoring. We'd have the history back once CHI came back online (as long as it wasn't a volcano).

Thoughts?
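
One thing a plan like this depends on is that the daily backup really did land on the SF server. Below is a rough Python sketch of a freshness check that could run on the receiving side. The `*.tar.gz` pattern and the 26 hour limit are assumptions, and the function names are hypothetical; point it at whatever directory your backups are actually copied into.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup(backup_dir: str, pattern: str = "*.tar.gz") -> Optional[Path]:
    """Return the most recently modified backup archive, or None if there is none.

    The "*.tar.gz" pattern is an assumption about how the archives are named.
    """
    candidates = [p for p in Path(backup_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def is_stale(backup: Optional[Path],
             max_age_hours: float = 26.0,
             now: Optional[float] = None) -> bool:
    """Return True if there is no backup or the newest one is older than the limit.

    26 hours is an assumed limit: one day plus some slack for a slow transfer.
    """
    if backup is None:
        return True
    current = time.time() if now is None else now
    age_hours = (current - backup.stat().st_mtime) / 3600.0
    return age_hours > max_age_hours
```

Something like `is_stale(newest_backup("/path/to/copied/backups"))` could then be wrapped in a check so that a missing or old backup raises an alert before you ever need the restore.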

Re: High Availability

Posted: Thu Oct 02, 2014 9:30 am
by abrist
BanditBBS wrote:Thoughts?
This is a decent method. You could use rsync instead of (or in combination with) the backup script to keep things more up to date. The big questions deal with the databases - you could offload them and then replicate them to the other location. Or just replicate the most important tables for monitoring (the ql tables), and use a daily backup for the rest.
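
For the rsync part, here is a small Python sketch that only builds the command line, so it can be reviewed before anything is copied. The two source paths are assumed default install locations and the hostname in the usage note is a placeholder; `build_rsync_command` is a hypothetical helper, not part of XI. It does not touch the databases, which still need the backup or replication mentioned above.

```python
from typing import List, Sequence

# Assumed default install locations; adjust to match your system.
DEFAULT_PATHS = (
    "/usr/local/nagios/etc",
    "/usr/local/nagios/libexec",
)


def build_rsync_command(dest_host: str,
                        paths: Sequence[str] = DEFAULT_PATHS,
                        dry_run: bool = True) -> List[str]:
    """Build an rsync-over-ssh command that mirrors the given paths to dest_host.

    -a           archive mode (permissions, times, recursion)
    -z           compress data in transit
    --delete     remove files on the destination that no longer exist locally
    --relative   recreate the full source path under the destination root
    --dry-run    only report what would change (on by default in this sketch)
    """
    cmd = ["rsync", "-az", "--delete", "--relative", "-e", "ssh"]
    if dry_run:
        cmd.append("--dry-run")
    cmd.extend(paths)
    cmd.append(f"{dest_host}:/")
    return cmd
```

For example, `build_rsync_command("sf-nagios.example.com")` returns the argument list for a dry run; once the output looks right, the same call with `dry_run=False` could be passed to `subprocess.run(cmd, check=True)` from cron.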

Re: High Availability

Posted: Thu Oct 02, 2014 10:49 am
by BanditBBS
I've never had to use the restore script, so not sure on how well it operates.

What all is included when doing the backup in XI?

Re: High Availability

Posted: Thu Oct 02, 2014 11:26 am
by abrist
everything important:
databases, nagiosxi dir, core configs, libexec, mrtg configs, rrds, mrtg rrds.
You just need to pay attention to any third party additions like the oracle/vmware sdks, java, etc.

Re: High Availability

Posted: Sat Oct 11, 2014 9:49 pm
by eloyd
BanditBBS wrote:I've never had to use the restore script, so not sure on how well it operates.

What all is included when doing the backup in XI?
Backing up XI and restoring XI is a piece of cake with the backup/restore script - so long as you're just using XI. As Andy said, you need to watch for add-ons and so forth. We just did this for a customer to prove that our disaster recovery worked (snapshotted the machine first, of course). Even installed it to a different box. Easy peasy, "one and done" kind of operation.

Re: High Availability

Posted: Mon Oct 13, 2014 9:22 am
by abrist
eloyd wrote: Easy peasy, "one and done" kind of operation.
The light side of disaster recovery is very simple. It gets much more complicated in HA/minimal downtime configurations, and even more difficult in large federated models.