XI installation and failover design

mp4783 · Post by **mp4783** » Mon Apr 13, 2015 7:54 am

The load balancing option is certainly attractive if you're using one that would allow for "intelligent" failover if a Nagios XI server drops.

I hadn't considered the VMware solution, but we're a big user so that's also a great idea.

Thanks.

abrist · Post by **abrist** » Mon Apr 13, 2015 11:23 am

mp4783 wrote:The load balancing option is certainly attractive if you're using one that would allow for "intelligent" failover if a Nagios XI server drops.

I think this requires finger quotes: "Load Balancing". XI (more specifically, core) does not actually load balance or cluster well. The reason for virtual IPs is to simplify the agent configuration on the remote hosts. In a failover scenario, the VIP can change hosts while the agents continue reporting back to the same IP.

You can cluster a few services like mysql, mrtg, and postgres, but in reality, only one core parent process should be running at any given time.

mp4783 · Post by **mp4783** » Mon Apr 20, 2015 2:49 pm

I did sort of trivialize things. What you need is strong session affinity in the load balancer, a mechanism to detect loss of a Nagios XI server, and a method to reapportion the load. This again would have to involve a mechanism that would tell Nagios to start active service checks. Passive service checks could potentially take care of themselves.

abrist · Post by **abrist** » Mon Apr 20, 2015 3:41 pm

mp4783 wrote: This again would have to involve a mechanism that would tell Nagios to start active service checks.

Exactly this. I find it easier to let the virtual IP service (pacemaker, uCarp, etc.) handle starting the services on the secondary in a failover scenario than to try to run two simultaneous nagios processes. When the primary goes down, the VIP will move to the secondary. Most VIP/cluster management utilities will then kick off a number of scripts, and one of them could start all the necessary nagios-related services on the secondary. You will still need to handle stonith somehow to ensure that the original primary has been completely stopped/put out of its misery.
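As a rough illustration of the kind of script a VIP manager might kick off on takeover, here is a minimal sketch. The function name `start_monitoring_services` and the `SERVICE_CMD` override are hypothetical and not part of pacemaker or uCarp; the service names in the usage comment are assumptions that will vary by distribution and XI version.

```bash
#!/bin/bash
# Hypothetical takeover hook: start each named service on the secondary.
# SERVICE_CMD is an assumed override so the function can be dry-tested;
# it defaults to the standard "service" wrapper.
start_monitoring_services() {
    local runner="${SERVICE_CMD:-service}"
    local svc failed=0
    for svc in "$@"; do
        if ! "$runner" "$svc" start; then
            echo "failed to start ${svc}" >&2
            failed=1
        fi
    done
    # Non-zero if any service failed, so the cluster manager can react.
    return "$failed"
}

# Example call from the hook (service names are assumptions):
#   start_monitoring_services mysqld postgresql nagios
```

Starting the databases before nagios is deliberate in the example call, since the monitoring process depends on them. This does nothing about fencing the old primary, which still has to be handled separately.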

mp4783 · Post by **mp4783** » Mon Apr 27, 2015 8:07 am

Firstly, thanks for the cartoon, it started my Monday morning off with a laugh.

In my case, I'm looking for an active load-balanced solution similar to a MySQL cluster (conceptually) where all nodes are active and the load is dynamically and logically balanced amongst the nodes. Our environment could eventually contain hundreds of Nagios XI servers, so we need to be able to drop in new servers and have the load rebalance itself. I realize I am trivializing what could be a major development effort.

The alternative in our case would be to implement your suggestion and simply put in 2 Nagios XI servers (in different data centers) with one acting as the primary and the other as the failover. This, by the way, is how a lot of our stuff is architected today, although we don't use VIPs.

Lastly, as your cartoon suggests, preventing a situation where none of the nodes can figure out which hosts it's supposed to monitor would be a very high priority. Intrinsically safe systems such as avionics, life support equipment, etc. have very deterministic methods for doing this, so we would need to consider something like that before attempting to implement this.

abrist · Post by **abrist** » Mon Apr 27, 2015 12:23 pm

mp4783 wrote:Firstly, thanks for the cartoon, it started my Monday morning off with a laugh.

I believe the original artist is one of the DRBD techs.

mp4783 wrote: In my case, I'm looking for an active load-balanced solution similar to a MySQL cluster (conceptually) where all nodes are active and the load is dynamically and logically balanced amongst the nodes. Our environment could eventually contain hundreds of Nagios XI servers, so we need to be able to drop in new servers and have the load rebalance itself.

This is really the holy grail. Unfortunately, nagios core does not handle dynamic reallocation of checks between servers. Maybe someday a central config management system could be used with redis or similar check/result queues, as this would allow multiple core processes to load balance by pulling from or inserting onto distributed queues.

mp4783 wrote:I realize I am trivializing what could be a major development effort.

mp4783 wrote:The alternative in our case would be to implement your suggestion and simply put in 2 Nagios XI servers (in different data centers) with one acting as the primary and the other as the failover. This, by the way, is how a lot of our stuff is architected today, although we don't use VIPs.

This is currently the best method given the nature of nagios core at the moment. Doubly so as it fits your current practices. I really want to stress how great virtual ips are though. They really do trivialize a few of the biggest speed bumps in the creation of a failover architecture.

mp4783 wrote: Lastly, as your cartoon suggests, preventing a situation where none of the nodes can figure out which hosts it's supposed to monitor would be a very high priority. Intrinsically safe systems such as avionics, life support equipment, etc. have very deterministic methods for doing this, so we would need to consider something like that before attempting to implement this.

Stonith (S.hoot T.he O.ther N.ode I.n T.he H.ead) is the concept. This is *best* done with an actual stonith device (server management cards and UPSes sometimes support this), but you can use a software method through ssh or similar (not the best idea though).

Honestly, the full Linux HA stack or some custom combination of pacemaker/drbd/stonith/custom scripts are really the best options.

mp4783 · Post by **mp4783** » Fri May 01, 2015 7:41 am

Agreed on all points. If it weren't for the bandwidth overhead, DRBD to a DR server with the Nagios collector/scheduler shut down and VIPs would be best.

Just carve out a separate device/filesystem for the Nagios installation, shut down the Nagios collector/scheduler, and turn on DRBD. In a failure, shift the VIP to the DR server and fire up the collector. In theory, you could do something very similar to this using MySQL and PostgreSQL replication, along with rsync. This might actually give you better control as you wouldn't be dealing with the blind "copying" of DRBD.

abrist · Post by **abrist** » Fri May 01, 2015 10:54 am

mp4783 wrote:In theory, you could do something very similar to this using MySQL and PostgreSQL replication, along with rsync

Most definitely. I think I covered this briefly in my presentation. You would want to offload what you can: mysql, postgres, and potentially NFS shares for perfdata and mrtg configs/rrds (you could also offload mrtg if you do large quantities of bandwidth checks), and then rsync the rest:

Network share:

Code: Select all

/usr/local/nagios/share/perfdata/*
/var/lib/mrtg/*  
/etc/mrtg/*

Rsync:

Code: Select all

/usr/local/nagios/*
/usr/local/nagiosxi/*

Offload:

Code: Select all

/var/lib/pgsql/ 
/var/lib/mysql/
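The Rsync portion of the list above could be driven by something like the following. This is only a sketch: the function name `sync_to_secondary`, the `DRY_RUN` switch, and the host name `xi-secondary` are hypothetical, and it assumes key-based ssh access from the primary to the secondary.

```bash
#!/bin/bash
# Hypothetical helper: mirror each listed directory to the same path on the
# secondary. With DRY_RUN set, the rsync commands are printed instead of run.
sync_to_secondary() {
    local secondary="$1"; shift
    local dir
    for dir in "$@"; do
        # -a preserves permissions, ownership and timestamps;
        # --delete removes files that no longer exist on the primary.
        if [ -n "${DRY_RUN:-}" ]; then
            echo rsync -a --delete "${dir}/" "${secondary}:${dir}/"
        else
            rsync -a --delete "${dir}/" "${secondary}:${dir}/"
        fi
    done
}

# Example (host name is an assumption):
#   sync_to_secondary xi-secondary /usr/local/nagios /usr/local/nagiosxi
```

Running it once with `DRY_RUN=1` first is a cheap way to confirm the source and destination paths before `--delete` is allowed to touch the secondary.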

gsl_ops_practice · Post by **gsl_ops_practice** » Tue Aug 11, 2015 12:45 pm

Hello all,

We are in a similar situation but I hope our dilemma is a bit easier to solve.

We have 2 geographically separate sites, each one has a NagiosXI instance and we have a decent WAN link between the two. Each NagiosXI instance is able to monitor the local and remote sites and the NagiosXI configuration is identical on both sites (I make a change on one, then back up and restore on the other to replicate my changes).

The issue we have is that only one of these NagiosXI servers can have its notifications enabled, or we end up with 2 emails for every alarm. Is there a way we can automatically enable notifications on the second XI host when the primary is deemed to be down?

Thanks,
Alex

jolson · Post by **jolson** » Tue Aug 11, 2015 2:13 pm

gsl_ops_practice,

This isn't a problem - in fact I just finished answering this in a ticket of mine:

Disable all active host and service checks and notifications on your secondary server:

Access the nagios.cfg file:

Code: Select all

vi /usr/local/nagios/etc/nagios.cfg

Change:
execute_host_checks=1
execute_service_checks=1
enable_notifications=1

To:
execute_host_checks=0
execute_service_checks=0
enable_notifications=0

Restart Nagios:

Code: Select all

service nagios restart
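The manual edit above could also be wrapped in a small helper so the same three directives can be switched off on the secondary and back on later. A minimal sketch, assuming GNU sed and that each directive appears on its own line as `key=value`; the function name `set_check_flags` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: set the three directives to the given value (0 or 1)
# in a nagios.cfg-style file. Other lines are left untouched.
set_check_flags() {
    local cfg="$1" value="$2"
    local key
    for key in execute_host_checks execute_service_checks enable_notifications; do
        # Anchor at the start of the line so commented-out copies are ignored.
        sed -i "s/^${key}=.*/${key}=${value}/" "$cfg"
    done
}

# Example on the secondary, followed by a restart as described above:
#   set_check_flags /usr/local/nagios/etc/nagios.cfg 0
```

Matching on `key=.*` rather than on a specific old value means the call has the same effect whatever state the file was in, which helps when failing back.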

Scripted, you might come up with something like this:

Make failover.sh (run this script on a cronjob every 1-10 minutes):

Code: Select all

vi /root/failover.sh

Code: Select all

#!/bin/bash

# Check the primary host.
# primarycheck.sh is your custom check to determine whether Nagios-Primary is
# working properly. I am assuming that this check exits '0' if your primary is
# fine, and exits with any other number if there are problems.
/root/primarycheck.sh
exitc=$?

if [ "$exitc" -ne 0 ]; then
    # Activate the relevant config settings and restart nagios.
    /bin/sed -i 's/execute_host_checks=0/execute_host_checks=1/' /usr/local/nagios/etc/nagios.cfg
    /bin/sed -i 's/execute_service_checks=0/execute_service_checks=1/' /usr/local/nagios/etc/nagios.cfg
    /bin/sed -i 's/enable_notifications=0/enable_notifications=1/' /usr/local/nagios/etc/nagios.cfg
    /etc/init.d/nagios restart
    exit 2
else
    # Do nothing.
    exit 0
fi

Note that the above script is something I threw together for demonstration purposes - you'd likely want it to be a little more robust, but I think it gets the idea across. Your primary check could be as simple as a ping check (which of course could be risky) or as complicated as checking an email account for fresh nagios emails.
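One way to make a simple primary check less risky is to require several failed probes in a row before declaring the primary down. The following is a sketch only: the function name `check_primary`, the `RETRY_DELAY` variable, and the host name in the usage comment are hypothetical, and the probe can be any command that exits 0 when the primary looks healthy.

```bash
#!/bin/bash
# Hypothetical check: run the probe command up to N times.
# Returns 0 as soon as one attempt succeeds, 1 if every attempt fails.
check_primary() {
    local attempts="$1"; shift
    local delay="${RETRY_DELAY:-5}"   # seconds between attempts (assumed default)
    local i
    for ((i = 1; i <= attempts; i++)); do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Pause between attempts, but not after the last one.
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Example body for /root/primarycheck.sh (host name is an assumption):
#   check_primary 3 ping -c 1 -W 2 nagios-primary.example.com
```

Because the function's return code follows the same convention the failover script expects (0 for healthy, non-zero otherwise), it can be the last command in primarycheck.sh. A single dropped ping then no longer triggers a failover.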
