XI installation and failover design

mp4783
Posts: 116
Joined: Wed May 14, 2014 11:11 am

Re: XI installation and failover design

Post by mp4783 »

The load balancing option is certainly attractive if you're using one that would allow for "intelligent" failover if a Nagios XI server drops.

I hadn't considered the VMware solution, but we're a big user so that's also a great idea.

Thanks.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: XI installation and failover design

Post by abrist »

mp4783 wrote:The load balancing option is certainly attractive if you're using one that would allow for "intelligent" failover if a Nagios XI server drops.
I think this requires finger quotes: "Load Balancing". XI (more specifically Core) does not actually load balance/cluster well. The reason for virtual IPs is to simplify the agent configuration on the remote hosts. In a failover scenario, the VIP can change hosts while the agents continue reporting back to the same IP.

You can cluster a few services like mysql, mrtg, and postgres, but in reality, only one core parent process should be running at any given time.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
mp4783
Posts: 116
Joined: Wed May 14, 2014 11:11 am

Re: XI installation and failover design

Post by mp4783 »

I did sort of trivialize things. What you need is strong session affinity in the load balancer, a mechanism to detect loss of a Nagios XI server, and a method to reapportion the load. This again would have to involve a mechanism that would tell Nagios to start active service checks. Passive service checks could potentially take care of themselves.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: XI installation and failover design

Post by abrist »

mp4783 wrote: This again would have to involve a mechanism that would tell Nagios to start active service checks.
Exactly this. I find it easier to let the virtual IP service (whether pacemaker, uCarp, etc.) handle starting the services on the secondary in a failover scenario than to try to run two simultaneous Nagios processes. When the primary goes down, the VIP will move to the secondary. Most VIP/cluster management utilities will then kick off a number of scripts, one of which could start all the necessary Nagios-related services on the secondary. You will still need to handle stonith somehow to ensure that the original primary has been completely stopped/put out of its misery.
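As a rough illustration of the kind of script a VIP/cluster manager could kick off on a role change, here is a minimal sketch. The function name `vip_hook`, the `DRY_RUN` switch, and the service list (`nagios npcd`) are assumptions made for the example, not something the thread specifies; a real XI install may need more services started.

```shell
#!/bin/bash
# Hypothetical hook invoked by the VIP/cluster manager when this node
# gains or loses the virtual IP. Service names are assumptions.
NAGIOS_SERVICES="${NAGIOS_SERVICES:-nagios npcd}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# vip_hook up   -> this node just acquired the VIP: start monitoring
# vip_hook down -> this node lost the VIP: stop monitoring
vip_hook() {
    local action svc
    case "$1" in
        up)   action=start ;;
        down) action=stop ;;
        *)    echo "usage: vip_hook up|down" >&2; return 64 ;;
    esac
    for svc in $NAGIOS_SERVICES; do
        run service "$svc" "$action"
    done
}
```

The cluster manager would call `vip_hook up` on the node that takes over the VIP and `vip_hook down` on the node that gives it up. Keeping `DRY_RUN=1` as the default makes it possible to rehearse a failover without touching running services.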
Image
mp4783
Posts: 116
Joined: Wed May 14, 2014 11:11 am

Re: XI installation and failover design

Post by mp4783 »

Firstly, thanks for the cartoon, it started my Monday morning off with a laugh.

In my case, I'm looking for an active load-balanced solution similar to a MySQL cluster (conceptually) where all nodes are active and the load is dynamically and logically balanced amongst the nodes. Our environment could eventually contain hundreds of Nagios XI servers, so we need to be able to drop in new servers and have the load rebalance itself. I realize I am trivializing what could be a major development effort.

The alternative in our case would be to implement your suggestion and simply put in 2 Nagios XI servers (in different data centers) with one acting as the primary and the other as the failover. This, by the way, is how a lot of our stuff is architected today, although we don't use VIPs.

Lastly, as your cartoon suggests, preventing a situation where none of the nodes can figure out which hosts it's supposed to monitor would be a very high priority. Intrinsically safe systems such as avionics, life support equipment, etc. have very deterministic methods for doing this, so we would need to consider something like that before attempting to implement this.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: XI installation and failover design

Post by abrist »

mp4783 wrote:Firstly, thanks for the cartoon, it started my Monday morning off with a laugh.
I believe the original artist is one of the DRBD techs. :)
mp4783 wrote: In my case, I'm looking for an active load-balanced solution similar to a MySQL cluster (conceptually) where all nodes are active and the load is dynamically and logically balanced amongst the nodes. Our environment could eventually contain hundreds of Nagios XI servers, so we need to be able to drop in new servers and have the load rebalance itself.
This is really the holy grail. Unfortunately, Nagios Core does not handle dynamic reallocation of checks between servers. Maybe someday a central config management system could be used with redis or similar check/result queues, as this would allow multiple core processes to load balance by pulling off or inserting onto distributed queues.
mp4783 wrote:I realize I am trivializing what could be a major development effort.
:)
mp4783 wrote:The alternative in our case would be to implement your suggestion and simply put in 2 Nagios XI servers (in different data centers) with one acting as the primary and the other as the failover. This, by the way, is how a lot of our stuff is architected today, although we don't use VIPs.
This is currently the best method given the nature of nagios core at the moment. Doubly so as it fits your current practices. I really want to stress how great virtual ips are though. They really do trivialize a few of the biggest speed bumps in the creation of a failover architecture.
mp4783 wrote: Lastly, as your cartoon suggests, preventing a situation where none of the nodes can figure out which hosts it's supposed to monitor would be a very high priority. Intrinsically safe systems such as avionics, life support equipment, etc. have very deterministic methods for doing this, so we would need to consider something like that before attempting to implement this.
Stonith (S.hoot T.he O.ther N.ode I.n T.he H.ead) is the concept. This is *best* done with an actual stonith device (server management cards and UPSes sometimes support this), but you can use a software method through ssh or similar (not the best idea though).

Honestly, the full Linux HA stack, or some custom combination of pacemaker/drbd/stonith/custom scripts, is really the best option.
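For the software method through ssh mentioned above, a minimal sketch might look like the following. The function name `fence_primary` and the hostname in the usage line are hypothetical. As noted, this is the weaker approach: if the old primary cannot be reached, the function can only report failure, it cannot prove that Nagios has stopped.

```shell
#!/bin/bash
# Software fencing sketch (assumption: root ssh key access to the old primary).
# Not a substitute for a real stonith device.
SSH="${SSH:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

# Ask the given host to stop Nagios, then succeed only if no process
# named exactly "nagios" is left running there.
fence_primary() {
    local host="$1"
    $SSH "root@$host" 'service nagios stop; ! pgrep -x nagios >/dev/null'
}
```

A failover script could then refuse to start the secondary unless `fence_primary nagios-primary.example.com` returns 0. ssh returns 255 on a connection error, so an unreachable primary is treated as "not fenced".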
mp4783
Posts: 116
Joined: Wed May 14, 2014 11:11 am

Re: XI installation and failover design

Post by mp4783 »

Agreed on all points. If it weren't for the bandwidth overhead, DRBD to a DR server with the Nagios collector/scheduler shut down and VIPs would be best.

Just carve out a separate device/filesystem for the Nagios installation, shut down the Nagios collector/scheduler, and turn on DRBD. In a failure, shift the VIP to the DR server and fire up the collector. In theory, you could do something very similar to this using MySQL and PostgreSQL replication, along with rsync. This might actually give you better control as you wouldn't be dealing with the blind "copying" of DRBD.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: XI installation and failover design

Post by abrist »

mp4783 wrote:In theory, you could do something very similar to this using MySQL and PostgreSQL replication, along with rsync
Most definitely. I think I covered this briefly in my presentation. You would want to offload what you can: mysql, postgres, and potential nfs shares for perfdata and mrtg configs/rrds (you could also offload mrtg if you do large quantities of bandwidth checks), and then rsync the rest:

Network share:

Code:

/usr/local/nagios/share/perfdata/*
/var/lib/mrtg/*
/etc/mrtg/*
Rsync:

Code:

/usr/local/nagios/*
/usr/local/nagiosxi/*
Offload:

Code:

/var/lib/pgsql/
/var/lib/mysql/
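The rsync portion could be sketched roughly as below. The hostname, the `sync_nagios_dirs` name, and the `DRY_RUN` switch are assumptions for illustration. Excluding `share/perfdata` is also an assumption: it only makes sense if that directory already lives on the network share on both nodes.

```shell
#!/bin/bash
# Push the two "Rsync" trees from the primary to the standby.
# SECONDARY is a hypothetical hostname.
SECONDARY="${SECONDARY:-nagios-secondary.example.com}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

sync_nagios_dirs() {
    local dir
    for dir in /usr/local/nagios /usr/local/nagiosxi; do
        # Trailing slashes copy directory contents. --delete removes files on
        # the standby that no longer exist on the primary. The exclude skips
        # perfdata, assumed here to be on the shared mount.
        set -- rsync -az --delete --exclude=/share/perfdata/ \
            "$dir/" "root@$SECONDARY:$dir/"
        if [ "$DRY_RUN" = "1" ]; then
            echo "would run: $*"
        else
            "$@"
        fi
    done
}
```

Run from cron on the primary, this keeps the standby's configuration close to current. The standby's Nagios process should stay stopped while it is only receiving copies.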
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

Re: XI installation and failover design

Post by gsl_ops_practice »

Hello all,

We are in a similar situation but I hope our dilemma is a bit easier to solve.

We have 2 geographically separate sites, each one has a NagiosXI instance and we have a decent WAN link between the two. Each NagiosXI instance is able to monitor the local and remote sites and the NagiosXI configuration is identical on both sites (I make a change on one, then backup/restore on the other to replicate my changes)

The issue we have is that only one of these NagiosXI servers can have its notifications enabled, or we end up with 2 emails for every alarm. Is there a way we can automatically enable notifications on the second XI host when the primary is deemed to be down?

Thanks,
Alex
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: XI installation and failover design

Post by jolson »

gsl_ops_practice,

This isn't a problem - in fact I just finished answering this in a ticket of mine:

Disable all active host and service checks and notifications on your secondary server:

Access the nagios.cfg file:

Code:

vi /usr/local/nagios/etc/nagios.cfg
Change:
execute_host_checks=1
execute_service_checks=1
enable_notifications=1


To:
execute_host_checks=0
execute_service_checks=0
enable_notifications=0


Restart Nagios:

Code:

service nagios restart
Scripted, you might come up with something like this:

Make failover.sh (run this script on a cronjob every 1-10 minutes):

Code:

vi /root/failover.sh

Code:

#!/bin/bash

# Check the primary host. primarycheck.sh is your custom check to determine
# whether Nagios-Primary is working properly. It is assumed to exit 0 if the
# primary is fine, and to exit with any other number if there are problems.
/root/primarycheck.sh
exitc=$?

cfg=/usr/local/nagios/etc/nagios.cfg

if [ "$exitc" -ne 0 ]; then
    # Activate the relevant config settings and restart Nagios
    /bin/sed -i 's/^execute_host_checks=0/execute_host_checks=1/' "$cfg"
    /bin/sed -i 's/^execute_service_checks=0/execute_service_checks=1/' "$cfg"
    /bin/sed -i 's/^enable_notifications=0/enable_notifications=1/' "$cfg"
    /etc/init.d/nagios restart
    exit 2
else
    # Do nothing
    exit 0
fi
Note that the above script is something I threw together for demonstration purposes - you'd likely want it to be a little more robust, but I think it gets the idea across. Your primary check could be as simple as a ping check (which of course could be risky) or as complicated as checking an email account for fresh nagios emails.
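To make the "ping check could be risky" point concrete, one way to reduce false failovers is to require several consecutive failed probes before declaring the primary down. The sketch below is one possible shape for `/root/primarycheck.sh`; the hostname, the attempt count, and the pause are assumptions, and `probe` can be swapped for a stronger test.

```shell
#!/bin/bash
# Hypothetical primary check: healthy if any one of several probes succeeds.
PRIMARY="${PRIMARY:-nagios-primary.example.com}"   # assumed hostname
ATTEMPTS="${ATTEMPTS:-3}"                          # probes before giving up
PAUSE="${PAUSE:-10}"                               # seconds between probes

# One probe: a single ping with a 2 second timeout (Linux iputils ping).
probe() {
    ping -c 1 -W 2 "$PRIMARY" >/dev/null 2>&1
}

# Return 0 as soon as one probe succeeds, 1 if all attempts fail.
primary_is_up() {
    local i
    for i in $(seq 1 "$ATTEMPTS"); do
        probe && return 0
        [ "$i" -lt "$ATTEMPTS" ] && sleep "$PAUSE"
    done
    return 1
}
```

Saved as a script with a final line that calls `primary_is_up`, it exits 0 when the primary answers and non-zero otherwise, which matches what failover.sh expects. A ping only shows that the host is reachable, not that Nagios itself is running, so this is still the weakest useful check.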
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.