Advice on creating Failover for XI

benhank
Posts: 1264
Joined: Tue Apr 12, 2011 12:29 pm

Post by benhank »

I edited this one. Don't lock it; I'm rephrasing it so it actually reflects my thoughts. I'll repost in a few.
Proudly running:
NagiosXI 5.4.12 2 node Prod Env 2500 hosts, 13,000 services
NagiosXI 5.5.7 (test env) 2500 hosts, 13,000 services
Nagios Log Server 2 node Prod Env 500 objects sending
Nagios Network Analyzer
Nagios Fusion
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Advice on creating Failover for XI

Post by ssax »

What about your retention.dat? That contains the latest comments/downtime info.

The rest looks OK, but there are likely other things that could be backed up as well. What about the audit.log?

What about the historical data in /usr/local/nagios/var/archives?

What about your scheduled backups/reports in /var/spool/cron/apache?

What about your MRTG configs and RRDs in /etc/mrtg, /etc/mrtg/conf.d, /var/lib/mrtg?

What about your performance data?
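
A minimal sketch of how the paths listed above could be folded into a nightly sync, assuming rsync over ssh to the secondary server named elsewhere in this thread. The retention.dat and performance data locations are assumed defaults that the thread does not state, the function name is hypothetical, and audit.log is left out because its location is not given. The function only prints the commands so they can be reviewed before anything is copied.

```shell
#!/bin/sh
# Secondary host name as used elsewhere in this thread; override via environment.
SECONDARY="${SECONDARY:-lkendrwatsonp01}"

# Paths from the reply above. The first and last entries are ASSUMED default
# locations (retention.dat and performance data); verify them on your system.
SYNC_PATHS="
/usr/local/nagios/var/retention.dat
/usr/local/nagios/var/archives
/var/spool/cron/apache
/etc/mrtg
/var/lib/mrtg
/usr/local/nagios/share/perfdata
"

# Print one rsync command per path. --relative recreates the full source path
# under the destination root, so /etc/mrtg lands in /etc/mrtg on the secondary.
build_sync_commands() {
  for path in $SYNC_PATHS; do
    printf 'rsync -a --relative %s %s:/\n' "$path" "$SECONDARY"
  done
}

build_sync_commands
```

Piping the output to `sh` would run the copies; /etc/mrtg/conf.d is covered because it sits under /etc/mrtg.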
benhank

Re: Advice on creating Failover for XI

Post by benhank »

Mr. Sax, those are easily answered questions ----> dang if I know...
But seriously, up until this point we were OK with the secondary server just being able to take over monitoring and send notifications until we stopped screaming at each other and blaming each other for whatever caused the primary to go down, and got it back up and running.
But things have changed, and you have given me a very good idea of what I missed when I created that script.
That said:
I have 4 servers that are going to be used to create an automatic failover between my production and secondary Nagios servers. The attached file contains a script that runs nightly and syncs data between my production and secondary servers. I figured, why not take the sync a step further and try to create a high-availability solution?

Here are the servers (we use a Sherlock Holmes themed naming convention)
Lkensherlockp01 ---- Primary Nagios server with its MySQL db offloaded
Lkenfusionp01 ---- Server that contains the offloaded MySQL db (not used as a fusion server)
Lkendrwatsonp01 ----- Secondary Nagios Prod server
LkenCIAp01 ---- Nagios Core VM that will monitor the 3 servers listed above and will execute scripts and event handlers
Before proceeding, it is important to know that lkendrwatsonp01 has the nagios and sendmail services in a stopped state.
If we need it to go to work, we manually re-enable those two services.

So here we go:
LkenCIAp01 will check the following services on Lkensherlockp01:
Ping
mysqld
postgresql
nagios
nagiosxi
httpd
ndo2db
sendmail

I will create an individual script for each service (7 in total, not counting ping).
Each script will be associated with an event handler.
Each script will do the following: if the service being checked enters a hard down state, it will attempt to restart the service. If the service doesn't start, a command is sent via ssh (we have set up passwordless ssh) to start nagios and sendmail on lkendrwatsonp01.
The only exception is ping: if lkensherlockp01 can't be pinged, the command to start nagios and sendmail on lkendrwatsonp01 is sent directly.
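
The event handler logic described above could be sketched roughly as follows. This is only an illustration under several assumptions: the restart on the primary is issued over ssh from the monitoring VM, each checked name maps to a systemd unit of the same name, and the function names and the DRY_RUN switch are hypothetical. Only the host names come from the thread.

```shell
#!/bin/sh
# Host names are taken from the thread; everything else here is illustrative.
PRIMARY="${PRIMARY:-lkensherlockp01}"
SECONDARY="${SECONDARY:-lkendrwatsonp01}"

# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Act only on a hard problem state. Nagios reports CRITICAL for services
# and DOWN for hosts, so both are accepted here.
should_act() {
  [ "$2" = "HARD" ] || return 1
  [ "$1" = "CRITICAL" ] || [ "$1" = "DOWN" ]
}

# Start monitoring and mail on the secondary.
fail_over() {
  run ssh "$SECONDARY" "systemctl start nagios sendmail"
}

# Usage: handle_event <service|ping> <state> <state_type>
handle_event() {
  should_act "$2" "$3" || return 0
  if [ "$1" = "ping" ]; then
    # Primary unreachable: skip the restart attempt and fail over.
    fail_over
    return
  fi
  # ASSUMPTION: the restart is issued from the monitoring VM over ssh.
  if ! run ssh "$PRIMARY" "systemctl restart $1 && systemctl is-active --quiet $1"; then
    fail_over
  fi
}

# When called by Nagios, the three values arrive as arguments.
if [ "$#" -eq 3 ]; then
  handle_event "$1" "$2" "$3"
fi
```

In a Nagios command definition the three arguments would typically be filled from the $SERVICEDESC$, $SERVICESTATE$ and $SERVICESTATETYPE$ macros, which assumes the service description matches the unit name. Setting DRY_RUN=0 makes the script execute the commands instead of printing them.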

It can’t be done right now, but in the near future I plan on putting both Prod servers behind a VIP so the failover will be transparent to the end user.
And that is my big idea! What do you guys think? (after you stop laughing, of course)
ssax

Re: Advice on creating Failover for XI

Post by ssax »

Is your secondary going to use the same DB as prod?

If so, then make sure you do this on the current primary before failing over:
- NOTE: I would probably disable the current cron jobs in /etc/cron.d/nagiosxi first so that they don't try to run, then run these commands:

Code:

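# Stop the perfdata daemon, the monitoring engine, and the NDO daemon
systemctl stop npcd
systemctl stop nagios
systemctl stop ndo2db
# Kill any processes still running as the nagios user
pkill -9 -u nagios
# Remove any leftover IPC message queues belonging to nagios
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done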
The reason you do this: if the primary fails, the failover takes over, and two instances of NDO2DB end up talking to the same database, you can get multiple instance IDs. That can cause some really weird display effects and requires a manual cleanup (manual SQL queries).
- NOTE: This is generally why I tell people writing their own custom failover solution to do this on a test system first and make sure it works really, really well (I'd probably just look at using a DRBD setup if it was me), because we won't continue fixing problems that are directly caused by a broken custom failover process.
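
For a test system, the cron step and the stop sequence above could be wrapped into one reviewable function. This is a sketch, not a supported procedure: moving the cron file out of /etc/cron.d is just one assumed way to disable the jobs, and the function names, the DRY_RUN switch, and the /root destination are hypothetical.

```shell
#!/bin/sh
CRON_FILE="${CRON_FILE:-/etc/cron.d/nagiosxi}"
# ASSUMPTION: any location outside /etc/cron.d works as a parking spot.
CRON_PARKED="${CRON_PARKED:-/root/nagiosxi.cron.disabled}"

# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Read `ipcs -q` output on stdin and print the queue ID (second column)
# of every line that mentions nagios.
nagios_queue_ids() {
  awk '/nagios/ {print $2}'
}

quiesce_primary() {
  # Disable the XI cron jobs first so nothing restarts the daemons.
  run mv "$CRON_FILE" "$CRON_PARKED"
  for svc in npcd nagios ndo2db; do
    run systemctl stop "$svc"
  done
  run pkill -9 -u nagios
  # Remove leftover message queues, if ipcs is available and finds any.
  for id in $(ipcs -q 2>/dev/null | nagios_queue_ids); do
    run ipcrm -q "$id"
  done
}
```

Calling `quiesce_primary` as-is prints the planned commands; `DRY_RUN=0 quiesce_primary` as root would execute them.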
benhank

Re: Advice on creating Failover for XI

Post by benhank »

Thanks for the advice, man. And no, both servers have their own DB.
ssax

Re: Advice on creating Failover for XI

Post by ssax »

OK, cool, that's definitely easier then. Let me know if you have any questions.