Advice on creating Failover for XI

benhank · Post by **benhank** » Mon Mar 25, 2019 3:54 pm

I edited this one. Please don't lock it; I'm rephrasing it so it actually reflects my thoughts, and I'll repost in a few.

ssax · Post by **ssax** » Tue Mar 26, 2019 1:45 pm

What about your retention.dat? That contains the latest comments/downtime info.

The rest looks ok but there's likely other things that could be backed up as well, what about the audit.log?

What about the historical data in /usr/local/nagios/var/archives?

What about your scheduled backups/reports in /var/spool/cron/apache?

What about your MRTG configs and RRDs in /etc/mrtg, /etc/mrtg/conf.d, /var/lib/mrtg?

What about your performance data?
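
As a rough sketch of how those extra paths could be folded into a nightly sync: the snippet below only prints one rsync command per path unless `DRY_RUN=0` is set. The archives, cron and MRTG paths are the ones listed above. The `retention.dat` and perfdata locations, the `root@` login and the function names are assumptions for illustration, so check them against your own install. The audit.log is left out because its path isn't given here.

```shell
#!/bin/sh
# Sketch only: prints one rsync command per path; set DRY_RUN=0 to run them.
SECONDARY="${SECONDARY:-lkendrwatsonp01}"   # secondary host named in this thread
DRY_RUN="${DRY_RUN:-1}"

# Paths named in the post. The first and last entries (retention.dat and
# perfdata) are ASSUMED default locations, not taken from the thread.
SYNC_PATHS="
/usr/local/nagios/var/retention.dat
/usr/local/nagios/var/archives
/var/spool/cron/apache
/etc/mrtg
/var/lib/mrtg
/usr/local/nagios/share/perfdata
"

sync_path() {
  # -a keeps owners, modes and times; -R recreates the full path under / on the target
  if [ "$DRY_RUN" = "1" ]; then
    echo "rsync -aR $1 root@${SECONDARY}:/"
  else
    rsync -aR "$1" "root@${SECONDARY}:/"
  fi
}

sync_all() {
  for p in $SYNC_PATHS; do
    sync_path "$p"
  done
}
```

Calling `sync_all` with the default `DRY_RUN=1` lets you review the commands before anything is copied. `/etc/mrtg/conf.d` is covered by the `/etc/mrtg` entry.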

benhank · Post by **benhank** » Tue Mar 26, 2019 3:35 pm

mr sax, those are easily answered questions. ----> dang if I know...
But seriously, up until this point we were OK with the secondary server just being able to take over monitoring and sending notifications until we stopped screaming at each other and blaming each other for whatever caused the primary to go down, and got it back up and running.
But things have changed, and you have given me a very good idea of what I missed when I created that script.
That said:
I have 4 servers that are going to be used to create an automatic failover between my production and secondary Nagios servers. The attached file contains a script that runs nightly and syncs data between my production and secondary server. I figured why not take the sync a step further and try to create a high availability solution.

Here are the servers (we use a Sherlock Holmes themed naming convention)
Lkensherlockp01 ---- Primary Nagios server with its MySQL db offloaded
Lkenfusionp01 ---- Server that contains the offloaded MySQL db (not used as a fusion server)
Lkendrwatsonp01 ----- Secondary Nagios Prod server
LkenCIAp01 ---- Nagios Core VM that will monitor the 3 servers listed above and will execute scripts and event handlers
Before proceeding, it is important to know that lkendrwatson has the nagios and sendmail services in a stopped state.
If we need it to go to work, we manually re-enable those two services.

So here we go:
LkenCIAp01 will check the following services on Lkensherlockp01:
Ping
mysqld
postgresql
nagios
nagiosxi
httpd
ndo2db
sendmail

I will create an individual script for each service (7 in total).
The scripts will be associated with an event handler.
Each script will do the following: if the service being checked enters a hard down state, the script will attempt to restart the service. If the service doesn't start, a command is sent via ssh (we have set up passwordless ssh) to start Nagios and Sendmail on lkendrwatson.
The only exception is ping. If lkensherlock can't be pinged, the command to start Nagios and sendmail on lkendrwatsonp01 will be sent.
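
A minimal sketch of what one of those handlers might look like, assuming it is called with the standard `$SERVICESTATE$` and `$SERVICESTATETYPE$` macros plus a service name. The function names, the `root@` login, the use of `systemctl` on both boxes and restarting the service on the primary over ssh are all assumptions, not something settled in this post. With the default `DRY_RUN=1` it only prints the ssh commands it would run.

```shell
#!/bin/sh
# Sketch of a single event handler, meant to run on the Nagios Core box.
# Hypothetical command definition:
#   command_line /path/to/handler.sh $SERVICESTATE$ $SERVICESTATETYPE$ httpd
PRIMARY="${PRIMARY:-lkensherlockp01}"
SECONDARY="${SECONDARY:-lkendrwatsonp01}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Only a hard CRITICAL state should trigger any action.
decide_action() {
  if [ "$1" = "CRITICAL" ] && [ "$2" = "HARD" ]; then
    echo "recover"
  else
    echo "ignore"
  fi
}

# Start monitoring and mail on the secondary.
failover() {
  run ssh "root@${SECONDARY}" "systemctl start nagios sendmail"
}

handle_event() {
  state="$1"; state_type="$2"; service="$3"
  [ "$(decide_action "$state" "$state_type")" = "recover" ] || return 0
  if [ "$service" = "ping" ]; then
    failover   # primary unreachable, so a remote restart cannot work
    return
  fi
  # Try a restart on the primary first; fail over only if it does not come up.
  if ! run ssh "root@${PRIMARY}" "systemctl restart $service && systemctl is-active --quiet $service"; then
    failover
  fi
}
```

A real handler would end with something like `handle_event "$1" "$2" "$3"`. Keeping the decision in `decide_action` means soft states and recoveries fall through without touching either server.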

It can’t be done right now, but in the near future I plan on putting both Prod servers behind a VIP so the failover will be transparent to the end user.
And that is my big idea! What do you guys think? (after you stop laughing, of course)

ssax · Post by **ssax** » Tue Mar 26, 2019 3:52 pm

Is your secondary going to use the same DB as prod?

If so, make sure you do this on the current primary before failing over:
- NOTE: I would probably disable the current cron jobs in /etc/cron.d/nagiosxi so that it doesn't try to run them first, then run these commands:

Code:

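# stop the monitoring services
systemctl stop npcd
systemctl stop nagios
systemctl stop ndo2db
# kill anything still running as the nagios user
pkill -9 -u nagios
# remove leftover nagios message queues
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done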

The reason is that if the primary fails, the failover takes over, and two instances of NDO2DB end up talking to the same database, you can get multiple instance IDs. That causes some really weird display effects and requires a manual cleanup (manual SQL queries).
- NOTE: This is generally why I tell people writing their own custom failover solution to try it on a test system first and make sure it works really, really well (I'd probably just look at using a DRBD setup if it were me), because we won't continue fixing problems that are directly caused by a broken custom failover process.
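
If you script this, a check along the following lines could confirm the cleanup actually left nothing behind before the secondary is started. It is only a sketch: the function names are made up, and it assumes the Linux `ipcs -q` layout where the owner is the third column.

```shell
#!/bin/sh
# Sketch: verify the old primary is quiet before the secondary takes over.

# Count message queues owned by "nagios" in `ipcs -q` output read from stdin.
# Assumes the Linux column order: key, msqid, owner, perms, used-bytes, messages.
count_nagios_queues() {
  awk '$3 == "nagios" { n++ } END { print n + 0 }'
}

# Return 0 only when no nagios-owned processes or message queues remain.
primary_is_clean() {
  if pgrep -u nagios >/dev/null 2>&1; then
    echo "nagios-owned processes still running" >&2
    return 1
  fi
  queues="$(ipcs -q | count_nagios_queues)"
  if [ "$queues" -ne 0 ]; then
    echo "$queues nagios message queue(s) still present" >&2
    return 1
  fi
  return 0
}
```

Something like `primary_is_clean && echo "safe to fail over"` would then gate the failover step.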

benhank · Post by **benhank** » Wed Mar 27, 2019 8:23 am

Thanks for the advice, man. And no, both servers have their own db.

ssax · Post by **ssax** » Wed Mar 27, 2019 10:23 am

Ok, cool, that's definitely easier then, let me know if you have any questions.
