Migrating checks to "live" servers

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
rlacasse
Posts: 38
Joined: Tue Jun 12, 2012 12:59 pm

Migrating checks to "live" servers

Post by rlacasse »

Trying to find an elegant way to simplify tracking the live and non-live servers in a redundant pair.

Almost everything else I have uses hostgroups to set the services for any given machine type. With paired-up servers (a live/backup pair), we use a floating IP (with a DNS alias) to indicate which one is live. Our previous home-grown system had a concept of "only on host": you could set up a bunch of checks, identical on both systems, but only the system whose live alias matched the "only on host" setting would execute the check; the other would not.

The only way I've really found to do this in Nagios is to set up two hostgroups, one for live checks and the other for backup checks. Trouble is, when an automatic failover occurs on the servers, Nagios alarms like crazy until you realize what's happened, log in to the Core Config, and swap the hostgroups; not ideal. It would be great if this were automatic somehow.

One alternative I've considered is to set up all the common checks in one hostgroup, the live checks in another, and the backup checks in a third. Then you could set up both hosts with the common checks, add a host for the live alias and another for the backup alias. When the IP moves to a new system, the "live" checks follow. What I don't like about this solution is that you can't see all the services for the live host without looking at both the host with the common checks and the host with the live-alias checks.

In a previous post I'd asked if a host-specific override was possible for hosts in a hostgroup. It wasn't at the time; has that changed?

Any suggestions or alternatives you've seen before?

Thank you for your time,
Regards
mp4783
Posts: 116
Joined: Wed May 14, 2014 11:11 am

Re: Migrating checks to "live" servers

Post by mp4783 »

With respect to our friends and colleagues at Nagios, this is an area of real deficiency for Nagios. Most solutions are home grown or follow the few steps recommended in the Nagios documentation.

Firstly I'm assuming you have two servers between which you distribute checks so that the load is roughly balanced between the two. The following links describe technologies that might help:

This looks the most promising:

https://labs.consol.de/nagios/mod-gearman

http://dnx.sourceforge.net/about.html

I've not implemented either of the above, but given the size of our installation, we're going to need to consider this. An alternative to the solutions above would be to write a utility that allows the servers to actively communicate with each other and then decide which server will take which check. Conceptually, it would be something like the following:

- A process maintains common base configurations (all hosts, services, contacts, etc.) on all hosts
- An algorithm decides where to distribute the load
- The utility then activates those hosts/services/etc., distributing them evenly amongst the servers (and consequently deactivating objects a given server should not be monitoring)
- In the event that a Nagios XI server drops out of the "cluster", the algorithm redistributes the checks amongst the remaining hosts
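The distribution step above could be sketched like this. This is only an illustration, not anything Nagios ships: the server and host names are placeholders, and the "algorithm" is just a hash of the host name modulo the number of servers still in the cluster.

```shell
#!/bin/sh
# Sketch only: assign each monitored host to one of the Nagios servers in
# the "cluster" by hashing its name. All names here are placeholders.
SERVERS="nagios01 nagios02"            # servers currently in the cluster
NUM=$(echo "$SERVERS" | wc -w)

owner_of() {
    # Sum the byte values of the host name, then take it modulo the
    # number of servers to pick an owner deterministically.
    sum=0
    for c in $(printf '%s' "$1" | od -An -tu1); do
        sum=$((sum + c))
    done
    idx=$(( (sum % NUM) + 1 ))
    echo "$SERVERS" | cut -d' ' -f"$idx"
}

# If a server drops out of the cluster, shrink SERVERS/NUM and re-run;
# every host gets reassigned amongst the remaining servers.
for host in web01 web02 db01; do
    echo "$host -> $(owner_of "$host")"
done
```

The utility would then activate each host's objects only on the server that `owner_of` names for it.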

There is an external command pipeline that allows you to modify configuration parameters within the running Nagios XI server. However, these changes are not permanent and are not recorded to external configuration files or the databases. They would, however, allow you to modify the server configuration on the fly and then modify a set of "updated" configuration files on the back end.
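For instance, active checks can be flipped at runtime by writing standard external commands to the command file. The commands below are the stock ENABLE/DISABLE ones; the default command-file path is the standard Core location, and your install may differ.

```shell
#!/bin/sh
# Toggle active service checks for a host through the Nagios external
# command file. DISABLE_HOST_SVC_CHECKS / ENABLE_HOST_SVC_CHECKS are
# standard external commands; CMDFILE defaults to the stock path.
CMDFILE=${CMDFILE:-/usr/local/nagios/var/rw/nagios.cmd}

toggle_host_checks() {   # usage: toggle_host_checks <host> enable|disable
    now=$(date +%s)
    case $2 in
        enable)  printf '[%s] ENABLE_HOST_SVC_CHECKS;%s\n'  "$now" "$1" ;;
        disable) printf '[%s] DISABLE_HOST_SVC_CHECKS;%s\n' "$now" "$1" ;;
    esac >> "$CMDFILE"
}

# e.g. after a failover is detected (hypothetical host names):
# toggle_host_checks sniff-old-live disable
# toggle_host_checks sniff-new-live enable
```

As noted above, anything written this way only affects the running daemon; the on-disk configs would still need updating separately.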

To be sure, the preceding wouldn't be trivial, but I see no reason why it wouldn't work just fine.

One final thought would be to convert everything to passive checks and then let your load balancer distribute the inbound messages.
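If you went the passive route, each node would push its results in rather than being polled. Submitting one result looks roughly like this, using the standard PROCESS_SERVICE_CHECK_RESULT external command; the host and service names are placeholders.

```shell
#!/bin/sh
# Sketch: submit a passive service check result through the Nagios
# external command file. Host and service names are placeholders.
CMDFILE=${CMDFILE:-/usr/local/nagios/var/rw/nagios.cmd}

submit_result() {  # usage: submit_result <host> <service> <return-code> <output>
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4" >> "$CMDFILE"
}

# e.g. submit_result sniff-live "SNIF Process" 0 "OK - 1 process running"
```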
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Migrating checks to "live" servers

Post by abrist »

There are many, many ways to bake this cake. One of the easiest approaches, instead of altering configs (since all checks are reported to the virtual IP), is to just run a cron job on the backup server that checks the primary Nagios server. These checks could be as simple as up/down checks, or as complex as making sure all the proper services are running on the primary server. When the conditions are met, the cron job on the backup server starts all the necessary services to bring up Nagios in a failover situation.
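A rough sketch of such a cron job, with the primary's host name as a placeholder and only the simplest up/down test; a real deployment would check much more than ping before declaring the primary dead:

```shell
#!/bin/sh
# Cron sketch for the backup Nagios box: if the primary stops responding,
# bring up the local Nagios services. All names here are placeholders.
PRIMARY=${PRIMARY:-nagios-primary.example.com}

primary_alive() {
    # Simplest possible up/down test; could instead verify that the
    # nagios process and web UI on the primary are healthy.
    ping -c 1 -W 2 "$PRIMARY" >/dev/null 2>&1
}

failover_if_needed() {
    if primary_alive; then
        :  # primary is fine; stay passive
    else
        service nagios start
        service httpd start
    fi
}

# crontab entry, e.g.:  * * * * * /usr/local/bin/nagios-failover.sh
```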

Am I missing something? Why do you need to mess about with multiple hosts and hostgroups?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
rlacasse
Posts: 38
Joined: Tue Jun 12, 2012 12:59 pm

Re: Migrating checks to "live" servers

Post by rlacasse »

Thanks for the feedback, but my issue isn't the scalability of Nagios or a backup of the Nagios server.

The primary/backup pair is for another product being monitored by Nagios. I'm trying to figure out how to best monitor this other product when it does its own failover, without having to reconfigure Nagios every time it switches live servers.

Thank you,
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Migrating checks to "live" servers

Post by tmcdonald »

OH. Totally different.

Round-robin DNS is a simple way to handle this. The DNS record should resolve to whichever system is live, so from the Nagios point of view nothing changes. Just keep in mind these caveats:

http://stackoverflow.com/questions/5319 ... -a-records
http://serverfault.com/questions/60553/ ... ecommended
Former Nagios employee
rlacasse
Posts: 38
Joined: Tue Jun 12, 2012 12:59 pm

Re: Migrating checks to "live" servers

Post by rlacasse »

Thanks, but this isn't a DNS issue. I have 4 servers: 1 live, 3 backups. They have 30 common checks, the same on every system, but only certain processes should be running on the live one. As such, the live host has a process check ensuring the count is 1, while the backups have a similar process check ensuring the count is 0.
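For reference, with the stock check_procs plugin those two checks might be defined something like this ("sniffd" is a placeholder for the actual process name):

```
# Live host: exactly one sniffd process must be running.
define command {
    command_name  check_sniffd_live
    command_line  $USER1$/check_procs -C sniffd -c 1:1
}

# Backup hosts: zero sniffd processes allowed.
define command {
    command_name  check_sniffd_backup
    command_line  $USER1$/check_procs -C sniffd -c 0:0
}
```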

The 4 hosts belong to either "SNIF-live" or "SNIF-backup". Currently, when the SNIF software does a failover, it requires us to log in to the Nagios Core Config, go to the host that was live, switch its hostgroup from "SNIF-live" to "SNIF-backup", and then, for the newly live host, the reverse: "SNIF-backup" to "SNIF-live". They are always part of the "SNIF-common" hostgroup.

It's this manual process I'm trying to avoid by automating it somehow.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Migrating checks to "live" servers

Post by abrist »

rlacasse wrote:Thanks but this isn't a DNS issue.
In what way?
If the live server always resolves to the same hostname (or IP), just set up the live checks against that hostname, and likewise for the backup. This way, when the virtual IP switches over, the live checks will still run against the new live box. You could then use the non-virtual IPs for all your common checks, relying on the virtual IP only for live/backup checks.
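As a sketch in Nagios object config (hypothetical names and addresses; sniff-live is the DNS alias for the floating IP, sniff-a/sniff-b are the physical boxes):

```
# Floating alias: live-only checks attach here and follow the failover.
define host {
    use        generic-host
    host_name  sniff-live
    address    sniff-live.example.com   ; DNS alias for the floating IP
    hostgroups SNIF-live
}

# Physical boxes: common checks attach to these.
define host {
    use        generic-host
    host_name  sniff-a
    address    10.0.0.11
    hostgroups SNIF-common
}

define host {
    use        generic-host
    host_name  sniff-b
    address    10.0.0.12
    hostgroups SNIF-common
}
```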
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Migrating checks to "live" servers

Post by tmcdonald »

To address your question directly, we don't have a script to automatically do the hostgroup switching you are asking for. It would require some custom coding and DB manipulation. We can certainly do that for you, but it would be custom development and something you would need to talk to Sales about.
Former Nagios employee
rlacasse
Posts: 38
Joined: Tue Jun 12, 2012 12:59 pm

Re: Migrating checks to "live" servers

Post by rlacasse »

Thanks for all your input. Looks like I'll have to stick to what I'm currently doing.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Migrating checks to "live" servers

Post by jdalrymple »

This doesn't sound like something that would be particularly difficult to handle with an event handler. It seems like it would be fairly trivial to swap out a static config file and reload Nagios. Does that sound like an unworkable solution?
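A minimal sketch of such an event handler, assuming a service exists that detects the SNIF failover; the file paths and the reload command are illustrative, not a stock XI layout:

```shell
#!/bin/sh
# Event-handler sketch: when the failover-detection service goes
# CRITICAL/HARD, swap in the alternate role config and reload Nagios.
# Paths below are illustrative; RELOAD is overridable for testing.
RELOAD=${RELOAD:-service nagios reload}

swap_config() {  # usage: swap_config <state> <statetype> <newcfg> <livecfg>
    if [ "$1" = "CRITICAL" ] && [ "$2" = "HARD" ]; then
        cp "$3" "$4" && $RELOAD
    fi
}

# Wired up from a service definition roughly like:
#   event_handler  swap-sniff-config!$SERVICESTATE$!$SERVICESTATETYPE$
# swap_config "$1" "$2" /usr/local/nagios/etc/failover/b-live.cfg \
#                       /usr/local/nagios/etc/sniff-roles.cfg
```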

Alternatively, and more properly, a custom check would be ideal. This is one of those scenarios where you're more concerned about the end-user experience (as you should be) than the individual components of the application suite.
Locked