I'm new to Nagios and Linux.
We have a requirement for failover clustering of our Nagios XI server.
I have seen a few documents, but they are a little confusing and fairly complicated.
Can anyone explain in simple terms how we can achieve this? In particular: shared storage, important config files, retention.dat, and the databases.
We have the latest Nagios XI on CentOS in our Dev environment.
Nagios XI Failover Clustering
Last edited by karthek on Mon Aug 17, 2015 3:36 am, edited 1 time in total.
"Machines don't make mistakes, we do."
Re: Nagios XI Failover Clustering
As far as failover clustering is concerned, the procedure looks like this:
Some key points to consider:
-Primary XI server must send a scheduled backup to the secondary server daily.
-Your second XI server will be restoring from the backup of the primary, which can be initiated manually or automatically (via event handlers or a script on a cron job).
-All of your agents must be accessible by both Nagios XI servers.
-All passive agents must be configured to send to both Nagios XI servers.
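Since the whole approach depends on the daily backup actually arriving, it can help to have the secondary verify that. The following is a minimal sketch, not an official Nagios XI tool: the directory `/store/backups/nagiosxi` and the 26-hour limit are assumptions, so point it at wherever your transfer really lands the files.

```python
import os
import time

# Assumption: directory on the secondary where the primary's backups arrive.
BACKUP_DIR = "/store/backups/nagiosxi"
# Assumption: daily schedule plus two hours of slack.
MAX_AGE_SECONDS = 26 * 60 * 60


def newest_backup(directory):
    """Return (path, mtime) of the most recently modified file, or None."""
    newest = None
    for entry in os.scandir(directory):
        if entry.is_file():
            mtime = entry.stat().st_mtime
            if newest is None or mtime > newest[1]:
                newest = (entry.path, mtime)
    return newest


def backup_is_fresh(directory, max_age=MAX_AGE_SECONDS, now=None):
    """True if the newest file in `directory` is at most `max_age` seconds old."""
    now = time.time() if now is None else now
    found = newest_backup(directory)
    return found is not None and (now - found[1]) <= max_age


if __name__ == "__main__":
    if os.path.isdir(BACKUP_DIR) and backup_is_fresh(BACKUP_DIR):
        print("OK: recent backup found")
    else:
        print("CRITICAL: no recent backup in", BACKUP_DIR)
```

Run from cron on the secondary, this only tells you whether a restore would be working from stale data; it does not perform the restore itself.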
Deployment steps for this type of failover:
(1) Deploy and set up the Primary XI Server.
(2) Configure the Primary XI Server (monitoring settings and so on).
(3) Deploy and set up the Secondary XI Server.
(4) Configure the "Scheduled Backup Component" on Primary XI Server. Send the backups to the secondary server via SSH or FTP (SSH is recommended).
(5) Add a host check for Primary XI Server on Secondary XI Server.
Options for setting up the secondary server:
1. Do not run Nagios on the secondary, and check the primary with a cron job. Start services only when the primary check fails.
2. Disable active and passive checks on the backup server and check the primary with a cron job. When the primary server is down, enable all checks on the secondary server.
3. Disable notifications on secondary (allowing all checks to still run). When the primary is down, an event handler should be run turning on notifications.
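For options 2 and 3, the switch on the secondary can be done through Nagios external commands, which are lines of the form `[timestamp] COMMAND` written to the command file. Below is a rough sketch of that idea. The command names are standard Nagios Core external commands, but the command file path is an assumption (the usual location on a source-based install), and the `ACTIVATE` grouping and function names are only for illustration.

```python
import time

# Assumption: command file location; check command_file in your nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"

# External commands that bring a standby into service.
# "checks" corresponds to option 2, "notifications" to option 3.
ACTIVATE = {
    "checks": [
        "START_EXECUTING_HOST_CHECKS",
        "START_EXECUTING_SVC_CHECKS",
        "START_ACCEPTING_PASSIVE_HOST_CHECKS",
        "START_ACCEPTING_PASSIVE_SVC_CHECKS",
    ],
    "notifications": ["ENABLE_NOTIFICATIONS"],
}


def format_commands(names, now=None):
    """Render command names as '[epoch] NAME' lines for the command file."""
    timestamp = int(time.time() if now is None else now)
    return ["[{}] {}\n".format(timestamp, name) for name in names]


def send_commands(names, command_file=COMMAND_FILE):
    """Write the rendered commands to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as handle:
        for line in format_commands(names):
            handle.write(line)
```

Calling `send_commands(ACTIVATE["notifications"])` on the secondary would turn notifications on. Option 1 is different: there Nagios is not running at all, so the failover step is starting the services rather than sending commands.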
If any of the above steps are confusing, please ask for clarification and I will do my best to explain it in full. Essentially you'll be taking a backup of your primary server and restoring it to the secondary server daily - this can be done through some basic scripting.
After that part is in place, we'll set up the secondary XI Server to check the primary, and if the primary is down the secondary server comes online.
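A cron-driven check of the primary should not fail over on a single missed check, or a brief network blip will bring the secondary online unnecessarily. The sketch below shows one way to handle that, under some assumptions: the primary is considered up if a TCP connection to its web port succeeds, and failover triggers after three consecutive failures. The port, the threshold, and the function names are illustrative and not part of Nagios XI.

```python
import socket


def primary_reachable(host, port=443, timeout=5.0):
    """Return True if a TCP connection to the primary can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def update_failures(previous_failures, check_ok, threshold=3):
    """Track consecutive failures across cron runs.

    Returns (new_failure_count, should_fail_over). The count resets to zero
    as soon as one check succeeds.
    """
    failures = 0 if check_ok else previous_failures + 1
    return failures, failures >= threshold
```

Each cron run would load the previous count from a small state file, call `update_failures(count, primary_reachable("primary.example.com"))`, save the new count, and act only when the second return value is true. The hostname here is a placeholder.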
Let me know. Thanks!
Re: Nagios XI Failover Clustering
Thanks jolson.
But our requirement here is very sensitive. We cannot accept the data loss that would occur with the method described.
We are planning to implement this in a very large environment, where we need a cluster service with the servers in active/passive mode behind a VIP.
How about a shared drive holding a DB accessible to both hosts, with the important config files regularly backed up or replicated?
I hope that makes sense. I've looked at Pacemaker and other options, but they are a little confusing.
Any ideas/suggestions will be helpful.
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: Nagios XI Failover Clustering
karthek wrote: I've seen pacemaker and other options but they are little confusing.
I hate to be the bearer of bad news, however if implementation is at all confusing then the project as a whole will fail. People have implemented Nagios Core in clustered configurations using various different methods, all of them very complicated. We can offer guidelines - namely the components to be kept in sync, however we cannot offer a recipe since it's not something we officially support.
If working with a product such as Pacemaker is beyond the skillset of your IT department, my recommendation would be to use VMware Fault Tolerance to provide the level of availability required. It is a very simple point-and-click way of getting the results that you're seeking. If that's not an option, then I suggest you look into DRBD. Just be aware that without very strong knowledge of clustered computer systems you will be setting yourself up for much greater problems than the data loss involved with the traditional failover methods Jesse described above.
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
Re: Nagios XI Failover Clustering
jdalrymple wrote: If working with a product such as pacemaker is beyond the skillset of your IT department, my recommendation would be to use VMware Fault Tolerance
It's not that we do not have expertise here, as I mentioned earlier we wanted to implement this in our Test env on our own.
I have seen this https://allmybase.com/2010/10/04/settin ... s-servers/ article that seems to be exactly as per our need, but we require your suggestions on this.
Is it supported by Nagios XI, and will this method create any problems?
jdalrymple
Re: Nagios XI Failover Clustering
The referenced solution is based on Nagios Core, not Nagios XI.
No, we cannot support a "roll your own" failover solution. We can support the underlying Nagios XI, but if your failover solution causes any issues, that is the point at which our support would have to stop. The documentation I shared describes our supported methods of HA.
The proper solution depends on your intended platform and your desired RTO/RPO. You indicated earlier that no data loss is acceptable, which tells me you need a clustered storage solution underlying your active/passive monitoring solution. Once you have that, we can assist you in offloading the databases and performance data so that you know your data is safely stored out of band.
Regarding the active/passive monitoring, once your shared storage is in place you can use something similar to the referenced solution to create an event handler that will bring your secondary instance (covered under the Nagios XI licensing scheme) online. That portion is trivial.
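To illustrate the event handler idea, here is a minimal sketch of the decision part only. It assumes the handler command on the secondary passes the standard macros `$HOSTSTATE$` and `$HOSTSTATETYPE$` for the primary's host check as the first two arguments. What "activate" and "deactivate" actually do is left open, and the function name is hypothetical.

```python
import sys


def decide_action(host_state, state_type):
    """Map the primary's host state to a failover action.

    Only HARD states are acted on, so soft retries do not trigger a switch.
    Returns "activate", "deactivate", or None for no action.
    """
    if state_type != "HARD":
        return None
    if host_state == "DOWN":
        return "activate"
    if host_state == "UP":
        return "deactivate"
    return None  # e.g. UNREACHABLE: the primary itself may be fine


if __name__ == "__main__" and len(sys.argv) >= 3:
    action = decide_action(sys.argv[1], sys.argv[2])
    print(action or "no action")
```

Ignoring UNREACHABLE is a deliberate choice in this sketch: that state usually means a network path problem, and activating the secondary on it risks both servers running at once.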
Re: Nagios XI Failover Clustering
Thank you for the information.
We'll set up the cluster and let you know how it goes.
You may lock this thread.