Distributed Setup - multiple centers

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
cscholz
Posts: 36
Joined: Wed Mar 07, 2012 4:40 pm

Distributed Setup - multiple centers

Post by cscholz »

Thanks so much in advance for reading all these words and offering advice!

We have multiple data centers we'd like to monitor. Here's what we have right now:
  • ONE Nagios XI VM monitoring our essential systems across ALL data centers
  • ONE Nagios XI VM monitoring our core network across ALL data centers
  • ONE Nagios Core VM (Linux load average of 25-30+!) monitoring all CPEs across the entire network, with T1s groomed to all our various data centers, plus MPLS circuits.
All of these VMs are being replaced with Dell R410s and R610s distributed across the data centers. What we need is a good way to monitor systems and networking at each data center, with redundancy, while still being able to monitor everything centrally.

My current idea is to have a physical Nagios server at each location and tie them all together with the Nagios Fusion product.

Question 1: Do I really need to separate the networks and systems servers, or can they share a Nagios installation? I feel like an R410 or R610 with sufficient CPU and RAM should be able to handle our core network and systems hosts, especially if we're installing one at EACH colo facility. The CPE monitoring system must have its own Nagios boxes, since it uses heavily modified code and won't play nice with Nagios XI, and we can't even consider bringing all three functions onto one box per colo.

Concept:
Colo A:
Dell R410 - Nagios XI, monitors systems and networks living at Colo A
Dell Rx10 - Nagios Core, monitors client CPE networks (T1, tunnel, or MPLS) groomed to Colo A

Colo B:
Dell R410 - Nagios XI, monitors systems and networks living at Colo B
Dell Rx10 - Nagios Core, monitors client CPE networks (T1, tunnel, or MPLS) groomed to Colo B (This colo is MOSTLY MPLS and not physical T1s. Unlikely to stop growing in size)

Colo C:
Dell R410 - Nagios XI, monitors systems and networks living at Colo C
Dell R610 - Nagios Core, monitoring MPLS only (no point to point T1s live here) *AND* hosting Nagios Fusion to tie all other sites together.

On top of this monitoring, DNX would be implemented on specially provisioned VMs as needed. The idea is that if a box gets bogged down, we spin up some DNX worker VMs and let them start taking on some of the check load.
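
For context, my understanding is that DNX hooks into Nagios as an event broker module on the master, with a small client daemon on each worker VM, so workers can come and go without touching the object configuration. A rough sketch of the master side, assuming a standard source install (the module path and file name below are my assumptions, not taken from the DNX docs):

    # nagios.cfg on the DNX master (paths assumed for a source install)
    # Make all event broker data available to modules, then load the DNX
    # server module so queued checks can be handed off to registered workers.
    event_broker_options=-1
    broker_module=/usr/local/nagios/lib/dnxServer.so

    # Each worker VM then runs the DNX client daemon pointed at the master;
    # adding capacity is just a matter of turning up another worker VM.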

Each Dell R410 or R610 will use four 10k 600 GB drives in RAID 10, dual CPUs, and at least 8 GB of RAM. I'm looking for critique and any necessary redesign of this scheme, as this is my first time building out a distributed Nagios implementation. Is the hardware adequate? Is more CPU prudent? Will disk I/O be an issue, and would it therefore make sense to use 15k drives?

Additionally, and this is very important: is there any way to use this setup to handle failover? Clearly we can monitor each Nagios box from the other Nagios boxes, and Nagios Fusion itself will tell us when a box isn't responding, but is there any way to make sure that, say, Colo A's Nagios kicks in if Colo B's Nagios dies? *OR*, can we set up two identical Nagios servers and fail over to the hot spare if the primary dies?

I appreciate any and all help. Diagrams would be immensely useful from those of you who have already implemented similar systems. :mrgreen:

Thank you all SO MUCH!
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Distributed Setup - multiple centers

Post by scottwilkerson »

Hopefully some others who have set up something similar can share some options.

If you haven't seen it yet you should take a look at our Nagios XI High Availability Options document.

Also, disk I/O will definitely come into play in large environments; you should also take a look at
http://assets.nagios.com/downloads/nagi ... p#boosting
cscholz
Posts: 36
Joined: Wed Mar 07, 2012 4:40 pm

Re: Distributed Setup - multiple centers

Post by cscholz »

Any comment on the layout? I will post a diagram too.

Each large box is a data center. Within each data center we'll have two Nagios servers: one for systems/core network and one for customer CPE. All will communicate with DNX VMs, provisioned as needed, and we'll monitor all instances with Nagios Fusion. Does this make sense as a solution?
gwakem
Posts: 238
Joined: Mon Jan 23, 2012 2:02 pm
Location: Asheville, NC

Re: Distributed Setup - multiple centers

Post by gwakem »

I've done this at a previous position. The way we had it set up then: we had roughly 20 data centers and one dedicated (active) Nagios box monitoring the infrastructure of each, out to the ISP's core network. All active systems connected back to HQ through a VPN tunnel, where they would send all received data via NSCA to a passive-only server. The passive server handled all alerting, downtime, charting of perfdata, etc., which freed up the active Nagios servers to strictly process checks. Each active server was a Dell 1U rack-mount dual-core box with 4 GB of RAM and 10k drives (and this was four to five years ago), but they were only running Nagios Core on CentOS at runlevel 3. The passive server was where people ran reports, charted historical perfdata, and so on.
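
The plumbing for that is the standard OCSP + send_nsca pattern. Something along these lines, with the paths, helper script name, and central host name as placeholders rather than our exact configs:

    # nagios.cfg on each active poller: obsess over every service result
    obsess_over_services=1
    ocsp_command=submit_check_result_via_nsca

    # commands.cfg on each active poller
    define command{
        command_name    submit_check_result_via_nsca
        command_line    /usr/local/nagios/libexec/eventhandlers/submit_via_nsca.sh "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATEID$" "$SERVICEOUTPUT$"
        }

    # /usr/local/nagios/libexec/eventhandlers/submit_via_nsca.sh (hypothetical helper)
    #!/bin/sh
    # Pipe "host<TAB>service<TAB>state<TAB>output" to the central passive server,
    # where the nsca daemon drops it into Nagios's external command file.
    printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$3" "$4" | \
        /usr/local/nagios/bin/send_nsca -H hq-passive.example.com \
        -c /usr/local/nagios/etc/send_nsca.cfg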

The hardware requirements depend entirely on the number of checks you run and how much processing those checks require. For example, I have some vCenter checks that like to eat 70-100% of my CPU for roughly 4-8 seconds, and when 20-30 of them hit at once, things can get nasty. If you're doing SNMP checks with simple return values, though, no problem, and even better if you're not processing perfdata. Disk I/O is also driven by the number of checks: it will go up when you're running several thousand checks against several hundred hosts, but again, whether that becomes an issue depends on the checks you're running and whether you're writing out RRDs on the local box. You can't go wrong with high-speed disks if you're planning a large installation, and RAID arrays over InfiniBand or Fibre Channel are even better.
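
If you want a quick read on whether a given box is keeping up before throwing faster disks at it, nagiostats and iostat will tell you a lot (source-install paths assumed):

    # Scheduler health: watch "Active Service Latency" and
    # "Active Service Execution Time" climb as the box falls behind.
    /usr/local/nagios/bin/nagiostats -c /usr/local/nagios/etc/nagios.cfg

    # Disk side: extended stats every 5 seconds (needs the sysstat package).
    # Sustained high %util/await points at the spindles rather than the CPU.
    iostat -x 5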

You can absolutely put monitoring of the network and the servers on one box. This is actually better for charting parent/child relationships at the local data center level, because the local box can handle the dependency processing and then pass the results up to the passive parent. If you split network and servers onto separate boxes, the passive parent can still chart the parent/child relationships, but it's entirely dependent on when the alerts come in. You can't force a recheck, and being passive, it can't perform the parent/child dependency logic for anything other than what has already come in, so some alerts may slip through.
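
As a concrete example (hypothetical names), the parent/child logic lives in the local host definitions, which is why keeping it on the box at the colo works so well:

    define host{
        use         linux-server
        host_name   colo-a-web01
        address     10.1.20.11         ; example address
        parents     colo-a-core-sw     ; the switch this box sits behind
        }

    # If colo-a-core-sw goes DOWN, colo-a-web01 is flagged UNREACHABLE
    # instead of DOWN and its notifications are suppressed, so you don't
    # get paged for every host sitting behind the dead switch.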

How we handled failover: we used Linux-HA (Heartbeat). That way, when one server "lost" the other, it automatically assumed the primary IP, started the services, and carried on. There was a blip, since checks that were in progress got lost at the active server level, but at the passive level it didn't matter: the time before received results were marked stale was set to about 10 minutes, and the active checks ran every five, so even in the worst case we wouldn't get false alarms. For the passive server, we located the database on its own box, so if the passive parent went belly up the data wasn't lost. RRDs were on an NFS mount over Fibre Channel, and data was synced between the NFS server and its backup. Similarly, MySQL replicated to its spare.
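
If it helps, the rough shape of both halves of that is below (Heartbeat v1-style configs with hypothetical node names and IPs; the real setup also needs /etc/ha.d/authkeys and a matching init script, so treat this strictly as a sketch):

    # /etc/ha.d/ha.cf (identical on both Nagios nodes)
    # 2-second heartbeats over eth0; declare the peer dead after 30 seconds
    keepalive 2
    deadtime 30
    bcast eth0
    auto_failback on
    node nagios-a-primary
    node nagios-a-standby

    # /etc/ha.d/haresources (identical on both nodes):
    # preferred node, floating service IP, init script(s) started on takeover
    nagios-a-primary 10.1.20.5 nagios

    # On the passive parent, freshness checking is what catches a poller that
    # silently stops sending results (roughly the 10-minute window above):
    define service{
        use                     generic-service
        host_name               colo-a-web01
        service_description     HTTP
        active_checks_enabled   0
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     600
        check_command           service-is-stale   ; hypothetical command that flags the stale result
        }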

Again, there are probably better ways to do this now, with easier-to-use tools (I did all of this by hand from the command line, so you have no idea how grateful I am for Nagios XI's interface), but the basic logic behind how this was set up should still be sound.
--
Griffin Wakem
cscholz
Posts: 36
Joined: Wed Mar 07, 2012 4:40 pm

Re: Distributed Setup - multiple centers

Post by cscholz »

10,000 RPM drives in RAID 5 or RAID 10 not fast enough? Do you think there will be a disk I/O problem? We *do* intend to turn on perfdata once the VMs are out of the picture. If we need to, we can use 15k drives. We were planning on 600 GB 10k drives in RAID 10, and all servers have a minimum of 8 GB of RAM.

The thought of one Nagios server monitoring 20 data centers makes me shudder, as I can see firsthand the strain that 5,000 service checks on one box is causing now. Each CPE is a Cisco router and we're monitoring via SNMP. The VM is hitting 1-minute load averages of 25 during the main checks.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Distributed Setup - multiple centers

Post by mguthrie »

Are you using MRTG (the switch and router wizard) to monitor all of your switches?

Also, we discovered in the last week that DNX has an issue with the current version of Nagios Core (3.4.1) where some checks can get frozen in the event queue and don't get run on time (KevinD and gwakem discovered this). We're working on a solution at the moment, but if you're looking to start down this path very soon, I might also consider Mod Gearman as another way to distribute checks. We don't have a doc for Mod Gearman yet, but all of the packages for it are available in the yum repos, and it has a little more flexibility as to how checks are distributed.
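
To give a rough idea of the moving parts: Mod Gearman loads as an event broker module on the Nagios side, and lightweight workers pull jobs from a gearmand queue. The paths, key, and host names below are placeholders (the packaged configs will differ), so treat this as a sketch:

    # nagios.cfg on the Nagios server
    broker_module=/usr/local/share/mod_gearman/mod_gearman.o config=/etc/mod_gearman/module.conf

    # /etc/mod_gearman/worker.conf on each worker box/VM
    server=gearmand.example.com:4730
    encryption=yes
    key=change_this_shared_key
    hosts=yes
    services=yes
    eventhandler=no
    # Optionally pin a worker to one site so it only runs that colo's checks:
    # hostgroups=colo-a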

To help with disk I/O, you can make use of a RAM disk; just make sure to test it in a dev environment first and follow the instructions very closely. It's a killer tweak when it's all working correctly, but if it's set up wrong it has ugly side effects. I'm able to run 50 checks per second on a standard desktop 7,200 RPM IDE drive without any latency once the relevant files are moved to a RAM disk.
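
The general shape of the tweak looks like this; treat it as a sketch and follow the official doc for the exact steps, since everything on the RAM disk disappears on reboot and should only ever hold transient files (the mount point below is an arbitrary choice):

    # /etc/fstab - a small tmpfs mount for Nagios's high-churn files
    tmpfs  /var/nagiosramdisk  tmpfs  size=256m,mode=0775  0 0

    # After "mount /var/nagiosramdisk" (and again on every boot, e.g. from rc.local):
    #   chown nagios:nagios /var/nagiosramdisk
    #   mkdir /var/nagiosramdisk/checkresults && chown nagios:nagios /var/nagiosramdisk/checkresults

    # nagios.cfg - point the transient paths at the RAM disk
    check_result_path=/var/nagiosramdisk/checkresults
    temp_path=/var/nagiosramdisk
    temp_file=/var/nagiosramdisk/nagios.tmp
    status_file=/var/nagiosramdisk/status.dat
    object_cache_file=/var/nagiosramdisk/objects.cache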

Also, if you have a second drive mounted just for the /usr/local/nagios/share/perfdata directory, that'll take a huge burden off of the main drive. I know of another user who needed redundant perfdata and used a shared network drive for this (if I remember correctly).
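
If you go that route, it's just a separate filesystem mounted over the perfdata directory. A sketch, assuming the extra disk shows up as /dev/sdb1 and a stock perfdata path:

    # One-time: stop Nagios and perfdata processing first, then copy the
    # existing RRDs onto the new disk before mounting it over the directory.
    #   mkfs.ext4 /dev/sdb1
    #   mount /dev/sdb1 /mnt && cp -a /usr/local/nagios/share/perfdata/. /mnt/ && umount /mnt

    # /etc/fstab
    /dev/sdb1  /usr/local/nagios/share/perfdata  ext4  defaults,noatime  0 2

    # After mounting, make sure ownership survived the copy:
    #   chown -R nagios:nagios /usr/local/nagios/share/perfdata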