RE: [Nagios-devel] Load Balancing and Redundancy with Nagios


Guys - before you read this message, think about this analogy. If you're
running a ferry business and you find your boat can't handle all your
customers' reasonable needs, do you just keep operating that way and let
the competition come in? Or do you look for the source of the problem and
find ways to adjust the boat so you can improve your operation and
attract more customers than before? I'm not suggesting we give up the boat
we're using. I'm suggesting we look at ways of adapting it to better
meet the ever-growing needs of Nagios's users. If we ever expect Nagios to
be truly Enterprise Class, we need to look beyond where we are now. If we
expect Nagios to become and remain a monitoring leader, we need to look
ahead as well. Ethan and the rest of the people here - you have all done a
great job making NetSaint/Nagios into the wonderful monitoring tool it is
today. As with any tool like it, Nagios can be improved.

Granted, I don't have all the answers, nor do I claim that mine are the
best. I'm simply asking that we all look seriously at the new design
paradigm and compare it against the old. Is it doable? Is it better? Is
there something we can learn from it? Should we throw it away? Only by
looking at it fairly can we honestly answer these questions. Otherwise, we
risk throwing the baby out with the bath water.

First, this is important to me not only because my company has a lot of
monitoring happening (800+ hosts) but also because it's important to my team
that our monitoring tools react quickly to problems without adding
significant load to the WAN. As we continue to strive to raise the bar in
my group on our service availability goals (approaching 99.9997% uptime),
monitoring becomes more and more critical to being proactive rather than
reactive. The concepts I'm proposing strive to achieve a number of goals:

1) Reduce the number of non-required tests done while everything is working
(decreases load on the customer servers).
2) Reduce the dependency on an individual monitoring host or subnet to be
available.
3) Reduce the dependency on an individual monitoring host to be able to
perform notifications.
4) Provide automatic rebalancing if one of the monitoring hosts on the
network goes down or comes (back) on-line.
5) Provide a common point/method of configuration for all the monitoring
with minimal added complexity.
6) Make restoration of a completely destroyed monitoring host as simple as
restoring a backup of the monitoring software, tools and configuration
(without modification) on a new server configured with the same IP address
as the old host.
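
One way goals 1 and 4 might be met is to have every monitor compute the same host-to-monitor assignment independently from the list of monitors currently alive. The sketch below uses rendezvous (highest-random-weight) hashing for that. This technique is an assumption for illustration, not something the proposal specifies, and the names `assign`, `_score`, `mon-a` and so on are hypothetical.

```python
import hashlib


def _score(monitor: str, host: str) -> int:
    """Deterministic pseudo-random weight for a (monitor, host) pair."""
    digest = hashlib.sha256(f"{monitor}|{host}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assign(hosts, live_monitors):
    """Map each monitored host to exactly one live monitor.

    Every monitor that knows the same list of live monitors computes
    the same mapping, so no master is needed to hand out work.
    """
    if not live_monitors:
        raise ValueError("no live monitors")
    return {h: max(live_monitors, key=lambda m: _score(m, h)) for h in hosts}


if __name__ == "__main__":
    hosts = [f"host{i}" for i in range(800)]          # hypothetical host names
    monitors = ["mon-a", "mon-b", "mon-c", "mon-d"]   # hypothetical monitors

    before = assign(hosts, monitors)
    # mon-c goes down: recompute with the remaining monitors.
    after = assign(hosts, [m for m in monitors if m != "mon-c"])

    moved = [h for h in hosts if before[h] != after[h]]
    print(f"{len(moved)} of {len(hosts)} hosts changed monitor")
```

With this scheme only the hosts that were assigned to the failed monitor move, and they move back when it returns, so each host is still tested from one place at a time.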

The issue I have with the previous suggestions is that they rely on a
master/slave relationship. It also means that multiple testing points are
not utilized in a balanced fashion: either every monitor tests everything,
producing a high bandwidth and load requirement on the network and the
customer servers, or some hosts are not tested at all from some monitoring
hosts/locations. And what happens when the master or slave is destroyed
completely? Restoring a master node requires a different procedure than
restoring a slave.

Let's look at it a little differently. Let's say that I have four monitors
in four different physical locations. I have the networks tied together
using both my own equipment/lines and external routes for redundancy. The
monitors would talk to each other using both IP and multicast (to help
reduce bandwidth utilization).

In the initial setup, the monitors talk to each other to establish the
monitoring ring and determine a notification server. The first server to
come on-line that's able to perform the notification process is generally
set up as the notification server, though that function is always
dynamically determined. If that server goes down for any reason, or if that
server knows it's not able to complete notifications, it gives up its spot
as notification server and the remaining monitors renegotiate that position.
As new monitors come on-line, the notification monitor remains at the
current location until it can

...[email truncated]...
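
The notification-server negotiation described before the truncation can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's design: the `Monitor` class, its `joined_at` field and the `elect_notifier` function are hypothetical, and because the original message is cut off, the condition under which the role would move to a newly arrived monitor is not modelled. Here the incumbent simply keeps the role for as long as it is up and able to notify.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    name: str
    joined_at: float         # when this monitor came on-line (assumed field)
    up: bool = True          # is the monitor reachable?
    can_notify: bool = True  # can it currently complete notifications?


def elect_notifier(monitors, current=None):
    """Return the name of the notification server, or None if nobody qualifies.

    - The incumbent keeps the role while it is up and able to notify.
    - Otherwise the earliest-joined eligible monitor takes over
      (ties broken by name so every monitor reaches the same answer).
    """
    eligible = [m for m in monitors if m.up and m.can_notify]
    if any(m.name == current for m in eligible):
        return current
    if not eligible:
        return None
    return min(eligible, key=lambda m: (m.joined_at, m.name)).name


if __name__ == "__main__":
    a = Monitor("mon-a", joined_at=1.0)
    b = Monitor("mon-b", joined_at=2.0)
    notifier = elect_notifier([a, b])            # first eligible monitor wins
    a.can_notify = False                         # mon-a loses its pager link
    notifier = elect_notifier([a, b], notifier)  # role is renegotiated
    print(notifier)
```

Because the decision depends only on state every monitor can observe, each one can run the same function locally instead of asking a master.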


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: ark Gillett [mailto:mgillett@myrealbox.com