Redundancy or load balancing on log server

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
gormank
Posts: 1114
Joined: Tue Dec 02, 2014 12:00 pm

Re: Redundancy or load balancing on log server

Post by gormank »

Are there things to avoid doing, or known things that make this happen?
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Redundancy or load balancing on log server

Post by rkennedy »

I would run at least a 3-node cluster and change the following setting in your /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml file (you can open it with nano).

Code: Select all

# Set to ensure a node sees N other master eligible nodes to be considered
# operational within the cluster. Its recommended to set it to a higher value
# than 1 when running more than 2 nodes in the cluster.
#
# discovery.zen.minimum_master_nodes: 1
Uncomment this line and set it to 2. That way one node can drop off without trouble, but if ALL of them lose connectivity to each other, the cluster will (by design) stop operating until the connection between at least two of them is restored.
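
The value of 2 follows the usual majority rule for this setting: (number of master-eligible nodes / 2) + 1, rounded down. Here is a minimal Python sketch of that arithmetic, assuming every node in the cluster is master-eligible. The function names are made up for illustration and are not part of Nagios Log Server or Elasticsearch.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum for a cluster: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many nodes can drop off while a majority can still be formed."""
    return master_eligible - minimum_master_nodes(master_eligible)


if __name__ == "__main__":
    # Print: node count, quorum size, nodes that may fail.
    for n in (1, 2, 3, 4, 5):
        print(n, minimum_master_nodes(n), tolerated_failures(n))
```

With 3 nodes the majority is 2, so one node can be lost. With 2 nodes the majority is also 2, so a majority-based setting leaves no room for a node to drop off.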
Former Nagios Employee
gormank
Posts: 1114
Joined: Tue Dec 02, 2014 12:00 pm

Re: Redundancy or load balancing on log server

Post by gormank »

Crap.
We bought a 2-node setup. Interesting that to get reliability, we need to spend 50% more and add a node. I'll need to look for a mature product.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Redundancy or load balancing on log server

Post by mcapra »

gormank wrote:I'll need to look for a mature product.
Unless you're going with a managed solution like Loggly or Splunk Cloud (and paying the associated premiums), you will face the same hurdles.

In terms of what's in the wild for un-managed solutions, I'll reference Splunk Enterprise since I've worked with it before. You may notice in their load balancing documentation, one of the first things they do (besides configuring individual outputs, no need for that in NLS) is set up what is effectively RRDNS at the transport layer by using multiple A records. No out-of-the-box solution to be found there:
http://docs.splunk.com/Documentation/Fo ... dbalancing

There are also additional steps in there for load balancing within the storage layer specific to Splunk, which Elasticsearch takes care of on its own pretty effectively with no further hassle. There's a reason groups like StackExchange leverage Elasticsearch for their horizontal scaling needs: deployment is easy once you know your platform's requirements.

This article also references transport layer load balancing with Splunk's "universal forwarder" in place of traditional load balancing methods. That is essentially the same thing as Logstash-forwarder and, in the case presented by the Splunk docs, isn't more sophisticated than RRDNS at its core.

On the topic of requiring 3 nodes to mitigate split-brain, that's just the bare minimum a distributed system needs to reach proper consensus while still being able to lose a node and fail over. It's not exclusive to Elasticsearch.
Former Nagios employee
https://www.mcapra.com/
gormank
Posts: 1114
Joined: Tue Dec 02, 2014 12:00 pm

Re: Redundancy or load balancing on log server

Post by gormank »

Linux clusters manage the quorum issue (>50%) by using a filesystem to break the tie, so a 2 node cluster is possible at least in a Linux cluster.
Maybe I'm just not used to the real world and expecting more than I should.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Redundancy or load balancing on log server

Post by mcapra »

You're right about the file system, but that doesn't 100% account for proper shard allocation on the back end of Elasticsearch. You might be able to halt both Elasticsearch instances and restart to correct that, but I'm betting the cluster health will be red and a bunch of shards will need manual redirection.
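
To check whether the cluster ended up in that state, Elasticsearch's cluster health API (GET /_cluster/health) reports a status of green, yellow, or red along with an unassigned_shards count. Here is a rough Python sketch of reading that response, assuming you've already fetched and parsed the JSON. The function name is hypothetical and the sample dict is made-up data, not output from a real cluster.

```python
def summarize_health(health: dict) -> str:
    """Turn a parsed _cluster/health response into a one-line summary."""
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    if status == "red":
        # Red: at least one primary shard is not allocated.
        return f"RED: primary shards missing ({unassigned} unassigned shards)"
    if status == "yellow":
        # Yellow: all primaries are allocated, but some replicas are not.
        return f"YELLOW: replicas missing ({unassigned} unassigned shards)"
    if status == "green":
        return "GREEN: all primary and replica shards are allocated"
    return f"UNKNOWN status: {status!r}"


if __name__ == "__main__":
    # Made-up sample resembling a 2-node cluster after a bad restart.
    sample = {"status": "red", "number_of_nodes": 2, "unassigned_shards": 12}
    print(summarize_health(sample))
```

A red status with a nonzero unassigned count is the situation described above, where shards may need to be rerouted by hand.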
Former Nagios employee
https://www.mcapra.com/
gormank
Posts: 1114
Joined: Tue Dec 02, 2014 12:00 pm

Re: Redundancy or load balancing on log server

Post by gormank »

In reading about NLS I saw no statement that a 2 node system was not recommended. That would have been a nice thing since the BS to get another PO to upgrade is more painful than spending the money.
Assuming we go w/ the 2-node cluster for the 1st year, how likely is it that it will get fubar in that time? Best guess--I won't whine whether it fails or not... :)
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Redundancy or load balancing on log server

Post by tmcdonald »

I would say there is a distinction to be made between "not recommended" and "recommended against". We do not specifically recommend against a 2-node cluster, and in fact it's a great choice for some organizations. However, in general we recommend that more nodes be allocated to allow for that extra redundancy and reliability. Just like CPU, RAM, disk space, manpower, coffee, and that third serving of turkey next week, over-allocation relative to today's needs allows for more breathing room tomorrow.
Former Nagios employee
gormank
Posts: 1114
Joined: Tue Dec 02, 2014 12:00 pm

Re: Redundancy or load balancing on log server

Post by gormank »

Got a link to the recommendation?
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Redundancy or load balancing on log server

Post by tmcdonald »

Might not be directly stated, but on this page:

https://www.nagios.com/products/nagios-log-server/

The 4-instance option is the default that is highlighted, and the 2-instance option contains a usage note that says:
"Intended for production deployments that don't require our highest grade of redundancy (available with 4+ instances)"
Under the FAQ on that same page, the question "What is an instance? How many do I need?" is answered by:
"Nagios Log Server systems are based on a clustering model. Each server in the cluster is called an Instance. Adding an Instance to your Log Server cluster allows you to balance server load, create a redundant copy of log event data, and scale Log Server to meet your environment’s needs. Keep your data highly available and redundant with additional Nagios Log Server Instances. Each instance in the cluster shares in the workload of indexing and querying your data. A minimum of 2 instances is recommended to provide redundancy and resiliency."
(Emphasis mine)

So 2 is the minimum recommended, but the paragraph as a whole pushes additional instances (and not just as a sales opportunity).

I couldn't point you to anything that specifically says "Nagios Enterprises recommends 4 instances" or anything like that, since each organization is going to have different needs. If they are on a shoestring budget, we very well may recommend sticking to 2 instances, or even just 1 if they really lack a budget. Otherwise, we're going to scale things out as the situation merits.
Former Nagios employee