
Re: Redundancy or load balancing on log server

Posted: Thu Nov 17, 2016 8:16 pm
by gormank
Are there things to avoid doing, or known things that make this happen?

Re: Redundancy or load balancing on log server

Posted: Fri Nov 18, 2016 11:18 am
by rkennedy
I would run at least a 3-node cluster, and alter the following setting in /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml (e.g. with nano):

Code: Select all

# Set to ensure a node sees N other master eligible nodes to be considered
# operational within the cluster. It's recommended to set it to a higher value
# than 1 when running more than 2 nodes in the cluster.
#
# discovery.zen.minimum_master_nodes: 1
Set this to 2 so that one node can drop off without issue; if ALL of the nodes lose connectivity with each other, no isolated node will keep operating on its own until at least two of them can see each other again, which prevents split-brain.
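To see where the 2 comes from, here is a minimal Python sketch of the majority rule behind discovery.zen.minimum_master_nodes (the function name is mine, not part of Elasticsearch):

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Recommended discovery.zen.minimum_master_nodes value:
    a strict majority of the master-eligible nodes."""
    return master_eligible // 2 + 1

# For the 3-node cluster suggested above this yields 2: one node can
# drop off, but a lone isolated node cannot elect itself master.
```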

Re: Redundancy or load balancing on log server

Posted: Fri Nov 18, 2016 12:43 pm
by gormank
Crap.
We bought a 2-node setup. Interesting that to get reliability we need to spend 50% more and add a third node. I'll need to look for a mature product.

Re: Redundancy or load balancing on log server

Posted: Fri Nov 18, 2016 1:42 pm
by mcapra
gormank wrote: I'll need to look for a mature product.
Unless you're going with a managed solution like Loggly or Splunk Cloud (and paying the associated premiums), you will face the same hurdles.

In terms of what's in the wild for unmanaged solutions, I'll reference Splunk Enterprise since I've worked with it before. You may notice in their load balancing documentation that one of the first things they do (besides configuring individual outputs, which NLS doesn't need) is set up what is effectively RRDNS at the transport layer by using multiple A records. No out-of-the-box solution to be found there:
http://docs.splunk.com/Documentation/Fo ... dbalancing

There's also additional steps in there for load balancing within the storage layer specific to Splunk, which Elasticsearch takes care of on its own pretty effectively with no further hassle. There's a reason groups like StackExchange leverage Elasticsearch for their horizontal scaling needs: deployment is easy once you know your platform's requirements.

That article also covers transport-layer load balancing with Splunk's "universal forwarder" (essentially the counterpart of logstash-forwarder) in place of traditional load balancing methods, and in the case presented by the Splunk docs it isn't more sophisticated than RRDNS at its core.
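From a client's point of view, RRDNS just hands out the same set of A records in rotating order. A tiny Python sketch of that behavior (the addresses are hypothetical, standing in for whatever a name like your log server's DNS entry would resolve to):

```python
from itertools import cycle

# Hypothetical A records behind a single DNS name; real RRDNS
# rotates the record order returned on each lookup.
A_RECORDS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

def pick_targets(records, n):
    """Return the next n targets in round-robin order,
    mimicking what successive RRDNS lookups hand to clients."""
    rotation = cycle(records)
    return [next(rotation) for _ in range(n)]
```

That rotation is the whole trick: no health checks, no weighting, which is why it counts as load distribution rather than true load balancing.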

On the topic of requiring 3 nodes to mitigate split-brain: three is simply the bare minimum a distributed system needs to achieve proper consensus while still tolerating a failure. It's not exclusive to Elasticsearch.
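The arithmetic behind that claim is easy to check: with a strict-majority quorum, a cluster only tolerates as many failures as it has nodes beyond the quorum. A quick sketch (my helper, not an Elasticsearch API):

```python
def tolerated_failures(nodes: int) -> int:
    """Number of nodes that can fail while a strict
    majority (quorum) of the original cluster remains."""
    quorum = nodes // 2 + 1
    return nodes - quorum

# 2 nodes tolerate 0 failures; 3 is the smallest cluster that
# keeps quorum after losing a node. Note 4 nodes still only
# tolerate 1 failure, which is why odd sizes are preferred.
```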

Re: Redundancy or load balancing on log server

Posted: Fri Nov 18, 2016 2:08 pm
by gormank
Linux clusters manage the quorum issue (>50%) by using a shared filesystem (a quorum disk) to break the tie, so a 2-node cluster is possible, at least in a Linux cluster.
Maybe I'm just not used to the real world and expecting more than I should.

Re: Redundancy or load balancing on log server

Posted: Fri Nov 18, 2016 2:31 pm
by mcapra
You're right about the file system, but that doesn't fully account for proper shard allocation on the back end of Elasticsearch. You might be able to halt both Elasticsearch instances and restart to correct that, but I'm betting the cluster health will be red and a bunch of shards will need manual reallocation.
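After such a restart you'd check GET /_cluster/health on one of the nodes. A minimal sketch of triaging that response in Python; the sample document below is illustrative (its field names match the real health API, but the values are made up):

```python
def needs_attention(health: dict) -> bool:
    """Flag a cluster whose status is red or that is
    reporting unassigned shards after a restart."""
    return (health.get("status") == "red"
            or health.get("unassigned_shards", 0) > 0)

# Illustrative responses, not from a live cluster:
degraded = {"status": "red", "number_of_nodes": 2, "unassigned_shards": 14}
healthy = {"status": "green", "number_of_nodes": 2, "unassigned_shards": 0}
```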

Re: Redundancy or load balancing on log server

Posted: Fri Nov 18, 2016 2:44 pm
by gormank
In reading about NLS I saw no statement that a 2-node system was not recommended. That would have been a nice thing to know, since the hassle of getting another PO to upgrade is more painful than spending the money.
Assuming we go with the 2-node cluster for the first year, how likely is it that it will get fubar in that time? Best guess--I won't whine either way if it fails... :)

Re: Redundancy or load balancing on log server

Posted: Fri Nov 18, 2016 2:58 pm
by tmcdonald
I would say there is a distinction to be made between "not recommended" and "recommended against". We do not specifically recommend against a 2-node cluster, and in fact it's a great choice for some organizations. However, in general we recommend that more nodes be allocated to allow for that extra redundancy and reliability. Just like CPU, RAM, disk space, manpower, coffee, and that third serving of turkey next week, over-allocation relative to today's needs allows for more breathing room tomorrow.

Re: Redundancy or load balancing on log server

Posted: Fri Nov 18, 2016 3:04 pm
by gormank
Got a link to the recommendation?

Re: Redundancy or load balancing on log server

Posted: Fri Nov 18, 2016 3:15 pm
by tmcdonald
Might not be directly stated, but on this page:

https://www.nagios.com/products/nagios-log-server/

The 4-instance option is the default that is highlighted, and the 2-instance option carries a usage note that says:
Intended for production deployments that don't require our highest grade of redundancy (available with 4+ instances)
Under the FAQ on that same page, the question "What is an instance? How many do I need?" is answered by:
Nagios Log Server systems are based on a clustering model. Each server in the cluster is called an Instance. Adding an Instance to your Log Server cluster allows you to balance server load, create a redundant copy of log event data, and scale Log Server to meet your environment’s needs. Keep your data highly available and redundant with additional Nagios Log Server Instances. Each instance in the cluster shares in the workload of indexing and querying your data. A minimum of 2 instances is recommended to provide redundancy and resiliency.
(Emphasis mine)

So 2 is the minimum recommended, but the paragraph as a whole pushes additional instances (and not just as a Sales opportunity).

I couldn't point you to anything that specifically says "Nagios Enterprises recommends 4 instances" or anything like that, since each organization is going to have different needs. If they are on a shoestring budget, we very well may recommend sticking to 2 instances, or even just 1 if they really lack a budget. Otherwise, we're going to scale things out as the situation merits.