I am not sure what happened, but the amount of incoming logs dramatically decreased. I started investigating and noticed that the GUI was running really slowly, and commands run via the console were also really slow. When I finally checked the cluster status, it was red and reported only 1 node. Checking each node individually, they each report that they are the master, with only 1 node showing. I'm not really sure what happened, but I really need to get this cluster back up and running, and I want to make sure we bring it back without causing any split-brain issues. Please help, thanks.
Node 1
# curl 'localhost:9200/_cat/master?v'
id host ip node
meXRK6XITBO6x_Mfgju6yw node1 10.242.102.107 4521585a-88af-47c9-81e5-c4d13cffb148
# curl -XGET 'http://localhost:9200/_cluster/health?pretty'
{
"cluster_name" : "80e9022e-f73f-429e-8927-f23d0d88dfd2",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 56,
"active_shards" : 56,
"relocating_shards" : 0,
"initializing_shards" : 1,
"unassigned_shards" : 115
}
Node 2
# curl 'localhost:9200/_cat/master?v'
id host ip node
4Ctq93IFT3WHVqK3Mo5VeQ node2 10.242.102.108 30ab2b2c-439f-4bcc-977d-7c0e9a90f3a5
# curl -XGET 'http://localhost:9200/_cluster/health?pretty'
{
"cluster_name" : "80e9022e-f73f-429e-8927-f23d0d88dfd2",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 59,
"active_shards" : 59,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 113
}
Node 3
# curl 'localhost:9200/_cat/master?v'
id host ip node
TPLs_kcbQca8OeYwAxlANg node3 10.242.102.109 e63648a3-d912-4f5d-a867-1b99282a5e7c
# curl -XGET 'http://localhost:9200/_cluster/health?pretty'
{
"cluster_name" : "80e9022e-f73f-429e-8927-f23d0d88dfd2",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 58,
"active_shards" : 58,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 114
}
Status red with all nodes thinking it is the master
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: Status red with all nodes thinking it is the master
I have brought the nodes back online by gracefully rebooting them and disabling allocation until the nodes were back. The nodes are still balancing the shards, but it looks like it should be good. I am still trying to find a root cause, but it looks like it could be a memory issue: I am seeing OutOfMemoryError in the logs here and there.
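For anyone finding this thread later, the allocation toggle mentioned above is the standard cluster settings API; a rough sketch of the sequence (localhost:9200 assumed, same as the health checks above):

Code: Select all

```shell
# Payloads for pausing/resuming shard allocation during the rolling reboot
DISABLE='{"transient":{"cluster.routing.allocation.enable":"none"}}'
ENABLE='{"transient":{"cluster.routing.allocation.enable":"all"}}'

# Before rebooting a node: stop the cluster from reshuffling shards around
curl -XPUT 'http://localhost:9200/_cluster/settings' -d "$DISABLE"

# ...gracefully reboot the node and wait for it to rejoin the cluster...

# Once all nodes are back: let the shards balance out again
curl -XPUT 'http://localhost:9200/_cluster/settings' -d "$ENABLE"
```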
I've been tailing the elasticsearch logs and saw this come up... perhaps we are running out of resources:
Code: Select all
[2015-08-12 14:51:33,852][WARN ][index.translog ] [30ab2b2c-439f-4bcc-977d-7c0e9a90f3a5] [nagioslogserver][0] failed to flush shard on translog threshold
org.elasticsearch.index.engine.FlushFailedEngineException: [nagioslogserver][0] Flush failed
at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:868)
at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:609)
at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:201)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2941)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3122)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3089)
at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:858)
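A quick way to watch heap pressure alongside the logs is the _cat/nodes API, which reports per-node JVM heap (sketch; heap.percent column and local host assumed):

Code: Select all

```shell
# Print each node's heap usage and flag anything above 90%, the zone
# where OutOfMemoryErrors like the flush failure above start to appear
curl -s 'http://localhost:9200/_cat/nodes?h=host,heap.percent' |
  awk '$2+0 > 90 {print $1 " heap at " $2 "% - investigate"}'
```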
Re: Status red with all nodes thinking it is the master
CFT6Server,
Based on the fact that each node thought it was the master, it sounds like you encountered a split-brain scenario. This is something we need to avoid going forward, and to do that we'll set Elasticsearch's discovery.zen.minimum_master_nodes setting.
Access the primary elasticsearch config file:
Code: Select all
vi /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
Change:
#discovery.zen.minimum_master_nodes: 1
To:
discovery.zen.minimum_master_nodes: 2
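The value follows a quorum rule: (number of master-eligible nodes / 2) + 1, rounded down. With the three nodes in this cluster that works out to 2, which is why a lone node can no longer elect itself master:

Code: Select all

```yaml
# elasticsearch.yml, identical on all three nodes
# quorum for 3 master-eligible nodes: floor(3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```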
You can read more about this setting here: https://www.elastic.co/guide/en/elastic ... ster_nodes
If you're seeing out of memory errors on any node in your cluster, I highly recommend upping the amount of RAM in that node until you stop seeing said errors. This is the most common problem people have with Nagios Log Server - it can result in a lot of odd symptoms. My recommendation is to double the memory across all of your nodes (up to a maximum of 60GB per node) and check on how much free memory exists on your nodes after doing so. That will give you a good benchmark regarding how much memory is necessary per node.
For instance, if I have an 8GB node and free -m reports '312' free - I might double this node to 16GB of RAM. After I double the memory, free -m reports 2000 free - which means that I could likely reduce the amount of memory in the node by roughly 1GB or so and still have enough leftover to feel safe.
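To put numbers on that check, the relevant figures come straight from free -m (column layout differs slightly across procps versions, so treat this as a sketch):

Code: Select all

```shell
# Print free vs total memory in MB; per the guidance above, only a few
# hundred MB free means the node is likely starved, while ~1-2 GB free
# after an upgrade is a comfortable margin
free -m | awk '/^Mem:/ {printf "%d MB free of %d MB total\n", $4, $2}'
```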
Best,
Jesse