- Each server has 72 GB of memory
- 24 CPUs
- SSD disks (only 273.6 GB used)
Network communication between the four servers is fully open.
After the cluster has been running for some time, it starts reporting "timeout notification from cluster service" errors.
Stopping all nodes in the cluster and then starting them one by one fixes the problem temporarily, but after a while it recurs.
I have a similar NLS infrastructure that does not have this problem; the only difference between the two environments is that this cluster has more memory (72 GB).
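For a four-node ES 1.x cluster where every node is master-eligible, one setting worth checking first is the zen discovery quorum: with the default `discovery.zen.minimum_master_nodes` of 1, a brief network hiccup or GC pause can elect a second master and drop nodes. A sketch of the commonly recommended values for this topology (these are the standard (N/2)+1 guidance, not values taken from this cluster's actual config):

```yaml
# elasticsearch.yml -- sketch for a 4-node, all-master-eligible ES 1.x cluster.
# (N/2)+1 = 3 guards against split-brain; values are general guidance only.
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.timeout: 10s   # default is 3s; raise it if the network or GC is slow
```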
Code: Select all
[2017-05-17 14:59:35,308][DEBUG][action.bulk ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 14:59:36,048][DEBUG][action.bulk ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 14:59:40,555][DEBUG][action.bulk ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 14:59:41,737][INFO ][cluster.service ] [765cc658-3e5f-4923-804e-5eb57735f761] master {new [8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}}, removed {[8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c][bvE6hf73SQGc3DP05nyWHA][datalog-utb-log2.servicos][inet[/10.154.9.94:9300]]{max_local_storage_nodes=1},}, added {[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}])
[2017-05-17 14:59:42,272][INFO ][cluster.service ] [765cc658-3e5f-4923-804e-5eb57735f761] added {[8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c][bvE6hf73SQGc3DP05nyWHA][datalog-utb-log2.servicos][inet[/10.154.9.94:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}])
[2017-05-17 14:59:43,456][INFO ][indices.recovery ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 8026645 numDocs: 8026645 vs. true
[2017-05-17 14:59:44,065][INFO ][indices.recovery ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 5337923 numDocs: 5337923 vs. true
[2017-05-17 14:59:44,201][INFO ][indices.recovery ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 3267651 numDocs: 3267651 vs. true
[2017-05-17 14:59:44,638][INFO ][indices.recovery ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 3490976 numDocs: 3490976 vs. true
[2017-05-17 14:59:44,817][INFO ][indices.recovery ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 30589 numDocs: 30589 vs. true
[2017-05-17 14:59:44,878][INFO ][indices.recovery ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 7071508 numDocs: 7071508 vs. true
[2017-05-17 14:59:44,968][INFO ][indices.recovery ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 7309883 numDocs: 7309883 vs. true
[2017-05-17 14:59:45,042][INFO ][indices.recovery ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 14 numDocs: 14 vs. true
[2017-05-17 14:59:45,066][INFO ][indices.recovery ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 7267696 numDocs: 7267696 vs. true
[2017-05-17 15:47:58,117][DEBUG][action.bulk ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:58,118][DEBUG][action.bulk ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:58,699][DEBUG][action.bulk ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:58,699][DEBUG][action.bulk ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:58,705][DEBUG][action.bulk ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:58,705][DEBUG][action.bulk ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:59,115][DEBUG][action.bulk ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:59,116][DEBUG][action.bulk ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
How do I solve this problem?
The outages are causing me real problems, because every time I need to restart the environment I lose precious logs.
I monitor the infrastructure with the nagios-plugin-elasticsearch plugin, which reports the cluster as healthy.
https://github.com/anchor/nagios-plugin-elasticsearch
Code: Select all
[root@datalog-utb-log2 libexec]# /usr/bin/check_elasticsearch
Monitoring cluster 'a5726a09-769e-4f2b-be91-d786c8165c6f' | cluster_nodes=4;;;;; cluster_master_eligible_nodes=4;;;;; cluster_data_nodes=4;;;;; cluster_active_shards=112;;;;; cluster_relocating_shards=0;;;;; cluster_initialising_shards=0;;;;; cluster_unassigned_shards=0;;;;; cluster_total_shards=112;;;;; cluster_total_indices=12;;;;; cluster_closed_indices=0;;;;; storesize=145073960589B;;;;; documents=126724576c;;;;; index_ops=31256865c;;;;; index_time=12698519ms;;;;; flush_ops=742c;;;;; flush_time=75367ms;;;;; throttle_time=382464ms;;;;; index_ops=31256865c;;;;; index_time=12698519ms;;;;; delete_ops=0c;;;;; delete_time=0ms;;;;; get_ops=8c;;;;; get_time=11ms;;;;; exists_ops=8c;;;;; exists_time=11ms;;;;; missing_ops=0c;;;;; missing_time=0ms;;;;; query_ops=1886c;;;;; query_time=1744345ms;;;;; fetch_ops=139c;;;;; fetch_time=9113ms;;;;; merge_ops=16862c;;;;; refresh_ops=80026c;;;;; refresh_time=2888406ms;;;;; gc_old_count=2c;;;;; gc_young_count=8243c;;;;; heap_used=5%;;;;;
Code: Select all
[root@datalog-utb-log2 libexec]# curl -XGET 'localhost:9200/_nodes/jvm?pretty'
{
"cluster_name" : "a5726a09-769e-4f2b-be91-d786c8165c6f",
"nodes" : {
"dfP5HEcGRE6YKtN6A2t8bg" : {
"name" : "765cc658-3e5f-4923-804e-5eb57735f761",
"transport_address" : "inet[/10.154.9.93:9300]",
"host" : "datalog-utb-log1.servicos",
"ip" : "10.154.9.93",
"version" : "1.6.0",
"build" : "cdd3ac4",
"http_address" : "inet[localhost/127.0.0.1:9200]",
"attributes" : {
"max_local_storage_nodes" : "1"
},
"jvm" : {
"pid" : 1651,
"version" : "1.7.0_131",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "24.131-b00",
"vm_vendor" : "Oracle Corporation",
"start_time_in_millis" : 1494876174495,
"mem" : {
"heap_init_in_bytes" : 37913362432,
"heap_max_in_bytes" : 37757386752,
"non_heap_init_in_bytes" : 24313856,
"non_heap_max_in_bytes" : 224395264,
"direct_max_in_bytes" : 37757386752
},
"gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
"memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
}
},
"2CakXOyBTzC-c93gPAN-mg" : {
"name" : "5c998cfb-0460-4e56-8697-83b65c086a13",
"transport_address" : "inet[/10.154.3.100:9300]",
"host" : "datalog-ugt-log2.gtservicos",
"ip" : "10.154.3.100",
"version" : "1.6.0",
"build" : "cdd3ac4",
"http_address" : "inet[localhost/127.0.0.1:9200]",
"attributes" : {
"max_local_storage_nodes" : "1"
},
"jvm" : {
"pid" : 5136,
"version" : "1.7.0_131",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "24.131-b00",
"vm_vendor" : "Oracle Corporation",
"start_time_in_millis" : 1494876196111,
"mem" : {
"heap_init_in_bytes" : 37913362432,
"heap_max_in_bytes" : 37757386752,
"non_heap_init_in_bytes" : 24313856,
"non_heap_max_in_bytes" : 224395264,
"direct_max_in_bytes" : 37757386752
},
"gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
"memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
}
},
"bvE6hf73SQGc3DP05nyWHA" : {
"name" : "8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c",
"transport_address" : "inet[/10.154.9.94:9300]",
"host" : "datalog-utb-log2.servicos",
"ip" : "10.154.9.94",
"version" : "1.6.0",
"build" : "cdd3ac4",
"http_address" : "inet[localhost/127.0.0.1:9200]",
"attributes" : {
"max_local_storage_nodes" : "1"
},
"jvm" : {
"pid" : 13892,
"version" : "1.7.0_131",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "24.131-b00",
"vm_vendor" : "Oracle Corporation",
"start_time_in_millis" : 1494876148786,
"mem" : {
"heap_init_in_bytes" : 37913362432,
"heap_max_in_bytes" : 37757386752,
"non_heap_init_in_bytes" : 24313856,
"non_heap_max_in_bytes" : 224395264,
"direct_max_in_bytes" : 37757386752
},
"gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
"memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
}
},
"4Khu6bhxR0Cx3S0VID-9KQ" : {
"name" : "8471b9e1-1a82-4c3d-98bc-03f2ce871369",
"transport_address" : "inet[/10.154.3.99:9300]",
"host" : "datalog-ugt-log1.gtservicos",
"ip" : "10.154.3.99",
"version" : "1.6.0",
"build" : "cdd3ac4",
"http_address" : "inet[localhost/127.0.0.1:9200]",
"attributes" : {
"max_local_storage_nodes" : "1"
},
"jvm" : {
"pid" : 25718,
"version" : "1.7.0_131",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "24.131-b00",
"vm_vendor" : "Oracle Corporation",
"start_time_in_millis" : 1494876176820,
"mem" : {
"heap_init_in_bytes" : 37913362432,
"heap_max_in_bytes" : 37757386752,
"non_heap_init_in_bytes" : 24313856,
"non_heap_max_in_bytes" : 224395264,
"direct_max_in_bytes" : 37757386752
},
"gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
"memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
}
}
}
}
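Note that each node reports `heap_max_in_bytes` of 37,757,386,752 (about 35 GB), i.e. half of the 72 GB of RAM. Heaps above roughly 32 GB disable compressed object pointers on HotSpot, which increases GC pressure; the usual rule of thumb for Elasticsearch is min(half the RAM, ~31 GB). A minimal sketch of that calculation (the 72 GB figure comes from this environment; the 31 GB cap is general guidance, not something from these logs):

```shell
# Rule-of-thumb Elasticsearch heap sizing: min(half of RAM, 31 GB),
# staying under the ~32 GB compressed-oops threshold.
ram_gb=72                            # physical memory per node (this cluster)
half=$(( ram_gb / 2 ))               # 36 GB: roughly what these nodes run today
cap=31                               # keep compressed object pointers
heap=$(( half < cap ? half : cap ))
echo "ES_HEAP_SIZE=${heap}g"         # -> ES_HEAP_SIZE=31g
```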
[root@datalog-utb-log2 libexec]# curl -XGET 'localhost:9200/_cluster/health?pretty'
{
"cluster_name" : "a5726a09-769e-4f2b-be91-d786c8165c6f",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 4,
"number_of_data_nodes" : 4,
"active_primary_shards" : 56,
"active_shards" : 104,
"relocating_shards" : 0,
"initializing_shards" : 7,
"unassigned_shards" : 1,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0
}
Code: Select all
[root@datalog-utb-log2 libexec]# tail /var/log/elasticsearch/a5726a09-769e-4f2b-be91-d786c8165c6f.log
[2017-05-17 15:07:17,170][WARN ][transport.netty ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] exception caught on transport layer [[id: 0x5e7a6df0, /10.154.9.72:41630 => /10.154.9.94:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format, got (16,3,1,0)
at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:63)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2017-05-17 15:07:17,171][WARN ][transport.netty ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] exception caught on transport layer [[id: 0x5e7a6df0, /10.154.9.72:41630 :> /10.154.9.94:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format, got (16,3,1,0)
at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:63)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:482)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.channelDisconnected(FrameDecoder.java:365)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:102)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireChannelDisconnected(Channels.java:396)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:360)
at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:81)
at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:36)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:574)
at org.elasticsearch.common.netty.channel.Channels.close(Channels.java:812)
at org.elasticsearch.common.netty.channel.AbstractChannel.close(AbstractChannel.java:206)
at org.elasticsearch.transport.netty.NettyTransport.exceptionCaught(NettyTransport.java:638)
at org.elasticsearch.transport.netty.MessageChannelHandler.exceptionCaught(MessageChannelHandler.java:239)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:377)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireExceptionCaught(Channels.java:525)
at org.elasticsearch.common.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:48)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.notifyHandlerException(DefaultChannelPipeline.java:658)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:566)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2017-05-17 17:11:00,428][INFO ][discovery.zen ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] master_left [[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2017-05-17 17:11:00,429][WARN ][discovery.zen ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] master left (reason = transport disconnected), current nodes: {[765cc658-3e5f-4923-804e-5eb57735f761][dfP5HEcGRE6YKtN6A2t8bg][datalog-utb-log1.servicos][inet[/10.154.9.93:9300]]{max_local_storage_nodes=1},[5c998cfb-0460-4e56-8697-83b65c086a13][2CakXOyBTzC-c93gPAN-mg][datalog-ugt-log2.gtservicos][inet[/10.154.3.100:9300]]{max_local_storage_nodes=1},[8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c][bvE6hf73SQGc3DP05nyWHA][datalog-utb-log2.servicos][inet[/10.154.9.94:9300]]{max_local_storage_nodes=1},}
[2017-05-17 17:11:00,429][INFO ][cluster.service ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] removed {[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1})
[2017-05-17 17:11:01,172][DEBUG][action.admin.cluster.state] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] no known master node, scheduling a retry
[2017-05-17 17:11:15,857][DEBUG][action.admin.cluster.health] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] no known master node, scheduling a retry
[2017-05-17 17:11:24,463][INFO ][cluster.service ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] detected_master [8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}, added {[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}])
[root@datalog-utb-log2 libexec]#
One or more indexes are missing replica shards. Use -vv to list them. Index 'logstash-2017.05.14' replica down on shard 0 Index 'logstash-2017.05.10' replica down on shard 0 Index 'logstash-2017.05.10' replica down on shard 1 Index 'logstash-2017.05.10' replica down on shard 2 | cluster_nodes=4;;;;; cluster_master_eligible_nodes=4;;;;; cluster_data_nodes=4;;;;; cluster_active_shards=108;;;;; cluster_relocating_shards=0;;;;; cluster_initialising_shards=4;;;;; cluster_unassigned_shards=0;;;;; cluster_total_shards=112;;;;; cluster_total_indices=12;;;;; cluster_closed_indices=0;;;;; storesize=112737631630B;;;;; documents=99309932c;;;;; index_ops=31340845c;;;;; index_time=12734028ms;;;;; flush_ops=742c;;;;; flush_time=75367ms;;;;; throttle_time=16241ms;;;;; index_ops=31340845c;;;;; index_time=12734028ms;;;;; delete_ops=0c;;;;; delete_time=0ms;;;;; get_ops=8c;;;;; get_time=11ms;;;;; exists_ops=8c;;;;; exists_time=11ms;;;;; missing_ops=0c;;;;; missing_time=0ms;;;;; query_ops=1886c;;;;; query_time=1744345ms;;;;; fetch_ops=139c;;;;; fetch_time=9113ms;;;;; merge_ops=16887c;;;;; refresh_ops=80150c;;;;; refresh_time=2892490ms;;;;; gc_old_count=2c;;;;; gc_young_count=8264c;;;;; heap_used=4%;;;;;
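On a live node, `curl 'localhost:9200/_cat/shards?v'` shows per-shard state, and filtering out STARTED rows isolates exactly the replicas the check above is flagging. A small sketch of that filter over sample lines (the shard listing below is illustrative, mirroring the indices named in the check output; it is not a live capture):

```shell
# Filter non-STARTED shards from _cat/shards-style output
# (columns: index shard prirep state ...). Sample data is illustrative.
cat <<'EOF' | awk '$4 != "STARTED" { print $1, "shard", $2, $4 }'
logstash-2017.05.14 0 r INITIALIZING
logstash-2017.05.10 0 r INITIALIZING
logstash-2017.05.10 1 r INITIALIZING
logstash-2017.05.10 2 r INITIALIZING
logstash-2017.05.17 0 p STARTED
EOF
```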
Some time later the cluster returns to "status" : "green".
However, after about two or three days of running, the cluster crashes and I have to restart all services.
References:
https://discuss.elastic.co/t/elasticsea ... vice/32904