
Elastic cluster going into a bad state

Posted: Mon Sep 18, 2017 11:08 am
by Jklre
About three times in the past week, our Nagios Log Server cluster has gone into a bad state: the web interface stops responding, and we need to restart the Elasticsearch service to bring it back to life.

We are running Nagios Log Server 1.4.4.

We have two nodes with 4 CPUs and 8 GB of memory each, averaging 600,000 messages per 24 hours.
(attachments: CPU.png, memory.png)
Here's a snippet of the Elasticsearch logs:

Code:

[2017-09-18 08:58:01,706][DEBUG][action.search.type       ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] All shards failed for phase: [query_fetch]
org.elasticsearch.action.NoShardAvailableActionException: [nagioslogserver][0] null
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:160)
	at org.elasticsearch.action.search.type.TransportSearchQueryAndFetchAction.doExecute(TransportSearchQueryAndFetchAction.java:57)
	at org.elasticsearch.action.search.type.TransportSearchQueryAndFetchAction.doExecute(TransportSearchQueryAndFetchAction.java:47)
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
	at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:104)
	at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
	at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:98)
	at org.elasticsearch.client.FilterClient.execute(FilterClient.java:66)
	at org.elasticsearch.rest.BaseRestHandler$HeadersAndContextCopyClient.execute(BaseRestHandler.java:92)
	at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:338)
	at org.elasticsearch.rest.action.search.RestSearchAction.handleRequest(RestSearchAction.java:84)
	at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:53)
	at org.elasticsearch.rest.RestController.executeHandler(RestController.java:225)
	at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:170)
	at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:121)
	at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:83)
	at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:327)
	at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:63)
	at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.elasticsearch.http.netty.pipelining.HttpPipeliningHandler.messageReceived(HttpPipeliningHandler.java:60)
	at org.elasticsearch.common.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.elasticsearch.common.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:145)
	at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.elasticsearch.common.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108)
	at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
	at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
	at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
	at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
	at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
	at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
	at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
[2017-09-18 08:58:01,710][DEBUG][action.search.type       ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] All shards failed for phase: [query_fetch]
org.elasticsearch.action.NoShardAvailableActionException: [nagioslogserver][0] null
	[... identical stack trace to the one above ...]
[2017-09-18 08:58:01,829][DEBUG][action.index             ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
And here's the Logstash log:

Code:

{:timestamp=>"2017-09-18T08:55:48.150000-0700", :message=>"Got error to send bulk of actions: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-09-18T08:55:48.150000-0700", :message=>"Failed to flush outgoing items", :outgoing_count=>330, :exception=>org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: [], :backtrace=>["org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(org/elasticsearch/client/transport/TransportClientNodesService.java:279)", "org.elasticsearch.client.transport.TransportClientNodesService.execute(org/elasticsearch/client/transport/TransportClientNodesService.java:198)", "org.elasticsearch.client.transport.support.InternalTransportClient.execute(org/elasticsearch/client/transport/support/InternalTransportClient.java:106)", "org.elasticsearch.client.support.AbstractClient.bulk(org/elasticsearch/client/support/AbstractClient.java:163)", "org.elasticsearch.client.transport.TransportClient.bulk(org/elasticsearch/client/transport/TransportClient.java:356)", "org.elasticsearch.action.bulk.BulkRequestBuilder.doExecute(org/elasticsearch/action/bulk/BulkRequestBuilder.java:164)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:91)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:65)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch/protocol.rb:224)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch/protocol.rb:224)", "LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:466)", 
"LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:466)", "LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:465)", "LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:465)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:490)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:490)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:489)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:489)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219)", "org.jruby.RubyHash.each(org/jruby/RubyHash.java:1341)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216)", 
"Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:112)", "org.jruby.RubyKernel.loop(org/jruby/RubyKernel.java:1511)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:110)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}

Re: Elastic cluster going into a bad state

Posted: Mon Sep 18, 2017 12:17 pm
by cdienger
Elasticsearch is limited to half of a system's total memory, so of your 8 GB it can only work with 4 GB, which is likely too little for this volume. If you look a little further back in the Elasticsearch logs, prior to the "All shards failed for phase" message, do you see any memory errors? That would be the smoking gun.
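As a rough sketch of that sizing rule of thumb (the helper below is illustrative, not part of Nagios Log Server or Elasticsearch):

```python
def recommended_heap_gb(total_ram_gb: float) -> float:
    """Elasticsearch rule of thumb: give the JVM heap half of system RAM
    (the rest is left for the OS filesystem cache), and keep the heap
    under ~31 GB so compressed object pointers stay enabled."""
    return min(total_ram_gb / 2, 31.0)

# With 8 GB per node, the heap tops out at 4 GB:
print(recommended_heap_gb(8))  # 4.0
```

In practice that ceiling only rises by giving the nodes more RAM.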

Re: Elastic cluster going into a bad state

Posted: Mon Sep 18, 2017 12:32 pm
by Jklre
Looks like a lot of long GC pauses.

Code:

[2017-09-18 08:41:02,405][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288149][33485] duration [10.5s], collections [1]/[10.8s], total [10.5s]/[16.5h], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [159.7mb]->[167.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:41:10,791][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288150][33486] duration [8s], collections [1]/[8.3s], total [8s]/[16.5h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [167.3mb]->[61.9mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:41:22,560][WARN ][cluster.service          ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] cluster state update task [routing-table-updater] took 1m above the warn threshold of 30s
[2017-09-18 08:41:22,560][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288152][33487] duration [10.7s], collections [1]/[10.7s], total [10.7s]/[16.5h], memory [3.8gb]->[3.6gb]/[3.8gb], all_pools {[young] [266.2mb]->[69.1mb]/[266.2mb]}{[survivor] [31.9mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:41:31,815][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288153][33488] duration [8.2s], collections [1]/[9.2s], total [8.2s]/[16.5h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [69.1mb]->[83.1mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:41:44,352][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288155][33489] duration [11.4s], collections [1]/[11.5s], total [11.4s]/[16.5h], memory [3.8gb]->[3.6gb]/[3.8gb], all_pools {[young] [266.2mb]->[84.9mb]/[266.2mb]}{[survivor] [28.8mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:41:56,681][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 9969 numDocs: 9969 vs. true
[2017-09-18 08:41:56,683][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 5690 numDocs: 5690 vs. true
[2017-09-18 08:41:56,685][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288156][33490] duration [11.7s], collections [1]/[12.3s], total [11.7s]/[16.5h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [84.9mb]->[76.1mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:42:05,442][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288157][33491] duration [8.2s], collections [1]/[8.7s], total [8.2s]/[16.5h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [76.1mb]->[88.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:42:17,672][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 78285 numDocs: 78285 vs. true
[2017-09-18 08:42:17,673][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 33602 numDocs: 33602 vs. true
[2017-09-18 08:42:17,687][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288158][33492] duration [11.4s], collections [1]/[12.2s], total [11.4s]/[16.5h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [88.7mb]->[96mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:42:28,584][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288159][33493] duration [10.4s], collections [1]/[10.8s], total [10.4s]/[16.5h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [96mb]->[90.6mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:42:37,426][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288160][33494] duration [8.1s], collections [1]/[8.8s], total [8.1s]/[16.5h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [90.6mb]->[88.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:42:49,094][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288161][33495] duration [10.8s], collections [1]/[11.6s], total [10.8s]/[16.5h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [88.3mb]->[110.8mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:43:01,934][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288162][33496] duration [12s], collections [1]/[12.8s], total [12s]/[16.5h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [110.8mb]->[97.1mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:43:09,538][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288163][33497] duration [6.8s], collections [1]/[7.6s], total [6.8s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [97.1mb]->[92.2mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:43:20,623][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 8295 numDocs: 8295 vs. true
[2017-09-18 08:43:20,639][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 3450 numDocs: 3450 vs. true
[2017-09-18 08:43:20,639][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288164][33498] duration [10.3s], collections [1]/[11.1s], total [10.3s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [92.2mb]->[97.6mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:43:20,664][WARN ][cluster.service          ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] cluster state update task [shard-started ([logstash-2017.07.15][1], node[4KlaIgx3Sdu8dSa9NVCUQQ], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[41a07432-8d31-4259-a3d5-9ba9c0379bad][L9BZoCYaT2i93EPP8ph0_g][pnls01lxv.mitchell.com][inet[/172.24.25.135:9300]]{max_local_storage_nodes=1}]]] took 30.9s above the warn threshold of 30s
[2017-09-18 08:43:33,259][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288165][33499] duration [12.1s], collections [1]/[12.6s], total [12.1s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [97.6mb]->[115.2mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:43:41,886][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288166][33500] duration [8.3s], collections [1]/[8.6s], total [8.3s]/[16.6h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [115.2mb]->[190.8mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:43:54,423][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288167][33501] duration [12.1s], collections [1]/[12.5s], total [12.1s]/[16.6h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [190.8mb]->[96.8mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:43:54,513][WARN ][cluster.service          ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] cluster state update task [async_shard_fetch] took 33.8s above the warn threshold of 30s
[2017-09-18 08:44:03,267][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288168][33502] duration [8.3s], collections [1]/[8.8s], total [8.3s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [96.8mb]->[107.2mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:44:14,787][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288169][33503] duration [10.9s], collections [1]/[11.5s], total [10.9s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [107.2mb]->[112.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:44:22,470][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 15060 numDocs: 15060 vs. true
[2017-09-18 08:44:22,573][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288170][33504] duration [7.3s], collections [1]/[7.7s], total [7.3s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [112.4mb]->[106.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:44:33,852][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288171][33505] duration [10.6s], collections [1]/[11.2s], total [10.6s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [106.3mb]->[112.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:44:41,804][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288172][33506] duration [7.1s], collections [1]/[7.9s], total [7.1s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [112.4mb]->[105mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:44:41,808][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 6079 numDocs: 6079 vs. true
[2017-09-18 08:44:55,814][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288173][33507] duration [13.1s], collections [1]/[13.9s], total [13.1s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [105mb]->[121.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:45:03,499][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288174][33508] duration [7.1s], collections [1]/[7.7s], total [7.1s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [121.4mb]->[120.8mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:45:15,507][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288175][33509] duration [11.1s], collections [1]/[12s], total [11.1s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [120.8mb]->[115.9mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:45:15,746][WARN ][cluster.service          ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] cluster state update task [async_shard_fetch] took 33.1s above the warn threshold of 30s
[2017-09-18 08:45:23,666][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288176][33510] duration [7.2s], collections [1]/[8.1s], total [7.2s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [115.9mb]->[114.5mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:45:36,478][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288177][33511] duration [11.8s], collections [1]/[12.8s], total [11.8s]/[16.6h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [114.5mb]->[134.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:45:43,969][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 82813 numDocs: 82813 vs. true
[2017-09-18 08:45:43,970][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288178][33512] duration [6.8s], collections [1]/[7.4s], total [6.8s]/[16.6h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [134.4mb]->[105.2mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:45:43,976][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 16629 numDocs: 16629 vs. true
[2017-09-18 08:45:57,074][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288179][33513] duration [12.4s], collections [1]/[13.1s], total [12.4s]/[16.6h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [105.2mb]->[122.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:46:09,113][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288180][33514] duration [11.3s], collections [1]/[12s], total [11.3s]/[16.6h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [122.4mb]->[110.9mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:46:16,852][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288181][33515] duration [6.9s], collections [1]/[7.7s], total [6.9s]/[16.6h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [110.9mb]->[109.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:46:29,872][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288182][33516] duration [12s], collections [1]/[13s], total [12s]/[16.6h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [109.7mb]->[142mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:46:30,547][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 13229 numDocs: 13229 vs. true
[2017-09-18 08:46:30,615][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 84952 numDocs: 84952 vs. true
[2017-09-18 08:46:37,706][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288183][33517] duration [7s], collections [1]/[7.8s], total [7s]/[16.6h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [142mb]->[115.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:46:50,876][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288184][33518] duration [12.3s], collections [1]/[13.1s], total [12.3s]/[16.6h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [115.7mb]->[122.2mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:47:02,633][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288185][33519] duration [11.1s], collections [1]/[11.7s], total [11.1s]/[16.6h], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [122.2mb]->[128.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:47:10,512][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288186][33520] duration [7.4s], collections [1]/[7.8s], total [7.4s]/[16.6h], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [128.7mb]->[124.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:47:21,692][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288187][33521] duration [10.6s], collections [1]/[11.1s], total [10.6s]/[16.6h], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [124.3mb]->[130.1mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:47:21,718][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 67795 numDocs: 67795 vs. true
[2017-09-18 08:47:21,732][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 79060 numDocs: 79060 vs. true
[2017-09-18 08:47:21,743][WARN ][cluster.service          ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] cluster state update task [shard-started ([logstash-2016.01.07][1], node[4KlaIgx3Sdu8dSa9NVCUQQ], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[41a07432-8d31-4259-a3d5-9ba9c0379bad][L9BZoCYaT2i93EPP8ph0_g][pnls01lxv.mitchell.com][inet[/172.24.25.135:9300]]{max_local_storage_nodes=1}]]] took 30.5s above the warn threshold of 30s
[2017-09-18 08:47:29,096][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288188][33522] duration [6.8s], collections [1]/[7.4s], total [6.8s]/[16.6h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [130.1mb]->[120.2mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:47:42,079][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288190][33523] duration [11.7s], collections [1]/[11.9s], total [11.7s]/[16.6h], memory [3.8gb]->[3.7gb]/[3.8gb], all_pools {[young] [266.2mb]->[128.4mb]/[266.2mb]}{[survivor] [17.9mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:47:50,104][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288191][33524] duration [7.2s], collections [1]/[8s], total [7.2s]/[16.6h], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [128.4mb]->[123.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:47:51,125][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 5561 numDocs: 5561 vs. true
[2017-09-18 08:48:02,327][WARN ][cluster.service          ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] cluster state update task [shard-started ([logstash-2017.06.22][4], node[4KlaIgx3Sdu8dSa9NVCUQQ], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[41a07432-8d31-4259-a3d5-9ba9c0379bad][L9BZoCYaT2i93EPP8ph0_g][pnls01lxv.mitchell.com][inet[/172.24.25.135:9300]]{max_local_storage_nodes=1}]]] took 32.4s above the warn threshold of 30s
[2017-09-18 08:48:02,355][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288193][33525] duration [11.1s], collections [1]/[11.2s], total [11.1s]/[16.6h], memory [3.8gb]->[3.7gb]/[3.8gb], all_pools {[young] [266.2mb]->[147.2mb]/[266.2mb]}{[survivor] [26.4mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:48:11,481][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288194][33526] duration [8.5s], collections [1]/[9.1s], total [8.5s]/[16.6h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [147.2mb]->[117.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:48:23,064][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288195][33527] duration [10.8s], collections [1]/[11.5s], total [10.8s]/[16.6h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [117.7mb]->[125.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:48:31,034][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288196][33528] duration [7.3s], collections [1]/[7.9s], total [7.3s]/[16.6h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [125.3mb]->[117mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:48:31,037][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 285 numDocs: 285 vs. true
[2017-09-18 08:48:45,587][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288197][33529] duration [13.6s], collections [1]/[14.5s], total [13.6s]/[16.6h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [117mb]->[123.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:48:59,153][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288198][33530] duration [12.7s], collections [1]/[13.5s], total [12.7s]/[16.6h], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [123.3mb]->[129.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:48:59,972][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 9677 numDocs: 9677 vs. true
[2017-09-18 08:49:08,688][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288199][33531] duration [8.6s], collections [1]/[9.5s], total [8.6s]/[16.6h], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [129.4mb]->[125.1mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:49:21,744][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288200][33532] duration [12s], collections [1]/[13s], total [12s]/[16.6h], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [125.1mb]->[125.1mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:49:22,671][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 10485 numDocs: 10485 vs. true
[2017-09-18 08:49:34,710][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288201][33533] duration [12s], collections [1]/[12.9s], total [12s]/[16.7h], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [125.1mb]->[145.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:49:42,846][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288202][33534] duration [7.8s], collections [1]/[8.1s], total [7.8s]/[16.7h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [145.7mb]->[107.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:49:43,023][WARN ][cluster.service          ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] cluster state update task [shard-started ([logstash-2015.01.11][2], node[4KlaIgx3Sdu8dSa9NVCUQQ], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[41a07432-8d31-4259-a3d5-9ba9c0379bad][L9BZoCYaT2i93EPP8ph0_g][pnls01lxv.mitchell.com][inet[/172.24.25.135:9300]]{max_local_storage_nodes=1}]]] took 42.9s above the warn threshold of 30s
[2017-09-18 08:49:56,239][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288203][33535] duration [12.9s], collections [1]/[13.3s], total [12.9s]/[16.7h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [107.7mb]->[163.9mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:50:03,868][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288204][33536] duration [7.2s], collections [1]/[7.6s], total [7.2s]/[16.7h], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [163.9mb]->[132.1mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:50:15,678][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288205][33537] duration [11s], collections [1]/[11.8s], total [11s]/[16.7h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [132.1mb]->[119.6mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:50:23,070][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288206][33538] duration [6.9s], collections [1]/[7.3s], total [6.9s]/[16.7h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [119.6mb]->[113mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:50:23,229][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 2884 numDocs: 2884 vs. true
[2017-09-18 08:50:23,269][WARN ][cluster.service          ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] cluster state update task [shard-started ([logstash-2016.01.04][4], node[4KlaIgx3Sdu8dSa9NVCUQQ], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[41a07432-8d31-4259-a3d5-9ba9c0379bad][L9BZoCYaT2i93EPP8ph0_g][pnls01lxv.mitchell.com][inet[/172.24.25.135:9300]]{max_local_storage_nodes=1}]]] took 40.2s above the warn threshold of 30s
[2017-09-18 08:50:34,656][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288207][33539] duration [10.9s], collections [1]/[11.5s], total [10.9s]/[16.7h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [113mb]->[124.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:50:35,276][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 5978 numDocs: 5978 vs. true
[2017-09-18 08:50:48,025][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288209][33540] duration [12.2s], collections [1]/[12.3s], total [12.2s]/[16.7h], memory [3.8gb]->[3.6gb]/[3.8gb], all_pools {[young] [266.2mb]->[104.2mb]/[266.2mb]}{[survivor] [13.2mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:50:48,729][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 5408 numDocs: 5408 vs. true
[2017-09-18 08:50:57,533][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288210][33541] duration [8.7s], collections [1]/[9.5s], total [8.7s]/[16.7h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [104.2mb]->[121mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:51:09,469][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288212][33542] duration [10.7s], collections [1]/[10.9s], total [10.7s]/[16.7h], memory [3.8gb]->[3.6gb]/[3.8gb], all_pools {[young] [266.2mb]->[107.5mb]/[266.2mb]}{[survivor] [11mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:51:09,480][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 11495 numDocs: 11495 vs. true
[2017-09-18 08:51:22,633][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288213][33543] duration [12.6s], collections [1]/[13.1s], total [12.6s]/[16.7h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [107.5mb]->[111.8mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:51:30,040][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288214][33544] duration [7s], collections [1]/[7.4s], total [7s]/[16.7h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [111.8mb]->[113.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:51:30,054][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 296 numDocs: 296 vs. true
[2017-09-18 08:51:43,369][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288215][33545] duration [12.7s], collections [1]/[13.3s], total [12.7s]/[16.7h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [113.4mb]->[117.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:51:52,531][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288216][33546] duration [8.6s], collections [1]/[9.1s], total [8.6s]/[16.7h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [117.4mb]->[111.8mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:51:52,540][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 10652 numDocs: 10652 vs. true
[2017-09-18 08:52:05,811][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288217][33547] duration [12.4s], collections [1]/[13.2s], total [12.4s]/[16.7h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [111.8mb]->[105.5mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:52:06,001][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 84283 numDocs: 84283 vs. true
[2017-09-18 08:52:19,316][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288219][33548] duration [12.4s], collections [1]/[12.5s], total [12.4s]/[16.7h], memory [3.8gb]->[3.6gb]/[3.8gb], all_pools {[young] [266.2mb]->[102.9mb]/[266.2mb]}{[survivor] [26.5mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:52:19,323][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 85077 numDocs: 85077 vs. true
[2017-09-18 08:52:21,129][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 7416 numDocs: 7416 vs. true
[2017-09-18 08:52:34,108][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288221][33549] duration [12.9s], collections [1]/[13.7s], total [12.9s]/[16.7h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [205mb]->[114.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:52:47,819][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288223][33550] duration [12.6s], collections [1]/[12.7s], total [12.6s]/[16.7h], memory [3.8gb]->[3.6gb]/[3.8gb], all_pools {[young] [266.2mb]->[117.8mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:52:48,594][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 82103 numDocs: 82103 vs. true
[2017-09-18 08:52:57,136][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288224][33551] duration [8.3s], collections [1]/[9.3s], total [8.3s]/[16.7h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [117.8mb]->[113.9mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:52:58,328][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 9412 numDocs: 9412 vs. true
[2017-09-18 08:52:58,660][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 67268 numDocs: 67268 vs. true
[2017-09-18 08:53:10,592][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 5262 numDocs: 5262 vs. true
[2017-09-18 08:53:10,596][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288227][33552] duration [10.7s], collections [1]/[11.4s], total [10.7s]/[16.7h], memory [3.8gb]->[3.6gb]/[3.8gb], all_pools {[young] [266.2mb]->[101.4mb]/[266.2mb]}{[survivor] [9.3mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:53:22,296][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288228][33553] duration [11.1s], collections [1]/[11.7s], total [11.1s]/[16.7h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [101.4mb]->[159.8mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:53:29,223][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 11500 numDocs: 11500 vs. true
[2017-09-18 08:53:29,251][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288229][33554] duration [6.7s], collections [1]/[6.9s], total [6.7s]/[16.7h], memory [3.7gb]->[3.8gb]/[3.8gb], all_pools {[young] [159.8mb]->[235.9mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:53:40,747][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288230][33555] duration [11.2s], collections [1]/[11.4s], total [11.2s]/[16.7h], memory [3.8gb]->[3.6gb]/[3.8gb], all_pools {[young] [235.9mb]->[102.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:53:40,823][WARN ][cluster.service          ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] cluster state update task [shard-started ([logstash-2017.02.27][4], node[4KlaIgx3Sdu8dSa9NVCUQQ], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[41a07432-8d31-4259-a3d5-9ba9c0379bad][L9BZoCYaT2i93EPP8ph0_g][pnls01lxv.mitchell.com][inet[/172.24.25.135:9300]]{max_local_storage_nodes=1}]]] took 41.1s above the warn threshold of 30s
[2017-09-18 08:53:49,807][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288231][33556] duration [8.5s], collections [1]/[9s], total [8.5s]/[16.7h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [102.7mb]->[103.9mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:54:02,591][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288232][33557] duration [12.1s], collections [1]/[12.7s], total [12.1s]/[16.7h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [103.9mb]->[108.2mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:54:02,605][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 6925 numDocs: 6925 vs. true
[2017-09-18 08:54:02,607][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 9830 numDocs: 9830 vs. true
[2017-09-18 08:54:12,119][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288233][33558] duration [8.8s], collections [1]/[9.5s], total [8.8s]/[16.7h], memory [3.6gb]->[3.6gb]/[3.8gb], all_pools {[young] [108.2mb]->[113.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:54:25,620][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 11511 numDocs: 11511 vs. true
[2017-09-18 08:54:25,656][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288234][33559] duration [12.5s], collections [1]/[13.5s], total [12.5s]/[16.7h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [113.7mb]->[127.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:54:35,196][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288235][33560] duration [8.6s], collections [1]/[9.5s], total [8.6s]/[16.7h], memory [3.7gb]->[3.6gb]/[3.8gb], all_pools {[young] [127.4mb]->[118.2mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:54:46,979][WARN ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288236][33561] duration [11.1s], collections [1]/[11.7s], total [11.1s]/[16.7h], memory [3.6gb]->[3.7gb]/[3.8gb], all_pools {[young] [118.2mb]->[133.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:54:55,847][INFO ][indices.recovery         ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] Recovery with sync ID 11950 numDocs: 11950 vs. true
[2017-09-18 08:54:55,998][INFO ][node                     ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] stopping ...
[2017-09-18 08:54:56,007][INFO ][monitor.jvm              ] [41a07432-8d31-4259-a3d5-9ba9c0379bad] [gc][old][288237][33562] duration [8.5s], collections [1]/[9s], total [8.5s]/[16.7h], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [133.7mb]->[128.8mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2017-09-18 08:54:56,158][WARN ][netty.channel.DefaultChannelPipeline] An exception was thrown by an exception handler.
java.util.concurrent.RejectedExecutionException: Worker has already been shutdown

Re: Elastic cluster going into a bad state

Posted: Mon Sep 18, 2017 1:22 pm
by cdienger
We may need to go back to around the time those "Recovery with sync ID" messages started. Can you PM me the contents of the /var/log/elasticsearch/ directory?
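If it's easier than attaching files one by one, you can roll the whole directory into a single archive first (just a sketch; the archive name is arbitrary):

Code: Select all

```shell
# Bundle the Elasticsearch logs into one archive to send over
# (archive name is arbitrary; the path is the Log Server default)
tar czf elasticsearch-logs.tar.gz /var/log/elasticsearch/
```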

Re: Elastic cluster going into a bad state

Posted: Mon Sep 18, 2017 4:21 pm
by Jklre
PM sent. I went ahead and added 2 GB of memory to both of these nodes.

Thank you.

Re: Elastic cluster going into a bad state

Posted: Mon Sep 18, 2017 4:44 pm
by tacolover101
How much data do you currently have in open indices?

Depending on that number, you may need to increase the RAM further for better performance.

Re: Elastic cluster going into a bad state

Posted: Mon Sep 18, 2017 4:48 pm
by Jklre
Cluster Statistics

Documents: 322,553,795
Primary Size: 131.8GB
Total Size: 197.6GB
Data Instances: 2
Total Shards: 10,242
Indices: 1,025
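For context, a quick sanity check on those numbers (plain shell arithmetic; the figures are copied from the panel above, and with replication each of the two nodes carries a copy of every shard):

Code: Select all

```shell
# Figures from the Cluster Statistics panel above
TOTAL_SHARDS=10242
DATA_INSTANCES=2
TOTAL_GB=197    # total size, rounded down

# Shards held per node, and the average size of each shard
echo "shards per node: $(expr $TOTAL_SHARDS / $DATA_INSTANCES)"
echo "avg shard size (MB): $(expr $TOTAL_GB \* 1024 / $TOTAL_SHARDS)"
```

That works out to over 5,000 shards per node, each averaging well under 100 MB. Every shard carries fixed heap overhead, so a count that high on a small heap would be consistent with the constant old-generation GC activity in the log above.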

Re: Elastic cluster going into a bad state

Posted: Tue Sep 19, 2017 10:13 am
by tmcdonald
It's usually memory with Log Server, historically speaking:

https://support.nagios.com/forum/viewto ... 37&t=33519

I would specifically draw your attention to this section:
DC6171 wrote:

Code: Select all

[root@logserver01 elasticsearch]# cat /etc/sysconfig/elasticsearch
# Directory where the Elasticsearch binary distribution resides
APP_DIR="/usr/local/nagioslogserver"
ES_HOME="$APP_DIR/elasticsearch"

# Heap Size (defaults to 256m min, 1g max)
# Nagios Log Server Default to 0.5 physical Memory
ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
That line sets the Elasticsearch heap to half of the physical memory, which on an 8 GB system is only 4 GB. On top of that, since you have only two nodes and they replicate data between themselves, each node essentially holds a full copy of all the logs. Adding a third (or more) instance would take on some of that burden. As you will see in the next link, you can scale up to a point before you need to scale out.
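If you want to control the heap yourself after adding memory, the computed expression can be swapped for a fixed value in the same file. This is only an illustration, not an official recommendation; size it to your own node:

Code: Select all

```shell
# /etc/sysconfig/elasticsearch -- pin the heap instead of deriving it
# Example value only; leave headroom for the OS file-system cache
ES_HEAP_SIZE=4g
```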

Starting on page 26 of this presentation, there are some recommendations for performance tweaking:

https://www.slideshare.net/nagiosinc/da ... experience