Going through elasticsearch logs on the nodes - NAG2 complains about "out of memory exception", and NAG1 reports "failed recovery". So it seems like the story is that after NAG2 rebooted, it reconnected just fine, and then it started recovery (but because the cluster is being hammered) it couldn't handle the regular work + recovery overhead as it ran out of memory and died. NAG1 then saw the failure and reported the failed recovery (also possibly being hammered and being unable to serve the recovery).
How do I get out of this mess, with minimal loss of data?
Some logs to get started (ran shortly after issuing a elasticsearch and logstash restart on both machines):
NAG2:
Code: Select all
[root@schpnag2 ~]# tail -n 50 /var/log/elasticsearch/*.log
==> /var/log/elasticsearch/4f703585-84ab-40e0-9ff9-f72c904bdc38.log <==
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-11-02 11:33:33,397][WARN ][netty.channel.DefaultChannelPipeline] An exception was thrown by a user handler while handling an exception event ([id: 0x6ffd592b, /192.168.1.249:55151 :> /192.168.1.175:9300] EXCEPTION: java.nio.channels.ClosedChannelException)
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
at org.elasticsearch.transport.netty.NettyTransport.disconnectFromNodeChannel(NettyTransport.java:946)
at org.elasticsearch.transport.netty.NettyTransport.exceptionCaught(NettyTransport.java:612)
at org.elasticsearch.transport.netty.MessageChannelHandler.exceptionCaught(MessageChannelHandler.java:239)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:377)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireExceptionCaught(Channels.java:525)
at org.elasticsearch.common.netty.channel.Channels$7.run(Channels.java:499)
at org.elasticsearch.common.netty.channel.socket.ChannelRunnableWrapper.run(ChannelRunnableWrapper.java:40)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-11-02 11:33:20,728][DEBUG][action.admin.indices.stats] [843eb4bb-fb4a-4166-9f69-a1cfd529a18d] [logstash-2015.05.19][4], node[pLZpLcfwR5CRk751U3JaEw], [R], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@52b4ba65]
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:287)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:249)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:182)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.start(TransportBroadcastOperationAction.java:150)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction.doExecute(TransportBroadcastOperationAction.java:71)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction.doExecute(TransportBroadcastOperationAction.java:47)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
at org.elasticsearch.cluster.InternalClusterInfoService.updateIndicesStats(InternalClusterInfoService.java:267)
at org.elasticsearch.cluster.InternalClusterInfoService$ClusterInfoUpdateJob.run(InternalClusterInfoService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
==> /var/log/elasticsearch/4f703585-84ab-40e0-9ff9-f72c904bdc38.log <==
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-11-02 11:33:33,397][WARN ][netty.channel.DefaultChannelPipeline] An exception was thrown by a user handler while handling an exception event ([id: 0x6ffd592b, /192.168.1.249:55151 :> /192.168.1.175:9300] EXCEPTION: java.nio.channels.ClosedChannelException)
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
at org.elasticsearch.transport.netty.NettyTransport.disconnectFromNodeChannel(NettyTransport.java:946)
at org.elasticsearch.transport.netty.NettyTransport.exceptionCaught(NettyTransport.java:612)
at org.elasticsearch.transport.netty.MessageChannelHandler.exceptionCaught(MessageChannelHandler.java:239)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:377)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireExceptionCaught(Channels.java:525)
at org.elasticsearch.common.netty.channel.Channels$7.run(Channels.java:499)
at org.elasticsearch.common.netty.channel.socket.ChannelRunnableWrapper.run(ChannelRunnableWrapper.java:40)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-11-02 11:33:20,728][DEBUG][action.admin.indices.stats] [843eb4bb-fb4a-4166-9f69-a1cfd529a18d] [logstash-2015.05.19][4], node[pLZpLcfwR5CRk751U3JaEw], [R], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@52b4ba65]
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:287)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:249)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:182)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.start(TransportBroadcastOperationAction.java:150)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction.doExecute(TransportBroadcastOperationAction.java:71)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction.doExecute(TransportBroadcastOperationAction.java:47)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
at org.elasticsearch.cluster.InternalClusterInfoService.updateIndicesStats(InternalClusterInfoService.java:267)
at org.elasticsearch.cluster.InternalClusterInfoService$ClusterInfoUpdateJob.run(InternalClusterInfoService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Code: Select all
[root@schpnag1 ~]# tail -n 50 /var/log/elasticsearch/*.log
==> /var/log/elasticsearch/4f703585-84ab-40e0-9ff9-f72c904bdc38.log <==
[2015-10-30 20:10:51,851][DEBUG][action.bulk ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[ a lot of these guys - suppressed for readability]
[2015-10-30 20:14:44,644][DEBUG][action.bulk ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:22:44,299][WARN ][indices.cluster ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] [[logstash-2015.10.24][4]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-2015.10.24][4]: Recovery failed from [843eb4bb-fb4a-4166-9f69-a1cfd529a18d][3QeisfpZRdGibif3dxUlJQ][schpnag2][inet[/192.168.1.249:9300]]{max_local_storage_nodes=1} into [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b][pLZpLcfwR5CRk751U3JaEw][schpnag1][inet[/192.168.1.175:9300]]{max_local_storage_nod