Failed Recovery - Dead Cluster.
Posted: Mon Nov 02, 2015 12:46 pm
Back story -- 2-node cluster (let's call them NAG1 and NAG2). The web GUI was extremely slow, but cluster status (and a local look at top) didn't reveal any obvious resource constraints -- CPU/memory/swap usage was about average. My guess is the system was being hammered anyway. Out of ideas, I figured I would reboot NAG2, let it resync with NAG1, and maybe see some performance improvement; once the sync completed, I intended to reboot NAG1. Currently, both nodes are down (Elasticsearch and Logstash both crash shortly after start-up).
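For reference, this is roughly how I was checking things before the reboot (assuming Elasticsearch is listening locally on the default 9200 port):

Code: Select all
# cluster status and per-node JVM/OS stats
curl -s 'http://localhost:9200/_cluster/health?pretty'
curl -s 'http://localhost:9200/_nodes/stats/os,jvm?pretty'
# local resource check
top -b -n 1 | head -n 20
free -m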
Going through the elasticsearch logs on the nodes, NAG2 complains about an "out of memory" exception and NAG1 reports "failed recovery". So the story seems to be: after NAG2 rebooted, it reconnected just fine and started recovery, but because the cluster was being hammered it couldn't handle the regular workload plus the recovery overhead, ran out of memory, and died. NAG1 then saw the failure and reported the failed recovery (it was possibly also being hammered and unable to serve the recovery).
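If it matters, my understanding is that "unable to create new native thread" usually points at a process/thread limit rather than the heap, so this is what I was planning to check on both boxes (the service account name below is a guess for this install):

Code: Select all
# max user processes for the account running Elasticsearch (account name is a guess)
su -s /bin/bash -c 'ulimit -u' elasticsearch
# system-wide thread ceiling
cat /proc/sys/kernel/threads-max
# limits of the running ES JVM, if it stays up long enough to catch it
cat /proc/$(pgrep -f org.elasticsearch | head -n 1)/limits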
How do I get out of this mess with minimal loss of data?
Some logs to get started (run shortly after issuing an elasticsearch and logstash restart on both machines):
NAG2:
Code: Select all
[root@schpnag2 ~]# tail -n 50 /var/log/elasticsearch/*.log
==> /var/log/elasticsearch/4f703585-84ab-40e0-9ff9-f72c904bdc38.log <==
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-11-02 11:33:33,397][WARN ][netty.channel.DefaultChannelPipeline] An exception was thrown by a user handler while handling an exception event ([id: 0x6ffd592b, /192.168.1.249:55151 :> /192.168.1.175:9300] EXCEPTION: java.nio.channels.ClosedChannelException)
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
at org.elasticsearch.transport.netty.NettyTransport.disconnectFromNodeChannel(NettyTransport.java:946)
at org.elasticsearch.transport.netty.NettyTransport.exceptionCaught(NettyTransport.java:612)
at org.elasticsearch.transport.netty.MessageChannelHandler.exceptionCaught(MessageChannelHandler.java:239)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:377)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireExceptionCaught(Channels.java:525)
at org.elasticsearch.common.netty.channel.Channels$7.run(Channels.java:499)
at org.elasticsearch.common.netty.channel.socket.ChannelRunnableWrapper.run(ChannelRunnableWrapper.java:40)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-11-02 11:33:20,728][DEBUG][action.admin.indices.stats] [843eb4bb-fb4a-4166-9f69-a1cfd529a18d] [logstash-2015.05.19][4], node[pLZpLcfwR5CRk751U3JaEw], [R], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@52b4ba65]
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:287)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:249)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:182)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.start(TransportBroadcastOperationAction.java:150)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction.doExecute(TransportBroadcastOperationAction.java:71)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction.doExecute(TransportBroadcastOperationAction.java:47)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
at org.elasticsearch.cluster.InternalClusterInfoService.updateIndicesStats(InternalClusterInfoService.java:267)
at org.elasticsearch.cluster.InternalClusterInfoService$ClusterInfoUpdateJob.run(InternalClusterInfoService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
NAG1:
Code: Select all
[root@schpnag1 ~]# tail -n 50 /var/log/elasticsearch/*.log
==> /var/log/elasticsearch/4f703585-84ab-40e0-9ff9-f72c904bdc38.log <==
[2015-10-30 20:10:51,851][DEBUG][action.bulk ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[ a lot of these guys - suppressed for readability]
[2015-10-30 20:14:44,644][DEBUG][action.bulk ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:22:44,299][WARN ][indices.cluster ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] [[logstash-2015.10.24][4]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-2015.10.24][4]: Recovery failed from [843eb4bb-fb4a-4166-9f69-a1cfd529a18d][3QeisfpZRdGibif3dxUlJQ][schpnag2][inet[/192.168.1.249:9300]]{max_local_storage_nodes=1} into [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b][pLZpLcfwR5CRk751U3JaEw][schpnag1][inet[/192.168.1.175:9300]]{max_local_storage_nod