Failed Recovery - Dead Cluster.

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
polarbear1
Posts: 73
Joined: Mon Apr 13, 2015 4:26 pm

Post by polarbear1 »

Back story: 2-node cluster (let's call them NAG1 and NAG2). The web GUI was extremely slow, but cluster status (and a local look at top) didn't reveal any resource constraints: CPU, memory, and swap usage were about average. My guess is that the system was being hammered anyway. Out of ideas, I figured I would reboot NAG2, let it resync with NAG1, and maybe get some performance improvement. After the sync completed, I intended to reboot NAG1. Currently, both nodes are down (Elasticsearch and Logstash both crash shortly after start-up).

Going through the Elasticsearch logs on the nodes: NAG2 complains about an out-of-memory error (java.lang.OutOfMemoryError: unable to create new native thread), and NAG1 reports "failed recovery". So the story seems to be that after NAG2 rebooted, it reconnected just fine and started recovery, but because the cluster was being hammered it couldn't handle the regular work plus the recovery overhead, so it ran out of memory and died. NAG1 then saw the failure and reported the failed recovery (possibly because it was also being hammered and unable to serve the recovery).


How do I get out of this mess, with minimal loss of data?

Some logs to get started (captured shortly after restarting Elasticsearch and Logstash on both machines):


NAG2:

Code:

[root@schpnag2 ~]# tail -n 50 /var/log/elasticsearch/*.log

==> /var/log/elasticsearch/4f703585-84ab-40e0-9ff9-f72c904bdc38.log <==
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-11-02 11:33:33,397][WARN ][netty.channel.DefaultChannelPipeline] An exception was thrown by a user handler while handling an exception event ([id: 0x6ffd592b, /192.168.1.249:55151 :> /192.168.1.175:9300] EXCEPTION: java.nio.channels.ClosedChannelException)
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
        at org.elasticsearch.transport.netty.NettyTransport.disconnectFromNodeChannel(NettyTransport.java:946)
        at org.elasticsearch.transport.netty.NettyTransport.exceptionCaught(NettyTransport.java:612)
        at org.elasticsearch.transport.netty.MessageChannelHandler.exceptionCaught(MessageChannelHandler.java:239)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:377)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireExceptionCaught(Channels.java:525)
        at org.elasticsearch.common.netty.channel.Channels$7.run(Channels.java:499)
        at org.elasticsearch.common.netty.channel.socket.ChannelRunnableWrapper.run(ChannelRunnableWrapper.java:40)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-11-02 11:33:20,728][DEBUG][action.admin.indices.stats] [843eb4bb-fb4a-4166-9f69-a1cfd529a18d] [logstash-2015.05.19][4], node[pLZpLcfwR5CRk751U3JaEw], [R], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@52b4ba65]
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:287)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:249)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:182)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.start(TransportBroadcastOperationAction.java:150)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction.doExecute(TransportBroadcastOperationAction.java:71)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction.doExecute(TransportBroadcastOperationAction.java:47)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
        at org.elasticsearch.cluster.InternalClusterInfoService.updateIndicesStats(InternalClusterInfoService.java:267)
        at org.elasticsearch.cluster.InternalClusterInfoService$ClusterInfoUpdateJob.run(InternalClusterInfoService.java:356)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

NAG1:

Code:

[root@schpnag1 ~]# tail -n 50 /var/log/elasticsearch/*.log

==> /var/log/elasticsearch/4f703585-84ab-40e0-9ff9-f72c904bdc38.log <==
[2015-10-30 20:10:51,851][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[ a lot of these guys - suppressed for readability]
[2015-10-30 20:14:44,644][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:22:44,299][WARN ][indices.cluster          ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] [[logstash-2015.10.24][4]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-2015.10.24][4]: Recovery failed from [843eb4bb-fb4a-4166-9f69-a1cfd529a18d][3QeisfpZRdGibif3dxUlJQ][schpnag2][inet[/192.168.1.249:9300]]{max_local_storage_nodes=1} into [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b][pLZpLcfwR5CRk751U3JaEw][schpnag1][inet[/192.168.1.175:9300]]{max_local_storage_nod
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Failed Recovery - Dead Cluster.

Post by jolson »

There are really two options here - the first option being the recommended path to resolution.

Increase the amount of memory in both of your nodes and reboot them. This will ensure that the nodes can handle the load that we're discussing. Double the memory if possible; if that's not possible, give them all of the memory that you can.

Let me know if adding additional memory to your nodes is not possible. The out of memory error heavily implies that you're lacking the necessary memory to perform a recovery procedure in addition to handling your normal log volume.
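
One caveat worth checking alongside the RAM upgrade: the specific message in the NAG2 log, java.lang.OutOfMemoryError: unable to create new native thread, is raised when the JVM cannot start another operating-system thread. That can happen when a per-user process/thread limit is reached or native memory runs short, not only when the heap is full. The sketch below is a rough way to compare the current thread count against that limit. It assumes a Linux host with procps ps; ES_USER and thread_headroom are hypothetical names introduced for illustration, and the limit printed by ulimit -u belongs to the current shell, which may differ from the limit applied to the Elasticsearch service.

```shell
#!/bin/sh
# Hypothetical helper: remaining threads before the limit is reached.
# Prints "n/a" when the limit is "unlimited" or could not be read.
thread_headroom() {
    used=$1
    limit=$2
    case "$limit" in
        ''|*[!0-9]*) echo "n/a" ;;
        *) echo $((limit - used)) ;;
    esac
}

# ES_USER is an assumption: set it to the account that runs Elasticsearch.
user=${ES_USER:-$(id -un)}

# Limit for the current shell only; the service may be started with another.
limit=$(ulimit -u 2>/dev/null) || limit=unknown

# One output line per thread (LWP) owned by the user.
used=$(ps -L -u "$user" -o lwp= 2>/dev/null | wc -l | tr -d ' ')

echo "user=$user threads=$used limit=$limit headroom=$(thread_headroom "$used" "$limit")"
```

If the headroom is small or negative while free memory looks healthy, the thread limit is the more likely culprit than total RAM.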

Also, are you currently a customer? I see you posting in General Support, which doesn't have our SLA guarantee - if you're a customer I'd like to get your posts moved over to our Customer Support forum.

Best,


Jesse
polarbear1
Posts: 73
Joined: Mon Apr 13, 2015 4:26 pm

Re: Failed Recovery - Dead Cluster.

Post by polarbear1 »

We are a customer. I've been looking at how to get migrated to the Customer forum but was unable to find any information on that. Please send me the relevant information via PM.

Will do a data center run tomorrow and up the RAM. Currently they have 16 GB each. Will probably end up with 64 GB a pop after the upgrade.
hsmith
Agent Smith
Posts: 3539
Joined: Thu Jul 30, 2015 11:09 am
Location: 127.0.0.1

Re: Failed Recovery - Dead Cluster.

Post by hsmith »

I'm going to put the information here in case anyone is having the same issue:

If you are a customer, and you do not have customer forum access, please email [email protected]. On the support side of the spectrum, we're unable to look up your account details, or modify them.

Thanks!
Former Nagios Employee.
me.
polarbear1
Posts: 73
Joined: Mon Apr 13, 2015 4:26 pm

Re: Failed Recovery - Dead Cluster.

Post by polarbear1 »

OK, I have access to the customer forum now if you need to migrate this thread there.

Got the servers upgraded from 16 GB to 64 GB.

Some relevant memory settings I was using:

Code:

LS_HEAP_SIZE="1024m" 
although I also tried commenting it out.

Code:

ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
which translates to...
[root@schpnag1 ~]# expr $(free -m|awk '/^Mem:/{print $2}') / 2
32272
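
For readers reusing this setting: with 64 GB installed, the half-of-RAM expression above yields a heap of about 31.5 GB, which sits close to the roughly 32 GB point beyond which the JVM can no longer use compressed object pointers. A minimal sketch of the same calculation with an upper bound follows. The function name heap_mb and the 31744 MB (31 GB) cap are assumptions chosen for illustration, not values taken from Nagios Log Server.

```shell
#!/bin/sh
# Hypothetical helper: half of total RAM (in MB), capped at a ceiling.
# The default cap of 31744 MB is an illustrative assumption.
heap_mb() {
    total_mb=$1
    cap_mb=${2:-31744}
    half_mb=$((total_mb / 2))
    if [ "$half_mb" -gt "$cap_mb" ]; then
        echo "$cap_mb"
    else
        echo "$half_mb"
    fi
}

# Read total RAM the same way as the original one-liner, if free is available.
if command -v free >/dev/null 2>&1; then
    total_mb=$(free -m | awk '/^Mem:/{print $2}')
    echo "ES_HEAP_SIZE=$(heap_mb "$total_mb")m"
fi
```

On the upgraded nodes this would print a 31744m heap instead of 32272m, leaving the rest of the RAM to the filesystem cache.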


Logstash error:

Code:

Exception in thread "Ruby-0-Thread-4: /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-syslog-0.1.6/lib/logstash/inputs/syslog.rb:96" java.nio.BufferOverflowException
        at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:183)
        at org.jruby.util.io.ChannelStream.bufferedWrite(ChannelStream.java:1100)
        at org.jruby.util.io.ChannelStream.fwrite(ChannelStream.java:1277)
        at org.jruby.RubyIO.fwrite(RubyIO.java:1541)
        at org.jruby.RubyIO.write(RubyIO.java:1412)
        at org.jruby.RubyIO$INVOKER$i$1$0$write.call(RubyIO$INVOKER$i$1$0$write.gen)
        at org.jruby.RubyClass.finvoke(RubyClass.java:742)
        at org.jruby.runtime.Helpers.invoke(Helpers.java:503)
        at org.jruby.RubyBasicObject.callMethod(RubyBasicObject.java:363)
        at org.jruby.RubyIO.write(RubyIO.java:2490)
        at org.jruby.RubyIO.putsSingle(RubyIO.java:2478)
        at org.jruby.RubyIO.puts1(RubyIO.java:2407)
        at org.jruby.RubyIO.puts(RubyIO.java:2380)
        at org.jruby.RubyIO$INVOKER$i$puts.call(RubyIO$INVOKER$i$puts.gen)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
        at org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.ast.IfNode.interpret(IfNode.java:116)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)
        at org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)
        at org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:157)
        at org.jruby.runtime.Block.yield(Block.java:142)
        at org.jruby.ext.thread.Mutex.synchronize(Mutex.java:149)
        at org.jruby.ext.thread.Mutex$INVOKER$i$0$0$synchronize.call(Mutex$INVOKER$i$0$0$synchronize.gen)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)
        at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)
        at org.jruby.ast.CallNoArgBlockNode.interpret(CallNoArgBlockNode.java:64)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)
        at org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:182)
        at org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:203)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
        at org.jruby.runtime.callsite.ShiftLeftCallSite.call(ShiftLeftCallSite.java:24)
        at org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)
        at org.jruby.ast.IfNode.interpret(IfNode.java:116)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
        at org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)
        at org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)
        at org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:157)
        at org.jruby.runtime.Block.yield(Block.java:142)
        at org.jruby.RubyHash$13.visit(RubyHash.java:1354)
        at org.jruby.RubyHash.visitLimited(RubyHash.java:648)
        at org.jruby.RubyHash.visitAll(RubyHash.java:634)
        at org.jruby.RubyHash.iteratorVisitAll(RubyHash.java:1305)
        at org.jruby.RubyHash.each_pairCommon(RubyHash.java:1350)
        at org.jruby.RubyHash.each19(RubyHash.java:1341)
        at org.jruby.RubyHash$INVOKER$i$0$0$each19.call(RubyHash$INVOKER$i$0$0$each19.gen)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)
        at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)
        at org.jruby.ast.CallNoArgBlockNode.interpret(CallNoArgBlockNode.java:64)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)
        at org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)
        at org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:157)
        at org.jruby.runtime.Block.yield(Block.java:142)
        at org.jruby.ext.thread.Mutex.synchronize(Mutex.java:149)
        at org.jruby.ext.thread.Mutex$INVOKER$i$0$0$synchronize.call(Mutex$INVOKER$i$0$0$synchronize.gen)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)
        at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)
        at org.jruby.ast.CallNoArgBlockNode.interpret(CallNoArgBlockNode.java:64)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
        at org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)
        at org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:182)
        at org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:203)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
        at org.jruby.ast.FCallOneArgNode.interpret(FCallOneArgNode.java:36)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
        at org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)
        at org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:225)
        at org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:219)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:202)
        at org.jruby.ast.FCallTwoArgNode.interpret(FCallTwoArgNode.java:38)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
        at org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)
        at org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:268)
        at org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:235)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:236)
        at org.jruby.ast.FCallThreeArgNode.interpret(FCallThreeArgNode.java:40)
        at org.jruby.ast.IfNode.interpret(IfNode.java:116)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
        at org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)
        at org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)
        at org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)
        at org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)
        at org.jruby.runtime.Block.call(Block.java:101)
        at org.jruby.RubyProc.call(RubyProc.java:290)
        at org.jruby.internal.runtime.methods.ProcMethod.call(ProcMethod.java:64)
        at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:218)
        at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:214)
        at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:346)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:204)
        at org.jruby.ast.CallTwoArgNode.interpret(CallTwoArgNode.java:59)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
        at org.jruby.ast.IfNode.interpret(IfNode.java:116)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
        at org.jruby.ast.RescueBodyNode.interpret(RescueBodyNode.java:108)
        at org.jruby.ast.RescueNode.handleException(RescueNode.java:174)
        at org.jruby.ast.RescueNode.interpret(RescueNode.java:120)
        at org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)
        at org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:225)
        at org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:219)
        at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:346)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:204)
        at org.jruby.ast.FCallTwoArgNode.interpret(FCallTwoArgNode.java:38)
        at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
        at org.jruby.evaluator.ASTInterpreter.INTERPRET_BLOCK(ASTInterpreter.java:112)
        at org.jruby.runtime.Interpreted19Block.evalBlockBody(Interpreted19Block.java:206)
        at org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:194)
        at org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)
        at org.jruby.runtime.Block.call(Block.java:101)
        at org.jruby.RubyProc.call(RubyProc.java:290)
        at org.jruby.RubyProc.call(RubyProc.java:228)
        at org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:99)
        at java.lang.Thread.run(Thread.java:745)
And the Elasticsearch error:

Code:

[root@schpnag1 ~]# tail -n 50 /var/log/elasticsearch/*.log
==> /var/log/elasticsearch/4f703585-84ab-40e0-9ff9-f72c904bdc38_index_indexing_slowlog.log <==

==> /var/log/elasticsearch/4f703585-84ab-40e0-9ff9-f72c904bdc38_index_search_slowlog.log <==

==> /var/log/elasticsearch/4f703585-84ab-40e0-9ff9-f72c904bdc38.log <==
[2015-10-30 20:10:51,851][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,852][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,852][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,853][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,853][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,854][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,854][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,855][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,856][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,856][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,856][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,857][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,857][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,858][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,859][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,861][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,861][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1.2m]
[2015-10-30 20:10:51,862][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,862][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1.2m]
[2015-10-30 20:10:51,864][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,865][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,866][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,868][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,870][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,870][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,871][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,871][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,872][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1.2m]
[2015-10-30 20:10:51,874][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,875][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1.2m]
[2015-10-30 20:10:51,876][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1.2m]
[2015-10-30 20:10:51,877][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:10:51,878][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1.2m]
[2015-10-30 20:10:51,878][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:13:28,151][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-10-30 20:13:28,152][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-10-30 20:13:28,153][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-10-30 20:13:28,153][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-10-30 20:13:28,154][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-10-30 20:13:28,155][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-10-30 20:13:28,155][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-10-30 20:13:28,157][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-10-30 20:13:28,160][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-10-30 20:13:28,163][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2015-10-30 20:14:44,641][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:14:44,642][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:14:44,643][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:14:44,644][DEBUG][action.bulk              ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] observer timed out. notifying listener. timeout setting [1m], time since start [2.3m]
[2015-10-30 20:22:44,299][WARN ][indices.cluster          ] [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b] [[logstash-2015.10.24][4]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [logstash-2015.10.24][4]: Recovery failed from [843eb4bb-fb4a-4166-9f69-a1cfd529a18d][3QeisfpZRdGibif3dxUlJQ][schpnag2][inet[/192.168.1.249:9300]]{max_local_storage_nodes=1} into [ea9ddcd0-c0a5-4d5d-a802-e741d9c51a5b][pLZpLcfwR5CRk751U3JaEw][schpnag1][inet[/192.168.1.175:9300]]{max_local_storage_nod
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Failed Recovery - Dead Cluster.

Post by jolson »

What is your problem currently? Does Logstash crash after the buffer overflow error appears in your logs? What about Elasticsearch?

In general, the observer timeout log that you're seeing in Elasticsearch shows up when your nodes cannot contact one another. Please show me the output of the following:

Code: Select all

cat /usr/local/nagioslogserver/var/cluster_hosts
cat /usr/local/nagioslogserver/var/cluster_uuid
cat /usr/local/nagioslogserver/var/node_uuid
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
polarbear1
Posts: 73
Joined: Mon Apr 13, 2015 4:26 pm

Re: Failed Recovery - Dead Cluster.

Post by polarbear1 »

So good news.

It looks like all the failure logging filled up /var/log/elasticsearch to the point where my /var partition was 100% full. I cleared out some space, and the thing started right up. It is now recovering from the last few days of being down. (My actual storage partition is on a dedicated drive off /var.)


Lessons learned:
* When things are starting to chug, throw more RAM at it.
* Check your /var partition.
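
For anyone landing here later, a rough sketch of the "check your /var partition" step. The function names (usage_pct, check_partition) and the 90% threshold are made up for illustration and are not part of Nagios Log Server; the only things taken from this thread are /var and /var/log/elasticsearch.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with Nagios Log Server).

# Print the percentage of space used on the filesystem that holds a path.
# `df -P` is the POSIX output format: one line per filesystem, with the
# usage percentage in column 5.
usage_pct() {
    df -P "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print OK or WARNING for a path given a threshold in percent.
# Returns 0 when below the threshold, 1 when at or above it,
# and 2 when the usage could not be read.
check_partition() {
    path=$1
    threshold=$2
    pct=$(usage_pct "$path")
    if [ -z "$pct" ]; then
        echo "UNKNOWN: could not read usage for $path"
        return 2
    fi
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: $path is ${pct}% full"
        return 1
    fi
    echo "OK: $path is ${pct}% full"
}

# Assumed threshold of 90%; adjust to taste.
check_partition /var 90 || true
```

If it warns, `du -sh /var/log/elasticsearch` shows how much of that space the Elasticsearch logs are taking up.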


You can close this thread.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Failed Recovery - Dead Cluster.

Post by jolson »

Good find. Five times out of ten when it comes to NLS, it's a RAM problem. Have a good one!