
Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Mon Nov 30, 2015 7:55 pm
by mark.payne
We are on version 1.3.

python /usr/lib/python2.7/site-packages/curator/curator.py show --show-snapshots --repository 'backup'
This outputs the following:

Code:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/curator/curator.py", line 736, in <module>
    main()
  File "/usr/lib/python2.7/site-packages/curator/curator.py", line 731, in main
    arguments.func(client, **argdict)
  File "/usr/lib/python2.7/site-packages/curator/curator.py", line 184, in show
    for snapshot in get_snaplist(client, kwargs['repository'], prefix=kwargs['prefix']):
  File "/usr/lib/python2.7/site-packages/curator/curator.py", line 234, in get_snaplist
    allsnaps = client.snapshot.get(repository=repo_name, snapshot="_all")['snapshots']
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/snapshot.py", line 58, in get
    repository, snapshot), params=params)
  File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 93, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.TransportError: TransportError(500, u'OutOfMemoryError[Java heap space]')
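A note on that error: the `TransportError(500, OutOfMemoryError[Java heap space])` is raised server-side, meaning an Elasticsearch node ran out of heap while building the `_all` snapshot listing, not that Curator itself ran out of memory. Per-node heap usage (from `curl localhost:9200/_nodes/stats/jvm`) is the first thing to check. A minimal sketch of reading that response; the node name and numbers below are illustrative samples, not taken from this cluster:

```python
import json

# Sample _nodes/stats/jvm body; the shape matches Elasticsearch 1.x,
# but this node name and these byte counts are made up for illustration.
sample = json.loads("""
{
  "nodes": {
    "abc123": {
      "name": "node-1",
      "jvm": {
        "mem": {
          "heap_used_in_bytes": 1020054732,
          "heap_max_in_bytes": 1065025536
        }
      }
    }
  }
}
""")

def heap_usage(stats):
    """Return {node_name: heap_used_percent} from a _nodes/stats/jvm body."""
    out = {}
    for node in stats["nodes"].values():
        mem = node["jvm"]["mem"]
        out[node["name"]] = 100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
    return out

print(heap_usage(sample))
```

A node sitting near its max heap (roughly 90% or above) is a likely candidate for the OOM; in that case raising `ES_HEAP_SIZE` on that node is the usual remedy.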
python /usr/lib/python2.7/site-packages/curator/curator.py snapshot --older-than 1 --repository 'backup'
This outputs the following:

Code:

2015-12-01 13:52:11,430 INFO      Job starting...
2015-12-01 13:52:11,430 INFO      Default timeout of 30 seconds is too low for command SNAPSHOT.  Overriding to 21,600 seconds (6 hours).
2015-12-01 13:52:11,435 INFO      Beginning SNAPSHOT operations...
2015-12-01 13:52:11,441 INFO      Attempting to create snapshot for index logstash-2015.10.22.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/curator/curator.py", line 736, in <module>
    main()
  File "/usr/lib/python2.7/site-packages/curator/curator.py", line 731, in main
    arguments.func(client, **argdict)
  File "/usr/lib/python2.7/site-packages/curator/curator.py", line 585, in command_loop
    skipped = op(client, index_name, **kwargs)
  File "/usr/lib/python2.7/site-packages/curator/curator.py", line 401, in _create_snapshot
    successes = get_snapped_indices(client, repository, prefix=prefix)
  File "/usr/lib/python2.7/site-packages/curator/curator.py", line 245, in get_snapped_indices
    allsnaps = client.snapshot.get(repository=repo_name, snapshot="_all")['snapshots']
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python2.7/site-packages/elasticsearch/client/snapshot.py", line 58, in get
    repository, snapshot), params=params)
  File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 93, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.TransportError: TransportError(500, u'OutOfMemoryError[Java heap space]')
python /usr/lib/python2.7/site-packages/curator/curator.py close --older-than 100
This outputs the following:

Code:

2015-12-01 13:53:20,121 INFO      Job starting...
2015-12-01 13:53:20,124 INFO      Beginning CLOSE operations...
2015-12-01 13:53:20,129 INFO      logstash-2015.10.22 is within the threshold period (100 days).
2015-12-01 13:53:20,129 INFO      logstash-2015.10.23 is within the threshold period (100 days).
2015-12-01 13:53:20,129 INFO      logstash-2015.10.24 is within the threshold period (100 days).
2015-12-01 13:53:20,129 INFO      logstash-2015.10.25 is within the threshold period (100 days).
2015-12-01 13:53:20,129 INFO      logstash-2015.10.26 is within the threshold period (100 days).
2015-12-01 13:53:20,129 INFO      logstash-2015.10.27 is within the threshold period (100 days).
2015-12-01 13:53:20,129 INFO      logstash-2015.10.28 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.10.29 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.10.30 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.10.31 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.11.01 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.11.02 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.11.03 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.11.04 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.11.05 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.11.06 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.11.07 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.11.08 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.11.09 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.11.10 is within the threshold period (100 days).
2015-12-01 13:53:20,130 INFO      logstash-2015.11.11 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.12 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.13 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.14 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.15 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.16 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.17 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.18 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.19 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.20 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.21 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.22 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.23 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.24 is within the threshold period (100 days).
2015-12-01 13:53:20,131 INFO      logstash-2015.11.25 is within the threshold period (100 days).
2015-12-01 13:53:20,132 INFO      logstash-2015.11.26 is within the threshold period (100 days).
2015-12-01 13:53:20,132 INFO      logstash-2015.11.27 is within the threshold period (100 days).
2015-12-01 13:53:20,132 INFO      logstash-2015.11.28 is within the threshold period (100 days).
2015-12-01 13:53:20,132 INFO      logstash-2015.11.29 is within the threshold period (100 days).
2015-12-01 13:53:20,132 INFO      logstash-2015.11.30 is within the threshold period (100 days).
2015-12-01 13:53:20,132 INFO      logstash-2015.12.01 is within the threshold period (100 days).
2015-12-01 13:53:20,132 INFO      CLOSE index operations completed.
2015-12-01 13:53:20,132 INFO      Done in 0:00:00.017672.
Also, to add to this: I can delete the snapshots for 2015.10.22 from the backup repository and then run the backup again. It creates the indices folder and the metadata and snapshot files, but it errors out after a while.

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Tue Dec 01, 2015 11:08 am
by jolson
After running those curator commands, is there anything notable in your elasticsearch log? It seems like Elasticsearch is refusing the Curator (backup) connection.

Code:

tail -n500 /var/log/elasticsearch/*.log
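When skimming that output, the lines that matter for this error are heap- and GC-related ones. A small filter like the following picks them out (the sample log lines here are illustrative, not from your cluster):

```python
import re

# Illustrative Elasticsearch 1.x log lines -- not from the reporter's cluster.
sample_log = """\
[2015-12-01 11:31:02,114][WARN ][monitor.jvm] [node-1] [gc][old][1234][56] duration [12.3s]
[2015-12-01 11:31:15,002][INFO ][cluster.service] [node-1] routine message
[2015-12-01 11:31:40,877][ERROR][snapshot] [node-1] OutOfMemoryError[Java heap space]
"""

# Match OOM reports, GC pause logging, and anything mentioning the heap.
PATTERN = re.compile(r"OutOfMemoryError|\[gc\]|heap", re.IGNORECASE)

hits = [line for line in sample_log.splitlines() if PATTERN.search(line)]
for line in hits:
    print(line)
```

Long `[gc]` pauses shortly before the Curator failure would point the same direction as the `OutOfMemoryError`: a node running out of heap while servicing the snapshot request.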

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Tue Dec 01, 2015 5:33 pm
by mark.payne
I reran the following commands at 11:30 local time on each server:
python /usr/lib/python2.7/site-packages/curator/curator.py show --show-snapshots --repository 'backup'
python /usr/lib/python2.7/site-packages/curator/curator.py snapshot --older-than 1 --repository 'backup'

Then I ran on each server:
tail -n500 /var/log/elasticsearch/*.log

Server 1

Code:

[2015-12-02 10:50:31,019][INFO ][indices.recovery         ] [d1e4c296-b0e6-4d29-a854-62197b986998] Recovery with sync ID 14732292 numDocs: 14732292 vs. true
[2015-12-02 10:50:31,580][INFO ][indices.recovery         ] [d1e4c296-b0e6-4d29-a854-62197b986998] Recovery with sync ID 24570312 numDocs: 24570312 vs. true
[2015-12-02 10:50:31,812][INFO ][indices.recovery         ] [d1e4c296-b0e6-4d29-a854-62197b986998] Recovery with sync ID 724734 numDocs: 724734 vs. true
[2015-12-02 10:50:31,897][INFO ][indices.recovery         ] [d1e4c296-b0e6-4d29-a854-62197b986998] Recovery with sync ID 14334927 numDocs: 14334927 vs. true
[2015-12-02 10:50:32,102][INFO ][indices.recovery         ] [d1e4c296-b0e6-4d29-a854-62197b986998] Recovery with sync ID 14679545 numDocs: 14679545 vs. true
[2015-12-02 10:50:32,324][INFO ][indices.recovery         ] [d1e4c296-b0e6-4d29-a854-62197b986998] Recovery with sync ID 5361 numDocs: 5361 vs. true
[2015-12-02 10:50:32,336][INFO ][indices.recovery         ] [d1e4c296-b0e6-4d29-a854-62197b986998] Recovery with sync ID 23042685 numDocs: 23042685 vs. true
[2015-12-02 10:50:32,717][INFO ][indices.recovery         ] [d1e4c296-b0e6-4d29-a854-62197b986998] Recovery with sync ID 14788833 numDocs: 14788833 vs. true
[2015-12-02 10:50:32,869][INFO ][indices.recovery         ] [d1e4c296-b0e6-4d29-a854-62197b986998] Recovery with sync ID 929627 numDocs: 929627 vs. true
Server 2

Code:

[2015-12-02 10:49:05,078][INFO ][discovery.zen            ] [ead9d787-7f47-41c8-b504-76ccabba3bd2] master_left [[d1e4c296-b0e6-4d29-a854-62197b986998][1OVqHzxGTBCn8tbPeAPEgw][elnznls1.dps.local][inet[/192.168.136.131:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2015-12-02 10:49:05,080][WARN ][discovery.zen            ] [ead9d787-7f47-41c8-b504-76ccabba3bd2] master left (reason = transport disconnected), current nodes: {[ead9d787-7f47-41c8-b504-76ccabba3bd2][ZggggQe2SfSubGhqVqN28A][annznls1.dps.local][inet[/192.168.6.131:9300]]{max_local_storage_nodes=1},}
[2015-12-02 10:49:05,081][INFO ][cluster.service          ] [ead9d787-7f47-41c8-b504-76ccabba3bd2] removed {[d1e4c296-b0e6-4d29-a854-62197b986998][1OVqHzxGTBCn8tbPeAPEgw][elnznls1.dps.local][inet[/192.168.136.131:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([d1e4c296-b0e6-4d29-a854-62197b986998][1OVqHzxGTBCn8tbPeAPEgw][elnznls1.dps.local][inet[/192.168.136.131:9300]]{max_local_storage_nodes=1})
[2015-12-02 10:49:11,124][INFO ][cluster.service          ] [ead9d787-7f47-41c8-b504-76ccabba3bd2] detected_master [d1e4c296-b0e6-4d29-a854-62197b986998][1OVqHzxGTBCn8tbPeAPEgw][elnznls1.dps.local][inet[/192.168.136.131:9300]]{max_local_storage_nodes=1}, added {[d1e4c296-b0e6-4d29-a854-62197b986998][1OVqHzxGTBCn8tbPeAPEgw][elnznls1.dps.local][inet[/192.168.136.131:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[d1e4c296-b0e6-4d29-a854-62197b986998][1OVqHzxGTBCn8tbPeAPEgw][elnznls1.dps.local][inet[/192.168.136.131:9300]]{max_local_storage_nodes=1}])
[2015-12-02 11:24:00,426][WARN ][transport.netty          ] [ead9d787-7f47-41c8-b504-76ccabba3bd2] exception caught on transport layer [[id: 0xe58291ed, /192.168.136.131:40765 => /192.168.6.131:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format, got (7d,1b,d,a)
        at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:63)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2015-12-02 11:24:00,428][WARN ][transport.netty          ] [ead9d787-7f47-41c8-b504-76ccabba3bd2] exception caught on transport layer [[id: 0xe58291ed, /192.168.136.131:40765 :> /192.168.6.131:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format, got (7d,1b,d,a)
        at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:63)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:482)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.channelDisconnected(FrameDecoder.java:365)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:102)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireChannelDisconnected(Channels.java:396)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:360)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:81)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:36)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:574)
        at org.elasticsearch.common.netty.channel.Channels.close(Channels.java:812)
        at org.elasticsearch.common.netty.channel.AbstractChannel.close(AbstractChannel.java:206)
        at org.elasticsearch.transport.netty.NettyTransport.exceptionCaught(NettyTransport.java:638)
        at org.elasticsearch.transport.netty.MessageChannelHandler.exceptionCaught(MessageChannelHandler.java:239)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:377)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireExceptionCaught(Channels.java:525)
        at org.elasticsearch.common.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:48)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.notifyHandlerException(DefaultChannelPipeline.java:658)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:566)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
The timestamps in these logs are from before I ran the backups, but they still show some errors.
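One observation on that `StreamCorruptedException`: Elasticsearch's internal transport protocol frames every message with a fixed header (beginning with the bytes `E` `S`), and the decoder logs exactly this "invalid internal transport message format" error when something else arrives on port 9300. The four bytes it reports can be decoded directly:

```python
# Decode the bytes from the log line: got (7d,1b,d,a)
raw = bytes([0x7D, 0x1B, 0x0D, 0x0A])

print(repr(raw))  # '}' + ESC + CR + LF -- printable/control text,
                  # not the 'E','S' header the transport decoder expects
```

That looks like text (a `}` followed by a terminal escape and CRLF), which usually means a non-transport client, e.g. an HTTP tool, a port scanner, or a monitoring probe, connected to the node-to-node port 9300 instead of the HTTP port 9200. That would be a separate nuisance from the heap problem, but worth ruling out.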

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Tue Dec 01, 2015 6:23 pm
by jolson
1.3
Just to be clear, are you on version 2014R1.3 or version 1.3.0? There was a version scheme change recently, and I want to be sure you're up to date. The backup system was fixed up in 1.3.0, so I'd like to ensure that you're on the latest.

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Tue Dec 01, 2015 7:28 pm
by mark.payne
We are on the latest, 1.3.0.
This is the version we installed fresh, not an upgrade from an earlier version.

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Wed Dec 02, 2015 2:31 pm
by jolson
Conversation continued via PM: we discovered that this is a Curator issue, and the newest version of Curator resolves the problem. Currently, the newest version of Curator does not work in Nagios Log Server, because we need to customize it to work appropriately with the system. I will contact the developers and get back to you soon on that point. Thank you, Mark!