
Backups failing | OutOfMemoryError[Java heap space]

Posted: Tue Nov 24, 2015 6:38 pm
by mark.payne
I have two servers in a cluster, each with a backup share mounted locally.
It managed to complete the first day's backup, then stopped working. No "backup snapshots" are present in Backup and Maintenance.
Both servers can access the share. I have reset all jobs and rerun them, but it still doesn't work.
The logs show the following:

2015-11-25 11:55:05,553 INFO Job starting...
2015-11-25 11:55:05,556 INFO Beginning SNAPSHOT operations...
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 736, in <module>
main()
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 731, in main
arguments.func(client, **argdict)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 566, in command_loop
snapshot_list = get_object_list(client, data_type='snapshot', **kwargs)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 279, in get_object_list
object_list = get_snaplist(client, repository, prefix=prefix)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 234, in get_snaplist
allsnaps = client.snapshot.get(repository=repo_name, snapshot="_all")['snapshots']
File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/lib/python2.7/site-packages/elasticsearch/client/snapshot.py", line 58, in get
repository, snapshot), params=params)
File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 93, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.TransportError: TransportError(500, u'RemoteTransportException[[d1e4c296-b0e6-4d29-a854-62197b986998][inet[/192.168.136.131:9300]][cluster:admin/snapshot/get]]; nested: OutOfMemoryError[Java heap space]; ')

I can see in the Elasticsearch config that the heap size is set to:
ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m

Both servers have 16GB of memory with 36% free.
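For reference, that expression allocates half of total RAM (integer division, in MB) to the heap. A minimal sketch of the arithmetic, using 16047 MB as a stand-in for what `free -m` might report on a 16GB box (the exact figure is an assumption here):

```shell
# Stand-in value; on a real system this comes from: free -m | awk '/^Mem:/{print $2}'
total_mb=16047
# Integer-divide by 2, exactly as the ES_HEAP_SIZE expression does
heap_mb=$(expr $total_mb / 2)
echo "ES_HEAP_SIZE=${heap_mb}m"
```

So with 16GB installed, the default formula hands Elasticsearch roughly an 8GB heap.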

I'm so close to getting the proof of concept done and tested so I can purchase and put it into production.

Any help would be appreciated.

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Tue Nov 24, 2015 7:01 pm
by Box293
Can you try this please:

Edit /etc/sysconfig/elasticsearch, uncomment the ES_HEAP_SIZE=1g line, and change it to:

ES_HEAP_SIZE=4g

Save, then restart the elasticsearch service.

mark.payne wrote:I have two servers in a cluster, each with a backup share mounted locally.
It managed to complete the first day's backup, then stopped working. No "backup snapshots" are present in Backup and Maintenance.
Both servers can access the share. I have reset all jobs and rerun them, but it still doesn't work.
I want to confirm that it's the same central share mounted on both servers in the same location.

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Tue Nov 24, 2015 7:36 pm
by mark.payne
Currently it is set to ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
If I set this to ES_HEAP_SIZE=4g, I believe that would give ES less memory than it is already being allocated.

I tried it at 4g, and free memory went from 36% to 65% of 16GB, so ES had less allocated than before. This caused the interface to hang and become unresponsive, so I had to revert it back to what it was.

It is the exact same central network share mounted in the same location locally.

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Tue Nov 24, 2015 8:13 pm
by mark.payne
I changed the heap size to 8g, which was OK, but I still received the same error.

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Tue Nov 24, 2015 9:01 pm
by Box293
I'm not sure if you can add more RAM to the servers, but if you can, increase each instance to 32GB and revert the heap setting I gave you back to its original value:
ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
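As an aside, Elasticsearch's general heap guidance is to allocate at most half of RAM and stay below roughly 32 GB so compressed object pointers remain enabled. A hedged sketch of the half-of-RAM formula with such a cap (the 65536 MB total and the 31744 MB cap are illustrative values, not settings from this system):

```shell
total_mb=65536                  # stand-in for: free -m | awk '/^Mem:/{print $2}'
half_mb=$(expr $total_mb / 2)   # half of RAM, as the default formula computes
cap_mb=31744                    # ~31 GB, just under the compressed-oops threshold
# On a 64 GB box, half of RAM would exceed the cap, so clamp it
if [ "$half_mb" -gt "$cap_mb" ]; then half_mb=$cap_mb; fi
echo "ES_HEAP_SIZE=${half_mb}m"
```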

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Tue Nov 24, 2015 9:30 pm
by mark.payne
I increased each server to 32GB.
Still getting the same error:

2015-11-25 15:29:29,804 INFO Beginning SNAPSHOT operations...
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 736, in <module>
main()
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 731, in main
arguments.func(client, **argdict)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 566, in command_loop
snapshot_list = get_object_list(client, data_type='snapshot', **kwargs)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 279, in get_object_list
object_list = get_snaplist(client, repository, prefix=prefix)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 234, in get_snaplist
allsnaps = client.snapshot.get(repository=repo_name, snapshot="_all")['snapshots']
File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/lib/python2.7/site-packages/elasticsearch/client/snapshot.py", line 58, in get
repository, snapshot), params=params)
File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 93, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.TransportError: TransportError(500, u'RemoteTransportException[[d1e4c296-b0e6-4d29-a854-62197b986998][inet[/192.168.136.131:9300]][cluster:admin/snapshot/get]]; nested: OutOfMemoryError; ')

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Wed Nov 25, 2015 11:28 am
by jolson
Mark Payne,

I'd like the following information from you.

md5sum of your curator.py file.

md5sum /usr/lib/python2.6/site-packages/curator/curator.py
Some curl debugging:

curl  localhost:9200/_snapshot/_all
curl  localhost:9200/_snapshot/
curl -XPOST 'http://localhost:9200/_export/state'
curl 'localhost:9200/_cluster/health?level=indices&pretty'
You are also free to set the ES_HEAP_SIZE variable back to its original value: $(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m

Let me know the results of the above commands - in my experience, either your curator.py file is bad or there's a stuck snapshot somewhere. Thanks!
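For what it's worth, a stuck snapshot would show up with an IN_PROGRESS state in the snapshot listing (fetched from a running node with something like `curl 'localhost:9200/_snapshot/backup/_all?pretty'`). A small sketch using a canned response, with made-up snapshot names, showing what to grep for:

```shell
# Canned example of the ES snapshot-list JSON shape; snapshot names are hypothetical
response='{"snapshots":[{"snapshot":"nightly-1","state":"SUCCESS"},{"snapshot":"nightly-2","state":"IN_PROGRESS"}]}'
# Pull out just the state fields; anything other than SUCCESS deserves a closer look
echo "$response" | grep -o '"state":"[A-Z_]*"'
```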

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Wed Nov 25, 2015 3:42 pm
by mark.payne
I have reverted the HEAP changes back to default.

Information requested below:

md5sum /usr/lib/python2.7/site-packages/curator/curator.py
9d19626b8486f05156c77a0dacc93343  /usr/lib/python2.7/site-packages/curator/curator.py

curl  localhost:9200/_snapshot/_all 
{"backup":{"type":"fs","settings":{"compress":"true","location":"/mnt/backup/nagios"}}}

curl  localhost:9200/_snapshot/
{"backup":{"type":"fs","settings":{"compress":"true","location":"/mnt/backup/nagios"}}}

curl -XPOST 'http://localhost:9200/_export/state'
{"count":0,"states":[]}

curl 'localhost:9200/_cluster/health?level=indices&pretty'
{
  "cluster_name" : "88244341-3928-4ea7-9363-93d3c9b771ca",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 186,
  "active_shards" : 372,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "indices" : {
    "nagioslogserver" : {
      "status" : "green",
      "number_of_shards" : 1,
      "number_of_replicas" : 1,
      "active_primary_shards" : 1,
      "active_shards" : 2,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.03" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.02" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.05" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.04" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.07" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.25" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.24" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.06" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.09" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.08" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "nagioslogserver_log" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.10" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.11" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.12" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.30" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.31" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.19" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.18" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.17" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.16" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.15" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.14" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.13" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.22" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.23" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.20" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.21" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.22" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.01" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.26" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.25" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "kibana-int" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.24" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.23" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.29" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.28" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.27" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    }
  }
}

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Wed Nov 25, 2015 7:53 pm
by Box293
Thanks for that. It's currently the Thanksgiving holiday in the USA and the support office is closed; I would not expect a reply until next week.

Re: Backups failing | OutOfMemoryError[Java heap space]

Posted: Mon Nov 30, 2015 6:28 pm
by jolson
All of your information looks okay.

What revision of NLS are you running? The latest revision includes several backup fixes; if you're not on it, I urge you to upgrade.

Try running curator manually to fetch your snapshot status:

python /usr/lib/python2.6/site-packages/curator/curator.py show --show-snapshots --repository 'backup'
Can you run any curator commands and have them finish successfully?

python /usr/lib/python2.6/site-packages/curator/curator.py snapshot --older-than 1 --repository 'backup'
python /usr/lib/python2.6/site-packages/curator/curator.py close --older-than 100
I'm interested in any errors you encounter along the way. Thank you!