Backups failing | OutOfMemoryError[Java heap space]

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
mark.payne
Posts: 22
Joined: Mon Sep 14, 2015 11:25 pm

Backups failing | OutOfMemoryError[Java heap space]

Post by mark.payne »

I have two servers in a cluster each with a backup share mounted locally.
It managed to complete the first day's backup, then stopped working. No "backup snapshots" are present in Backup and Maintenance.
Both servers can access the share. I have reset all jobs and rerun them, but it still doesn't work.
I can see in the logs that it shows the following:

2015-11-25 11:55:05,553 INFO Job starting...
2015-11-25 11:55:05,556 INFO Beginning SNAPSHOT operations...
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 736, in <module>
main()
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 731, in main
arguments.func(client, **argdict)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 566, in command_loop
snapshot_list = get_object_list(client, data_type='snapshot', **kwargs)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 279, in get_object_list
object_list = get_snaplist(client, repository, prefix=prefix)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 234, in get_snaplist
allsnaps = client.snapshot.get(repository=repo_name, snapshot="_all")['snapshots']
File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/lib/python2.7/site-packages/elasticsearch/client/snapshot.py", line 58, in get
repository, snapshot), params=params)
File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 93, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.TransportError: TransportError(500, u'RemoteTransportException[[d1e4c296-b0e6-4d29-a854-62197b986998][inet[/192.168.136.131:9300]][cluster:admin/snapshot/get]]; nested: OutOfMemoryError[Java heap space]; ')

I can see in the Elasticsearch config that the heap size is:
ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m

Both servers have 16GB of memory with 36% free.
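For reference, that expression allocates half of the machine's RAM in megabytes. Hard-coding the `free -m` value for one of these 16GB servers, the arithmetic works out like this:

```shell
# Stand-in for: free -m | awk '/^Mem:/{print $2}'  (total RAM in MB on a 16GB box)
total_mb=16384
# Same expression the config uses: half of total RAM, with an "m" suffix
heap=$(expr $total_mb / 2)m
echo "$heap"   # -> 8192m
```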

I'm so close to getting the proof of concept done and tested so that I can purchase and put it into production.

Any help would be appreciated.
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: Backups failing | OutOfMemoryError[Java heap space]

Post by Box293 »

Can you try this please:

Edit /etc/sysconfig/elasticsearch
Uncomment ES_HEAP_SIZE=1g
Change it to:

Code: Select all

ES_HEAP_SIZE=4g
Save and then restart the elasticsearch service.
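If you'd rather do it from the command line, something like this should be equivalent (the sed line is just shorthand for the manual edit; service name as on a stock Nagios Log Server install):

```shell
# Uncomment ES_HEAP_SIZE and set it to 4g in one step
# (shorthand for manually editing /etc/sysconfig/elasticsearch):
sudo sed -i 's/^#\?ES_HEAP_SIZE=.*/ES_HEAP_SIZE=4g/' /etc/sysconfig/elasticsearch
# Restart so the JVM picks up the new heap size:
sudo service elasticsearch restart
```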

mark.payne wrote:I have two servers in a cluster each with a backup share mounted locally.
It managed to backup the first days backup then stopped working. No "backup snapshots" are present in Backup and Maintenance.
Both servers can access the share. I have reset all jobs and rerun but still doesn't work.
I want to confirm that it's the same central share mounted on both servers in the same location.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
mark.payne
Posts: 22
Joined: Mon Sep 14, 2015 11:25 pm

Re: Backups failing | OutOfMemoryError[Java heap space]

Post by mark.payne »

Currently it is set to ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
If I set this to ES_HEAP_SIZE=4g, I believe it would give ES less memory than is already being allocated.

I tried it at 4g, and now there is 65% free memory out of 16GB where there was 36% free before, so ES doesn't have as much allocated as it did.
This is causing the interface to hang and become unresponsive.
I had to revert it back to what it was.

It is the exact same central network share mounted in the same location locally.
mark.payne
Posts: 22
Joined: Mon Sep 14, 2015 11:25 pm

Re: Backups failing | OutOfMemoryError[Java heap space]

Post by mark.payne »

I changed the heap size to 8g; that was OK, but I still received the same error.
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: Backups failing | OutOfMemoryError[Java heap space]

Post by Box293 »

I'm not sure if you can add more RAM to the servers, but if you can, could you increase each instance to 32GB and revert the setting I asked you to change back to its original value:
Currently it is set to ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
mark.payne
Posts: 22
Joined: Mon Sep 14, 2015 11:25 pm

Re: Backups failing | OutOfMemoryError[Java heap space]

Post by mark.payne »

I increased each server to 32GB.
Still getting the same error:

2015-11-25 15:29:29,804 INFO Beginning SNAPSHOT operations...
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 736, in <module>
main()
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 731, in main
arguments.func(client, **argdict)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 566, in command_loop
snapshot_list = get_object_list(client, data_type='snapshot', **kwargs)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 279, in get_object_list
object_list = get_snaplist(client, repository, prefix=prefix)
File "/usr/lib/python2.7/site-packages/curator/curator.py", line 234, in get_snaplist
allsnaps = client.snapshot.get(repository=repo_name, snapshot="_all")['snapshots']
File "/usr/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/lib/python2.7/site-packages/elasticsearch/client/snapshot.py", line 58, in get
repository, snapshot), params=params)
File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 93, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.TransportError: TransportError(500, u'RemoteTransportException[[d1e4c296-b0e6-4d29-a854-62197b986998][inet[/192.168.136.131:9300]][cluster:admin/snapshot/get]]; nested: OutOfMemoryError; ')
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Backups failing | OutOfMemoryError[Java heap space]

Post by jolson »

Mark Payne,

I'd like the following information from you.

md5sum of your curator.py file.

Code: Select all

md5sum /usr/lib/python2.7/site-packages/curator/curator.py
Some curl debugging:

Code: Select all

curl  localhost:9200/_snapshot/_all
curl  localhost:9200/_snapshot/
curl -XPOST 'http://localhost:9200/_export/state'
curl 'localhost:9200/_cluster/health?level=indices&pretty'
You are also free to set the ES_HEAP_SIZE variable back to what it was set to originally: $(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m

Let me know the results of the above commands - in my experience, either you have an improper curator.py file or there's a stuck snapshot somewhere. Thanks!
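If it is a stuck snapshot, the snapshot status API should show anything still in progress in the repository (using the 'backup' repository name from this thread; the snapshot name in the delete example is a placeholder):

```shell
repo=backup
# Any snapshot still IN_PROGRESS in the repository shows up here:
curl -s "localhost:9200/_snapshot/$repo/_status?pretty"
# A wedged snapshot can be aborted by deleting it (substitute the real name):
# curl -XDELETE "localhost:9200/_snapshot/$repo/<snapshot_name>"
```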
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
mark.payne
Posts: 22
Joined: Mon Sep 14, 2015 11:25 pm

Re: Backups failing | OutOfMemoryError[Java heap space]

Post by mark.payne »

I have reverted the HEAP changes back to default.

Information requested below:

Code: Select all

md5sum /usr/lib/python2.7/site-packages/curator/curator.py
9d19626b8486f05156c77a0dacc93343  /usr/lib/python2.7/site-packages/curator/curator.py

curl  localhost:9200/_snapshot/_all 
{"backup":{"type":"fs","settings":{"compress":"true","location":"/mnt/backup/nagios"}}}

curl  localhost:9200/_snapshot/
{"backup":{"type":"fs","settings":{"compress":"true","location":"/mnt/backup/nagios"}}}

curl -XPOST 'http://localhost:9200/_export/state'
{"count":0,"states":[]}

curl 'localhost:9200/_cluster/health?level=indices&pretty'
  "cluster_name" : "88244341-3928-4ea7-9363-93d3c9b771ca",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 186,
  "active_shards" : 372,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "indices" : {
    "nagioslogserver" : {
      "status" : "green",
      "number_of_shards" : 1,
      "number_of_replicas" : 1,
      "active_primary_shards" : 1,
      "active_shards" : 2,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.03" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.02" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.05" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.04" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.07" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.25" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.24" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.06" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.09" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.08" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "nagioslogserver_log" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.10" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.11" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.12" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.30" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.31" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.19" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.18" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.17" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.16" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.15" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.14" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.13" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.22" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.23" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.20" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.21" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.22" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.11.01" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.26" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.25" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "kibana-int" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.24" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.23" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.29" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.28" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    },
    "logstash-2015.10.27" : {
      "status" : "green",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0
    }
  }
}
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: Backups failing | OutOfMemoryError[Java heap space]

Post by Box293 »

Thanks for that. It's currently the Thanksgiving holiday in the USA and the support office is closed, so I would not expect a reply until next week.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Backups failing | OutOfMemoryError[Java heap space]

Post by jolson »

All of your information looks okay.

What revision of NLS are you using? The latest revision includes several backup fixes, so if you're not on it, I urge you to upgrade.

Try running curator manually to fetch your snapshot status:

Code: Select all

python /usr/lib/python2.7/site-packages/curator/curator.py show --show-snapshots --repository 'backup'
Can you initiate any curator commands and have them finish successfully?

Code: Select all

python /usr/lib/python2.7/site-packages/curator/curator.py snapshot --older-than 1 --repository 'backup'
python /usr/lib/python2.7/site-packages/curator/curator.py close --older-than 100
I'm interested in any errors you encounter along the way. Thank you!