Command Subsystem backup_maintenance job not completing

Post by **jspink** » Thu Sep 22, 2016 7:56 am

[Attachment: 2016-09-22 08_56_00-Index Status · Nagios Log Server.png]

Post by **mcapra** » Thu Sep 22, 2016 11:19 am

I agree with @rkennedy that running out of memory is probably the issue. Keep in mind that for each machine provisioned 32GB of memory, only half of that is being allocated to the regular storage tasks of elasticsearch. The other half is reserved for elasticsearch maintenance (and regular system) tasks.

If you do a ps aux | grep java, you should see that the Xmx value for the java process associated with elasticsearch is roughly half of the available memory on the machine. It's important to leave enough memory to handle maintenance tasks for stability purposes.
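As a rough illustration of that check, the sketch below pulls the -Xmx value out of a java command line (as you would see it in ps aux output) and compares it to the machine's total memory. The sample command line and the helper names are made up for the example; the real ps output on your node will look different.

```python
import re

# Multipliers for the size suffixes the JVM accepts on -Xmx.
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}

def parse_xmx_bytes(cmdline):
    """Return the -Xmx value in bytes, or None if the flag is not present."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", cmdline)
    if match is None:
        return None
    number, unit = match.groups()
    # No suffix means the value is already in bytes.
    return int(number) * UNITS.get(unit.lower(), 1)

def heap_fraction(cmdline, total_ram_bytes):
    """Return max heap as a fraction of total RAM, or None without -Xmx."""
    xmx = parse_xmx_bytes(cmdline)
    return None if xmx is None else xmx / total_ram_bytes

# Hypothetical command line for a node provisioned with 32GB.
sample = "java -Xms16g -Xmx16g -Xss256k org.elasticsearch.bootstrap.Elasticsearch"
fraction = heap_fraction(sample, 32 * 1024**3)
print(f"heap is {fraction:.0%} of RAM")
```

A result near 50% matches the half-and-half split described above.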

Post by **jspink** » Thu Sep 22, 2016 12:34 pm

mcapra wrote:I agree with @rkennedy that running out of memory is probably the issue. Keep in mind that for each machine provisioned 32GB of memory, only half of that is being allocated to the regular storage tasks of elasticsearch. The other half is reserved for elasticsearch maintenance (and regular system) tasks.

If you do a ps aux | grep java, you should see that the Xmx value for the java process associated with elasticsearch is roughly half of the available memory on the machine. It's important to leave enough memory to handle maintenance tasks for stability purposes.

With what you see/know about our enviro, is 32GB not enough? Being VMWare, we have plenty of RAM to throw at it if necessary to improve stability.
If we do add more memory, any config changes needed?

Post by **mcapra** » Thu Sep 22, 2016 12:45 pm

Just as a note, we typically recommend no more than 64GB per node. This is due to the Java heap becoming unstable when it passes 32GB. Following the previous logic, a provisioned 64GB would provide 32GB for your elasticsearch storage and around 32GB for your elasticsearch maintenance.
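To make the arithmetic concrete, here is a minimal sketch of that rule of thumb: half of the provisioned memory goes to the heap, capped at 32GB. The function name is hypothetical and the cap is simply the figure quoted above, not something read from a Nagios Log Server setting.

```python
def split_memory(ram_gb, heap_cap_gb=32):
    """Split provisioned RAM into (heap, remainder) using the half-of-RAM
    rule of thumb, with the heap capped at heap_cap_gb."""
    heap = min(ram_gb / 2, heap_cap_gb)
    return heap, ram_gb - heap

for ram in (32, 64, 128):
    heap, rest = split_memory(ram)
    print(f"{ram}GB provisioned: {heap:g}GB heap, {rest:g}GB for everything else")
```

Past 64GB the heap stops growing under this rule, so extra memory on a single node only adds to the non-heap side.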

jspink wrote:With what you see/know about our enviro, is 32GB not enough?

I would say it's entering the realm of "not enough", mostly due to a memory management breaker being tripped within one of the nodes:

[2016-09-15 07:55:18,710][WARN ][indices.breaker          ] [f3c66e59-29ad-439d-91c9-c2f2049ac660] [FIELDDATA] New used memory 10152273205 [9.4gb] from field [message] would be larger than configured breaker: 10099988889 [9.4gb], breaking

This is sort of a prelude to full-blown OutOfMemory exceptions, which generally result in shards failing and potential loss of data within indices.
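For anyone who wants to check the numbers in that warning, the sketch below parses the log line and works out how far the request overshot the breaker. It also estimates the heap size the limit implies, assuming the fielddata breaker is set to 60% of the heap; that percentage is an assumption about this install's configuration, not something the log line states.

```python
import re

LOG_LINE = (
    "[2016-09-15 07:55:18,710][WARN ][indices.breaker          ] "
    "[f3c66e59-29ad-439d-91c9-c2f2049ac660] [FIELDDATA] New used memory "
    "10152273205 [9.4gb] from field [message] would be larger than "
    "configured breaker: 10099988889 [9.4gb], breaking"
)

# Captures: requested bytes, field name, configured limit in bytes.
PATTERN = re.compile(
    r"New used memory (\d+) .*? from field \[(\w+)\] .*?configured breaker: (\d+)"
)

def parse_breaker(line):
    """Return the field, requested bytes, and limit from a breaker warning."""
    m = PATTERN.search(line)
    if m is None:
        return None
    used, field, limit = m.groups()
    return {"field": field, "used": int(used), "limit": int(limit)}

info = parse_breaker(LOG_LINE)
overshoot_bytes = info["used"] - info["limit"]

# ASSUMPTION: breaker limit is 60% of the JVM heap on this node.
ASSUMED_LIMIT_FRACTION = 0.60
implied_heap_gib = info["limit"] / ASSUMED_LIMIT_FRACTION / 1024**3

print(f"field={info['field']} overshoot={overshoot_bytes / 1024**2:.1f} MiB")
print(f"implied heap is about {implied_heap_gib:.1f} GiB")
```

Under that assumption the implied heap comes out a little under 16GB, which lines up with half of a 32GB machine.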

jspink wrote:If we do add more memory, any config changes needed?

A simple restart of the elasticsearch and logstash processes would be required, but that is typically taken care of when you reboot the VM to add memory.

Post by **jspink** » Thu Sep 22, 2016 1:15 pm

In my logstash config, I have the following custom setting:
# Arguments to pass to java
LS_HEAP_SIZE="2048m"
LS_JAVA_OPTS="-Djava.io.tmpdir=$APP_DIR/tmp"

I don't recall exactly what the default setting is, but the 2048m is NOT default - any issue there that you can see?

*Edit: default is 256m

Post by **mcapra** » Thu Sep 22, 2016 1:22 pm

That should be fine especially if (at the transport layer) the events are being distributed among Logstash instances.

Logstash doesn't require a particularly large heap. Usually, the first place Logstash runs into issues is related to the LS_OPEN_FILES directive. There's a different set of Java exceptions that wind up in the logs if that is encountered though.

Post by **jspink** » Thu Sep 22, 2016 1:26 pm

mcapra wrote:That should be fine especially if (at the transport layer) the events are being distributed among Logstash instances.

Logstash doesn't require a particularly large heap. Usually, the first place Logstash runs into issues is related to the LS_OPEN_FILES directive. There's a different set of Java exceptions that wind up in the logs if that is encountered though.

In a previous post, LS_OPEN_FILES was already changed to 65000 (or so).
We'll work on the memory updates and report back.

Post by **mcapra** » Thu Sep 22, 2016 1:37 pm

Awesome! Let me know if the command subsystem continues to cause problems.

Post by **jspink** » Tue Sep 27, 2016 10:05 am

We took a slightly different approach to this. Rather than just throw RAM at it, we decided to change our maintenance settings.

Original was:

Close indexes older than 16 days
Delete indexes older than 17 days

New settings:

Close indexes older than 11 days
Delete indexes older than 17 days

Things seem to be working much better, and our average memory usage is sitting between 30% and 35% now.
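To show what that change does to the number of open indices, here is a small sketch that classifies one index per day by age. It assumes daily indices and treats "older than N days" as age strictly greater than N; the function names and the 30-day lookback are made up for the example and are not how Nagios Log Server implements its maintenance job.

```python
from datetime import date, timedelta

def index_state(index_day, today, close_after_days, delete_after_days):
    """Classify a daily index as open, closed, or deleted by its age in days."""
    age = (today - index_day).days
    if age > delete_after_days:
        return "deleted"
    if age > close_after_days:
        return "closed"
    return "open"

def summarize(today, close_after_days, delete_after_days, lookback_days=30):
    """Count index states for one index per day over the lookback window."""
    counts = {"open": 0, "closed": 0, "deleted": 0}
    for age in range(lookback_days):
        day = today - timedelta(days=age)
        state = index_state(day, today, close_after_days, delete_after_days)
        counts[state] += 1
    return counts

today = date(2016, 9, 27)
print("original:", summarize(today, close_after_days=16, delete_after_days=17))
print("new:     ", summarize(today, close_after_days=11, delete_after_days=17))
```

Under these assumptions the open count drops from 17 daily indices to 12, while the same amount of data stays on disk until the 17-day delete.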

Post by **rkennedy** » Tue Sep 27, 2016 10:27 am

I suspect it all has to do with the total amount of RAM. Now that indices are closing sooner, that frees up quite a bit of it. How much log data do you have for the past 11 days now?

Nagios Support Forum
