Re: Command Subsystem backup_maintenance job not completing
Posted: Thu Sep 22, 2016 7:56 am
by jspink
2016-09-22 08_56_00-Index Status · Nagios Log Server.png
Re: Command Subsystem backup_maintenance job not completing
Posted: Thu Sep 22, 2016 11:19 am
by mcapra
I agree with @rkennedy that running out of memory is probably the issue. Keep in mind that for each machine provisioned 32GB of memory, only half of that is being allocated to the regular storage tasks of elasticsearch. The other half is reserved for elasticsearch maintenance (and regular system) tasks.
If you do a ps aux | grep java, you should see that the Xmx value for the java process associated with elasticsearch is roughly half of the available memory on the machine. It's important to leave enough memory to handle maintenance tasks for stability purposes.
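As a rough illustration of that check, here is a minimal Python sketch that pulls the -Xmx value out of a java command line and compares it to the machine's total memory. The sample command line and the 32 GiB total are hypothetical stand-ins, not values taken from this system; paste in the real output of ps aux | grep java to try it.

```python
import re

def parse_xmx_bytes(cmdline):
    """Return the -Xmx heap ceiling in bytes, or None if the flag is absent."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", cmdline)
    if match is None:
        return None
    value = int(match.group(1))
    unit = match.group(2).lower()
    multiplier = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}[unit]
    return value * multiplier

# Hypothetical command line; replace with a line from `ps aux | grep java`.
sample_cmdline = "java -Xms16g -Xmx16g org.elasticsearch.bootstrap.Elasticsearch"

# Assumed total memory for the example (32 GiB).
total_ram_bytes = 32 * 1024 ** 3

xmx_bytes = parse_xmx_bytes(sample_cmdline)
heap_fraction = xmx_bytes / total_ram_bytes
print(f"heap is {heap_fraction:.0%} of total RAM")
```

If the printed fraction is well above one half, less memory is left over for maintenance and system tasks than the advice above assumes.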
Re: Command Subsystem backup_maintenance job not completing
Posted: Thu Sep 22, 2016 12:34 pm
by jspink
mcapra wrote:I agree with @rkennedy that running out of memory is probably the issue. Keep in mind that for each machine provisioned 32GB of memory, only half of that is being allocated to the regular storage tasks of elasticsearch. The other half is reserved for elasticsearch maintenance (and regular system) tasks.
If you do a ps aux | grep java, you should see that the Xmx value for the java process associated with elasticsearch is roughly half of the available memory on the machine. It's important to leave enough memory to handle maintenance tasks for stability purposes.
With what you see/know about our enviro, is 32GB not enough? Being VMWare, we have plenty of RAM to throw at it if necessary to improve stability.
If we do add more memory, any config changes needed?
Re: Command Subsystem backup_maintenance job not completing
Posted: Thu Sep 22, 2016 12:45 pm
by mcapra
Just as a note, we typically recommend no more than 64GB per node. This is due to the Java heap becoming unstable when it passes 32GB. Following the previous logic, a provisioned 64GB would provide 32GB for your elasticsearch storage and around 32GB for your elasticsearch maintenance.
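The arithmetic behind that recommendation can be sketched in a few lines of Python. This only restates the rule of thumb from this thread (half of RAM to the heap, capped at 32GB); the function name and the sample sizes are made up for illustration.

```python
MAX_HEAP_GB = 32  # heap ceiling implied by the 64GB-per-node recommendation

def split_memory(ram_gb):
    """Return (heap_gb, remaining_gb) under the half-of-RAM rule of thumb."""
    heap_gb = min(ram_gb / 2, MAX_HEAP_GB)
    return heap_gb, ram_gb - heap_gb

for ram_gb in (32, 48, 64, 96):
    heap_gb, remaining_gb = split_memory(ram_gb)
    print(f"{ram_gb}GB provisioned: {heap_gb:g}GB heap, {remaining_gb:g}GB left over")
```

Past 64GB the heap stops growing under this rule, so extra memory only adds to the non-heap side.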
jspink wrote:With what you see/know about our enviro, is 32GB not enough?
I would say it's entering the realm of "not enough". Mostly due to a memory management breaker being tripped within one of the nodes:
[2016-09-15 07:55:18,710][WARN ][indices.breaker ] [f3c66e59-29ad-439d-91c9-c2f2049ac660] [FIELDDATA] New used memory 10152273205 [9.4gb] from field [message] would be larger than configured breaker: 10099988889 [9.4gb], breaking
This is sort of a prelude to full-blown OutOfMemory exceptions, which generally result in shards failing and potential loss of data within indices.
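To see how far over the limit that request was, a small Python sketch can pull the numbers out of the log line. The regular expression is written against this one message only and is an assumption about the format, not a general Elasticsearch log parser.

```python
import re

# The breaker warning quoted above, copied verbatim.
LOG_LINE = (
    "[2016-09-15 07:55:18,710][WARN ][indices.breaker ] "
    "[f3c66e59-29ad-439d-91c9-c2f2049ac660] [FIELDDATA] New used memory "
    "10152273205 [9.4gb] from field [message] would be larger than "
    "configured breaker: 10099988889 [9.4gb], breaking"
)

BREAKER_RE = re.compile(
    r"\[(?P<breaker>[A-Z]+)\] New used memory (?P<used>\d+) .*?"
    r"from field \[(?P<field>[^\]]+)\] would be larger than "
    r"configured breaker: (?P<limit>\d+)"
)

def parse_breaker_trip(line):
    """Extract breaker name, field, requested bytes and limit from one line."""
    match = BREAKER_RE.search(line)
    if match is None:
        return None
    used = int(match.group("used"))
    limit = int(match.group("limit"))
    return {
        "breaker": match.group("breaker"),
        "field": match.group("field"),
        "used": used,
        "limit": limit,
        "over_by": used - limit,
    }

trip = parse_breaker_trip(LOG_LINE)
print(f"{trip['breaker']} breaker on field '{trip['field']}' "
      f"exceeded by {trip['over_by']} bytes")
```

Here the request was 52284316 bytes (roughly 50MB) over the configured limit, so the node was sitting right at the edge of its fielddata budget rather than far beyond it.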
jspink wrote:If we do add more memory, any config changes needed?
A simple restart of the elasticsearch and logstash processes would be required, but that is typically taken care of when you reboot the VM to add memory.
Re: Command Subsystem backup_maintenance job not completing
Posted: Thu Sep 22, 2016 1:15 pm
by jspink
In my logstash config, I have the following custom setting:
# Arguments to pass to java
LS_HEAP_SIZE="2048m"
LS_JAVA_OPTS="-Djava.io.tmpdir=$APP_DIR/tmp"
I don't recall exactly what the default setting is, but the 2048m is NOT default - any issue there that you can see?
*Edit: default is 256m
Re: Command Subsystem backup_maintenance job not completing
Posted: Thu Sep 22, 2016 1:22 pm
by mcapra
That should be fine especially if (at the transport layer) the events are being distributed among Logstash instances.
Logstash doesn't require a particularly large heap. Usually, the first place Logstash runs into issues is related to the LS_OPEN_FILES directive. There's a different set of Java exceptions that wind up in the logs if that is encountered though.
Re: Command Subsystem backup_maintenance job not completing
Posted: Thu Sep 22, 2016 1:26 pm
by jspink
mcapra wrote:That should be fine especially if (at the transport layer) the events are being distributed among Logstash instances.
Logstash doesn't require a particularly large heap. Usually, the first place Logstash runs into issues is related to the LS_OPEN_FILES directive. There's a different set of Java exceptions that wind up in the logs if that is encountered though.
In a previous post the LS_OPEN_FILES was already changed to 65000 (or so).
We'll work on the memory updates and report back.
Re: Command Subsystem backup_maintenance job not completing
Posted: Thu Sep 22, 2016 1:37 pm
by mcapra
Awesome! Let me know if the command subsystem continues to cause problems.
Re: Command Subsystem backup_maintenance job not completing
Posted: Tue Sep 27, 2016 10:05 am
by jspink
Took a slightly different approach to this. Rather than just throw RAM at it, we decided to change our maint settings.
Original was:
Close indexes older than 16 days
Delete indexes older than 17 days
New settings:
Close indexes older than 11 days
Delete indexes older than 17 days
Things seem to be working much better, and our average mem usage is sitting between 30% and 35% now.
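For anyone comparing the two policies, here is a minimal Python sketch that counts how many daily indexes end up open, closed, or deleted under each. It assumes one index per day and that "older than N days" means strictly more than N days old; both are assumptions for illustration, not a description of how the maintenance job is implemented.

```python
from datetime import date, timedelta

def index_state(index_day, today, close_after_days, delete_after_days):
    """Classify one daily index by age under a close/delete policy."""
    age_days = (today - index_day).days
    if age_days > delete_after_days:
        return "deleted"
    if age_days > close_after_days:
        return "closed"
    return "open"

def count_states(today, close_after_days, delete_after_days, history_days=30):
    """Count index states over the last `history_days` daily indexes."""
    counts = {"open": 0, "closed": 0, "deleted": 0}
    for age in range(history_days):
        index_day = today - timedelta(days=age)
        state = index_state(index_day, today, close_after_days, delete_after_days)
        counts[state] += 1
    return counts

today = date(2016, 9, 27)
original = count_states(today, close_after_days=16, delete_after_days=17)
updated = count_states(today, close_after_days=11, delete_after_days=17)
print("original:", original)
print("updated: ", updated)
```

Under these assumptions the change drops the number of open indexes from 17 to 12 while the amount of data kept on disk stays the same, since closed indexes are retained until the delete threshold.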
Re: Command Subsystem backup_maintenance job not completing
Posted: Tue Sep 27, 2016 10:27 am
by rkennedy
I suspect it all has to do with the total amount of RAM. Now that indexes are closing sooner, quite a bit of it is freed up. How much log data do you have for the past 11 days now?