Command Subsystem backup_maintenance job not completing

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
jspink
Posts: 43
Joined: Wed Nov 25, 2015 3:27 pm

Re: Command Subsystem backup_maintenance job not completing

Post by jspink »

[Attachment: 2016-09-22 08_56_00-Index Status · Nagios Log Server.png (screenshot not available)]
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Post by mcapra »

I agree with @rkennedy that running out of memory is probably the issue. Keep in mind that for each machine provisioned with 32GB of memory, only half of that is allocated to the regular storage tasks of Elasticsearch. The other half is reserved for Elasticsearch maintenance (and regular system) tasks.

If you run ps aux | grep java, you should see that the Xmx value for the Java process associated with Elasticsearch is roughly half of the available memory on the machine. It's important to leave enough memory to handle maintenance tasks for stability purposes.
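As a rough illustration of that check, the sketch below pulls the -Xmx value out of a Java command line (such as one copied from the ps aux output) and compares it with the machine's RAM. The sample command line and the function names are made up for the example; they are not taken from a real Nagios Log Server install.

```python
import re

# Multipliers for the size suffixes the JVM accepts on -Xmx.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_xmx_bytes(cmdline):
    """Return the -Xmx value in bytes, or None if the flag is absent."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", cmdline)
    if match is None:
        return None
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


def heap_fraction(xmx_bytes, ram_bytes):
    """Fraction of physical RAM handed to the Java heap."""
    return xmx_bytes / ram_bytes


# Hypothetical command line, shortened; a real one is much longer.
SAMPLE_CMDLINE = "/usr/bin/java -Xms16g -Xmx16g org.elasticsearch.bootstrap.Elasticsearch"

if __name__ == "__main__":
    ram_bytes = 32 * 1024 ** 3  # the 32GB node discussed in this thread
    xmx = parse_xmx_bytes(SAMPLE_CMDLINE)
    print(f"heap is {heap_fraction(xmx, ram_bytes):.0%} of RAM")
```

A result close to 50% matches the half-and-half split described above.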
Former Nagios employee
https://www.mcapra.com/
jspink
Posts: 43
Joined: Wed Nov 25, 2015 3:27 pm

Post by jspink »

mcapra wrote: I agree with @rkennedy that running out of memory is probably the issue. Keep in mind that for each machine provisioned with 32GB of memory, only half of that is allocated to the regular storage tasks of Elasticsearch. The other half is reserved for Elasticsearch maintenance (and regular system) tasks.

If you run ps aux | grep java, you should see that the Xmx value for the Java process associated with Elasticsearch is roughly half of the available memory on the machine. It's important to leave enough memory to handle maintenance tasks for stability purposes.
With what you see/know about our environment, is 32GB not enough? Being on VMware, we have plenty of RAM to throw at it if necessary to improve stability.
If we do add more memory, any config changes needed?
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Post by mcapra »

Just as a note, we typically recommend no more than 64GB per node. This is due to the Java heap becoming less efficient once it passes roughly 32GB (the JVM can no longer use compressed object pointers beyond that size). Following the previous logic, a node provisioned with 64GB would provide 32GB for your Elasticsearch storage and around 32GB for your Elasticsearch maintenance.
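To make that sizing rule concrete, here is a minimal sketch of "half of RAM, but never more than 32GB of heap". The function name and the cap parameter are illustrative only; this is not how Nagios Log Server itself computes the value.

```python
def suggested_heap_gb(ram_gb, cap_gb=32):
    """Half of the node's RAM, capped at cap_gb (rule of thumb from this thread)."""
    return min(ram_gb / 2, cap_gb)


if __name__ == "__main__":
    for ram_gb in (32, 64, 128):
        print(f"{ram_gb}GB node -> {suggested_heap_gb(ram_gb):g}GB heap")
```

Past 64GB of RAM the heap figure stops growing, which is why extra memory on a single node brings diminishing returns under this rule.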
jspink wrote: With what you see/know about our environment, is 32GB not enough?
I would say it's entering the realm of "not enough", mostly because the fielddata circuit breaker was tripped on one of the nodes:


[2016-09-15 07:55:18,710][WARN ][indices.breaker          ] [f3c66e59-29ad-439d-91c9-c2f2049ac660] [FIELDDATA] New used memory 10152273205 [9.4gb] from field [message] would be larger than configured breaker: 10099988889 [9.4gb], breaking
This is sort of a prelude to full-blown OutOfMemory exceptions, which generally result in shards failing and potential loss of data within indices.
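The numbers in that log line can be worked through as follows. This is only a sketch: the two byte counts come straight from the log, but the 60% figure is an assumption (the default fielddata breaker limit in Elasticsearch 1.x, as far as I know) and may not match this cluster's configuration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Values copied from the [FIELDDATA] breaker warning above.
NEW_USED_BYTES = 10152273205
BREAKER_LIMIT_BYTES = 10099988889

# Assumption: the breaker limit is 60% of the JVM heap.
ASSUMED_FIELDDATA_FRACTION = 0.60


def would_trip(new_used_bytes, limit_bytes):
    """The breaker refuses the load when the new total exceeds its limit."""
    return new_used_bytes > limit_bytes


def implied_heap_bytes(limit_bytes, fraction=ASSUMED_FIELDDATA_FRACTION):
    """Heap size that would produce this limit under the assumed fraction."""
    return limit_bytes / fraction


if __name__ == "__main__":
    over_by = NEW_USED_BYTES - BREAKER_LIMIT_BYTES
    print("trips breaker:", would_trip(NEW_USED_BYTES, BREAKER_LIMIT_BYTES))
    print(f"over the limit by about {over_by / MIB:.1f} MiB")
    print(f"implied heap: about {implied_heap_bytes(BREAKER_LIMIT_BYTES) / GIB:.1f} GiB")
```

Under that assumption the implied heap is a little under 16 GiB, which is consistent with a 32GB node giving half of its memory to the Elasticsearch heap.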
jspink wrote: If we do add more memory, any config changes needed?
A simple restart of the Elasticsearch and Logstash processes would be required, but that is typically taken care of when you reboot the VM to add memory.
jspink
Posts: 43
Joined: Wed Nov 25, 2015 3:27 pm

Post by jspink »

In my logstash config, I have the following custom setting:
# Arguments to pass to java
LS_HEAP_SIZE="2048m"
LS_JAVA_OPTS="-Djava.io.tmpdir=$APP_DIR/tmp"

I don't recall exactly what the default setting is, but the 2048m is NOT default - any issue there that you can see?

*Edit: default is 256m
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Post by mcapra »

That should be fine, especially if (at the transport layer) the events are being distributed among Logstash instances.

Logstash doesn't require a particularly large heap. Usually, the first place Logstash runs into issues is related to the LS_OPEN_FILES directive. There's a different set of Java exceptions that wind up in the logs if that is encountered though.
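One way to confirm the limit a running Logstash process actually received is to read /proc/<pid>/limits on Linux. The sketch below parses that file's "Max open files" row; the sample text is made up, and it assumes the init script applies LS_OPEN_FILES as the process's open-file limit.

```python
def _to_limit(value):
    """Convert a limits field to int; keep 'unlimited' as a string."""
    return int(value) if value.isdigit() else value


def parse_open_files_limit(limits_text):
    """Return (soft, hard) from the 'Max open files' row, or None if missing."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: "Max open files  <soft>  <hard>  files"
            soft, hard = line.split()[3:5]
            return _to_limit(soft), _to_limit(hard)
    return None


# Hypothetical excerpt of /proc/<pid>/limits for a Logstash process.
SAMPLE_LIMITS = """\
Limit                     Soft Limit           Hard Limit           Units
Max processes             63457                63457                processes
Max open files            65000                65000                files
"""

if __name__ == "__main__":
    # On a real node, read the file instead (the pid is yours to supply):
    #   with open(f"/proc/{pid}/limits") as fh: text = fh.read()
    print(parse_open_files_limit(SAMPLE_LIMITS))
```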
jspink
Posts: 43
Joined: Wed Nov 25, 2015 3:27 pm

Post by jspink »

mcapra wrote: That should be fine, especially if (at the transport layer) the events are being distributed among Logstash instances.

Logstash doesn't require a particularly large heap. Usually, the first place Logstash runs into issues is related to the LS_OPEN_FILES directive. There's a different set of Java exceptions that wind up in the logs if that is encountered though.
In a previous post, LS_OPEN_FILES was already changed to 65000 (or so).
We'll work on the memory updates and report back.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Post by mcapra »

Awesome! Let me know if the command subsystem continues to cause problems.
jspink
Posts: 43
Joined: Wed Nov 25, 2015 3:27 pm

Post by jspink »

Took a slightly different approach to this. Rather than just throw RAM at it, we decided to change our maintenance settings.

Original was:

Close indexes older than 16 days
Delete indexes older than 17 days

New settings:

Close indexes older than 11 days
Delete indexes older than 17 days

Things seem to be working much better, and our average memory usage is sitting between 30% and 35% now.
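A rough way to see the effect of that change is to count how many daily indices stay open under each setting. The sketch assumes one index per day and treats "older than N days" as strictly greater than N; the exact boundary handling in the maintenance job may differ.

```python
from collections import Counter


def index_state(age_days, close_after, delete_after):
    """Classify one daily index by age under the maintenance settings."""
    if age_days > delete_after:
        return "deleted"
    if age_days > close_after:
        return "closed"
    return "open"


def count_states(close_after, delete_after, horizon_days=20):
    """Tally index states for ages 0 .. horizon_days - 1."""
    return Counter(
        index_state(age, close_after, delete_after)
        for age in range(horizon_days)
    )


if __name__ == "__main__":
    print("original:", dict(count_states(close_after=16, delete_after=17)))
    print("new:     ", dict(count_states(close_after=11, delete_after=17)))
```

Under these assumptions the new settings keep five fewer daily indices open at any time, while the same amount of data remains on disk until the 17-day delete.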
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Post by rkennedy »

I suspect it all has to do with the total amount of RAM. Now that indices are closing sooner, quite a bit of it is freed up. How much log data do you have for the past 11 days now?
Former Nagios Employee