Command Subsystem backup_maintenance job not completing
For the past week or so, in our current 9-node cluster, the backup_maintenance job starts but never finishes. We end up having to close old indices and delete them to free up space, since the scheduled job never completes.
Is there something I can provide to assist in troubleshooting this?
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: Command Subsystem backup_maintenance job not completing
A good place to start is the elasticsearch logs (/var/log/elasticsearch/<cluster_id>.log). There are a few exceptions that elasticsearch will throw when it encounters issues like this, each with its own meaning.
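A quick way to surface those exceptions is to filter the log for warning and error lines. A minimal sketch (the two sample log lines are invented for illustration; in practice, read your actual /var/log/elasticsearch/<cluster_id>.log instead):

```python
# Sketch: pick out WARN/ERROR/Exception lines from an elasticsearch log.
# The sample text below is invented; read the real log file in practice.
import re

sample_log = """\
[2016-09-15 07:55:18,710][WARN ][indices.breaker ] [node-1] [FIELDDATA] breaker tripped
[2016-09-15 07:55:19,001][INFO ][cluster.service ] [node-1] routine status message
"""

pattern = re.compile(r"WARN|ERROR|Exception")
problems = [line for line in sample_log.splitlines() if pattern.search(line)]
for line in problems:
    print(line)
```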
Also please share the full output of:
Code: Select all
curator --dry-run --debug snapshot --repository <repository_name> indices --older-than 1 --time-unit days --timestring %Y.%m.%d
Replacing <repository_name> with the name of your repository.
Former Nagios employee
https://www.mcapra.com/
Re: Command Subsystem backup_maintenance job not completing
mcapra wrote:
A good place to start is the elasticsearch logs (/var/log/elasticsearch/<cluster_id>.log). There are a few exceptions that elasticsearch will throw when it encounters issues like this, each with its own meaning.
Also please share the full output of:
Code: Select all
curator --dry-run --debug snapshot --repository <repository_name> indices --older-than 1 --time-unit days --timestring %Y.%m.%d
Replacing <repository_name> with the name of your repository.
For further clarification on that: right now we are not completing any backups, due to the sheer size and the lack of compression, so no repository exists at the moment.
We depend on this backup_maintenance job to execute the following:
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: Command Subsystem backup_maintenance job not completing
Ah, gotcha - misunderstanding on my part.
New set of curator commands, then - these are basically the commands that get run as part of the backup_maintenance job:
Code: Select all
curator --dry-run --debug optimize indices --older-than 2 --time-unit days --timestring %Y.%m.%d
curator --dry-run --debug close indices --older-than 16 --time-unit days --timestring %Y.%m.%d
curator --dry-run --debug delete indices --older-than 17 --time-unit days --timestring %Y.%m.%d
Former Nagios employee
https://www.mcapra.com/
Re: Command Subsystem backup_maintenance job not completing
Attached txt files with sequential names matching each of the three commands.
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: Command Subsystem backup_maintenance job not completing
Nothing interesting in the curator outputs. I'd like to see the elasticsearch logs as well:
Code: Select all
/var/log/elasticsearch/<cluster_id>.log
You may need to go a bit farther back in the logs depending on when the issue happened. All of the archived logs should have a YYYYMMDD timestamp appended to them. If the issue happened as recently as the 16th, I would be interested in seeing the logs for that particular day.
Former Nagios employee
https://www.mcapra.com/
Re: Command Subsystem backup_maintenance job not completing
See attached GZ log file - I believe this should be what you were looking for.
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: Command Subsystem backup_maintenance job not completing
How much memory does each instance have allocated now?
It looks like the issue is related to running out of memory - not quite crashing, since the circuit 'breaker' steps in:
Code: Select all
[2016-09-15 07:55:18,710][WARN ][indices.breaker ] [f3c66e59-29ad-439d-91c9-c2f2049ac660] [FIELDDATA] New used memory 10152273205 [9.4gb] from field [message] would be larger than configured breaker: 10099988889 [9.4gb], breaking
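Pulling the two byte counts out of that warning shows how close to the fielddata limit the node was. A minimal sketch, parsing the exact log line quoted above:

```python
# Sketch: extract the requested and configured byte counts from the
# breaker warning and compute the overshoot. The log line is the one
# quoted in this thread.
import re

line = ("[2016-09-15 07:55:18,710][WARN ][indices.breaker ] "
        "[f3c66e59-29ad-439d-91c9-c2f2049ac660] [FIELDDATA] "
        "New used memory 10152273205 [9.4gb] from field [message] "
        "would be larger than configured breaker: 10099988889 [9.4gb], breaking")

# The only runs of 6+ digits in the line are the two byte counts.
used, limit = (int(n) for n in re.findall(r"\d{6,}", line))
print(f"requested: {used / 2**30:.2f} GiB")
print(f"limit:     {limit / 2**30:.2f} GiB")
print(f"over by:   {(used - limit) / 2**20:.1f} MiB")
```

So the fielddata request only overshot the breaker by a few tens of MiB - the node was running right at its limit rather than wildly past it.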
Former Nagios Employee
Re: Command Subsystem backup_maintenance job not completing
Each instance has 32GB allocated.
Interestingly, our Friday logs are gone (from about 10pm night before to about 6pm Friday Sept 16), but now the maint job seems to be working as expected since Monday night.
No changes, no reboots.
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: Command Subsystem backup_maintenance job not completing
If the problem persists, the first place I would check is the elasticsearch log to see if there are any meaningful Java exceptions being caught.
What are the average sizes of your indices? If you could share a screenshot of the "Index Status" page, that would be helpful in evaluating if the machines need additional resources.
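If a screenshot is awkward, the same numbers can be computed from elasticsearch's cat API output (e.g. `curl 'localhost:9200/_cat/indices?bytes=b'`). A minimal sketch, with invented sample rows and assuming the default column order where `store.size` is the eighth column:

```python
# Sketch: average index size from `_cat/indices?bytes=b` output.
# The two sample rows below are invented for illustration; in practice,
# parse the real output saved from curl.
sample = """\
green open logstash-2016.09.14 5 1 12000000 0 32212254720 16106127360
green open logstash-2016.09.15 5 1 11500000 0 30064771072 15032385536
"""

# Assumed default columns: health status index pri rep docs.count
#                          docs.deleted store.size pri.store.size
sizes = [int(line.split()[7]) for line in sample.strip().splitlines()]
avg_gib = sum(sizes) / len(sizes) / 2**30
print(f"average index size across {len(sizes)} indices: {avg_gib:.1f} GiB")
```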
Former Nagios employee
https://www.mcapra.com/