Command Subsystem backup_maintenance job not completing
Posted: Fri Sep 16, 2016 12:06 pm
by jspink
For the past week or so, in our current 9 node cluster, the backup_maintenance job starts but never finishes. We end up having to close and delete old indices ourselves to free up space, as the scheduled job never completes.
Is there something I can provide to assist in troubleshooting this?
Re: Command Subsystem backup_maintenance job not completing
Posted: Fri Sep 16, 2016 12:23 pm
by mcapra
A good place to start is the elasticsearch logs (/var/log/elasticsearch/<cluster_id>.log). There are a few exceptions that elasticsearch will throw when it encounters issues like this, each with its own meaning.
Also please share the full output of:
Code:
curator --dry-run --debug snapshot --repository <repository_name> indices --older-than 1 --time-unit days --timestring %Y.%m.%d
Replacing <repository_name> with the name of your repository.
Re: Command Subsystem backup_maintenance job not completing
Posted: Fri Sep 16, 2016 12:28 pm
by jspink
mcapra wrote: A good place to start is the elasticsearch logs (/var/log/elasticsearch/<cluster_id>.log). There are a few exceptions that elasticsearch will throw when it encounters issues like this, each with its own meaning.
Also please share the full output of:
Code:
curator --dry-run --debug snapshot --repository <repository_name> indices --older-than 1 --time-unit days --timestring %Y.%m.%d
Replacing <repository_name> with the name of your repository.
To clarify that: right now we are not completing any backups, due to the sheer size and the lack of compression, so no repository exists at the moment.
We are depending on this backup_maintenance job to execute the following:
2016-09-16 13_28_05-Backup _ Maintenance · Nagios Log Server.png
Re: Command Subsystem backup_maintenance job not completing
Posted: Fri Sep 16, 2016 12:51 pm
by mcapra
Ah gotcha, misunderstanding on my part.
New set of curator commands then:
Code:
curator --dry-run --debug optimize indices --older-than 2 --time-unit days --timestring %Y.%m.%d
curator --dry-run --debug close indices --older-than 16 --time-unit days --timestring %Y.%m.%d
curator --dry-run --debug delete indices --older-than 17 --time-unit days --timestring %Y.%m.%d
These are basically the commands that get run as part of the backup_maintenance job.
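To make the three age thresholds above easier to read, here is a rough sketch that turns each --older-than value into an approximate cutoff date in the same %Y.%m.%d format. It assumes GNU date is available; cutoff_date is a made-up helper name, not part of curator, and the exact boundary handling is up to curator itself.

```shell
# Sketch only: approximate the cutoff date for an --older-than value.
# Assumes GNU date; cutoff_date is a hypothetical helper, not a curator command.
cutoff_date() {
  # $1 = reference date (YYYY-MM-DD), $2 = age in days
  date -u -d "$1 $2 days ago" +%Y.%m.%d
}

today=$(date -u +%Y-%m-%d)
echo "optimize: indices dated before about $(cutoff_date "$today" 2)"
echo "close:    indices dated before about $(cutoff_date "$today" 16)"
echo "delete:   indices dated before about $(cutoff_date "$today" 17)"
```

With those settings an index would be closed roughly one day before it becomes eligible for deletion.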
Re: Command Subsystem backup_maintenance job not completing
Posted: Fri Sep 16, 2016 3:10 pm
by jspink
Attached are txt files, named sequentially to match each of the 3 commands:
curator1.txt
curator2.txt
curator3.txt
Re: Command Subsystem backup_maintenance job not completing
Posted: Mon Sep 19, 2016 9:16 am
by mcapra
Nothing interesting in the curator outputs. I'd like to see the elasticsearch logs as well:
Code:
/var/log/elasticsearch/<cluster_id>.log
You may need to go a bit farther back in the logs depending on when the issue happened. All of the archived logs should have a YYYYMMDD timestamp appended to them. If the issue happened as recently as the 16th, I would be interested in seeing the logs for that particular day.
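If it helps to narrow the log down before attaching it, something along these lines should pull out only the warnings, errors, and exception lines. This is a sketch: scan_es_log is a made-up helper name, and the pattern assumes the bracketed [WARN ] / [ERROR] level format that elasticsearch writes in this log. zgrep reads both plain and gzipped files.

```shell
# Sketch: list WARN/ERROR entries and exception lines from an elasticsearch log.
# scan_es_log is a hypothetical helper; zgrep handles plain and gzip'd files alike.
scan_es_log() {
  # $1 = path to a current or archived log file
  zgrep -E ']\[(WARN |ERROR)]|Exception' "$1"
}

# Example (substitute your own cluster ID and date):
# scan_es_log "/var/log/elasticsearch/<cluster_id>.log-20160916.gz"
```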
Re: Command Subsystem backup_maintenance job not completing
Posted: Mon Sep 19, 2016 12:04 pm
by jspink
See attached GZ log file. I believe this should be what you were looking for.
5bf474f6-3664-4f18-a80b-d7a3ac03f8ef.log-20160916.gz
Re: Command Subsystem backup_maintenance job not completing
Posted: Mon Sep 19, 2016 1:07 pm
by rkennedy
How much memory does each instance have allocated now?
It looks like the issues are related to running out of memory, but not quite crashing, as the 'breaker' steps in:
Code:
[2016-09-15 07:55:18,710][WARN ][indices.breaker ] [f3c66e59-29ad-439d-91c9-c2f2049ac660] [FIELDDATA] New used memory 10152273205 [9.4gb] from field [message] would be larger than configured breaker: 10099988889 [9.4gb], breaking
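To see how often this happens and by how much, a small sketch like the one below can summarise the breaker warnings. Both function names are made up for illustration, and the patterns are based only on the format of the line above, so treat it as a starting point rather than a tested tool.

```shell
# Sketch: summarise FIELDDATA breaker warnings from log lines on stdin.
# Both helper names are hypothetical; the patterns follow the log line shown above.
breaker_trips_by_field() {
  # Prints "<field> <count>" for each field that tripped the breaker.
  sed -n 's/.*\[FIELDDATA] New used memory .* from field \[\([^]]*\)] would be larger.*/\1/p' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

breaker_overage() {
  # Prints, per warning, how many bytes the request was over the configured limit.
  sed -n 's/.*New used memory \([0-9][0-9]*\) .*configured breaker: \([0-9][0-9]*\) .*/\1 \2/p' |
    while read -r used limit; do echo $(( $used - $limit )); done
}
```

For an archived log, the input could come from something like zcat on the .gz file piped into either function. For the single line quoted above, the request was about 52 MB over the limit.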
Re: Command Subsystem backup_maintenance job not completing
Posted: Tue Sep 20, 2016 11:56 am
by jspink
Each instance has 32GB allocated.
Interestingly, our Friday logs are gone (from about 10pm night before to about 6pm Friday Sept 16), but now the maint job seems to be working as expected since Monday night.
No changes, no reboots.
Re: Command Subsystem backup_maintenance job not completing
Posted: Tue Sep 20, 2016 1:11 pm
by mcapra
If the problem persists, the first place I would check is the elasticsearch log to see if there are any meaningful Java exceptions being caught.
What are the average sizes of your indices? If you could share a screenshot of the "Index Status" page, that would be helpful in evaluating if the machines need additional resources.