Command Subsystem backup_maintenance job not completing
For the past week or so, in our current 9-node cluster, the backup_maintenance job starts but never finishes. We end up having to close old indices and delete them to free up space, since the scheduled job never completes.
Is there something I can provide to assist in troubleshooting this?
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: Command Subsystem backup_maintenance job not completing
A good place to start is the elasticsearch logs (/var/log/elasticsearch/<cluster_id>.log). There are a few exceptions that elasticsearch will throw when it encounters issues like this, each with its own meaning.
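A quick way to surface those exceptions is to filter the log for warning and error lines. A minimal sketch (the two sample log lines are invented for illustration; in practice, read your actual /var/log/elasticsearch/<cluster_id>.log instead):

```python
# Sketch: pick out WARN/ERROR/Exception lines from an elasticsearch log.
# The sample text below is invented; read the real log file in practice.
import re

sample_log = """\
[2016-09-15 07:55:18,710][WARN ][indices.breaker ] [node-1] [FIELDDATA] breaker tripped
[2016-09-15 07:55:19,001][INFO ][cluster.service ] [node-1] routine status message
"""

pattern = re.compile(r"WARN|ERROR|Exception")
problems = [line for line in sample_log.splitlines() if pattern.search(line)]
for line in problems:
    print(line)
```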
Also please share the full output of:
Code: Select all
curator --dry-run --debug snapshot --repository <repository_name> indices --older-than 1 --time-unit days --timestring %Y.%m.%d
Replacing <repository_name> with the name of your repository.
Former Nagios employee
https://www.mcapra.com/
Re: Command Subsystem backup_maintenance job not completing
mcapra wrote:
A good place to start is the elasticsearch logs (/var/log/elasticsearch/<cluster_id>.log). There are a few exceptions that elasticsearch will throw when it encounters issues like this, each with its own meaning.
Also please share the full output of:
Code: Select all
curator --dry-run --debug snapshot --repository <repository_name> indices --older-than 1 --time-unit days --timestring %Y.%m.%d
Replacing <repository_name> with the name of your repository.
For further clarification on that: right now we are not completing any backups, due to the sheer size and the lack of compression, so no repository exists at the moment.
We depend on this backup_maintenance job to execute the following:
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: Command Subsystem backup_maintenance job not completing
Ah, gotcha - misunderstanding on my part.
New set of curator commands, then - these are basically the commands that get run as part of the backup_maintenance job:
Code: Select all
curator --dry-run --debug optimize indices --older-than 2 --time-unit days --timestring %Y.%m.%d
curator --dry-run --debug close indices --older-than 16 --time-unit days --timestring %Y.%m.%d
curator --dry-run --debug delete indices --older-than 17 --time-unit days --timestring %Y.%m.%d
Former Nagios employee
https://www.mcapra.com/
Re: Command Subsystem backup_maintenance job not completing
Attached txt files with sequential names matching each of the three commands.
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: Command Subsystem backup_maintenance job not completing
Nothing interesting in the curator outputs. I'd like to see the elasticsearch logs as well:
Code: Select all
/var/log/elasticsearch/<cluster_id>.log
You may need to go a bit farther back in the logs depending on when the issue happened. All of the archived logs should have a YYYYMMDD timestamp appended to them. If the issue happened as recently as the 16th, I would be interested in seeing the logs for that particular day.
Former Nagios employee
https://www.mcapra.com/
Re: Command Subsystem backup_maintenance job not completing
See attached GZ log file - I believe this should be what you were looking for.
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: Command Subsystem backup_maintenance job not completing
How much memory does each instance have allocated now?
It looks like the issue is related to running out of memory - not quite crashing, since the circuit 'breaker' steps in:
Code: Select all
[2016-09-15 07:55:18,710][WARN ][indices.breaker ] [f3c66e59-29ad-439d-91c9-c2f2049ac660] [FIELDDATA] New used memory 10152273205 [9.4gb] from field [message] would be larger than configured breaker: 10099988889 [9.4gb], breaking
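Pulling the two byte counts out of that warning shows how close to the fielddata limit the node was. A minimal sketch, parsing the exact log line quoted above:

```python
# Sketch: extract the requested and configured byte counts from the
# breaker warning and compute the overshoot. The log line is the one
# quoted in this thread.
import re

line = ("[2016-09-15 07:55:18,710][WARN ][indices.breaker ] "
        "[f3c66e59-29ad-439d-91c9-c2f2049ac660] [FIELDDATA] "
        "New used memory 10152273205 [9.4gb] from field [message] "
        "would be larger than configured breaker: 10099988889 [9.4gb], breaking")

# The only runs of 6+ digits in the line are the two byte counts.
used, limit = (int(n) for n in re.findall(r"\d{6,}", line))
print(f"requested: {used / 2**30:.2f} GiB")
print(f"limit:     {limit / 2**30:.2f} GiB")
print(f"over by:   {(used - limit) / 2**20:.1f} MiB")
```

So the fielddata request only overshot the breaker by a few tens of MiB - the node was running right at its limit rather than wildly past it.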
Former Nagios Employee
Re: Command Subsystem backup_maintenance job not completing
Each instance has 32GB allocated.
Interestingly, our Friday logs are gone (from about 10pm night before to about 6pm Friday Sept 16), but now the maint job seems to be working as expected since Monday night.
No changes, no reboots.
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: Command Subsystem backup_maintenance job not completing
If the problem persists, the first place I would check is the elasticsearch log to see if there are any meaningful Java exceptions being caught.
What are the average sizes of your indices? If you could share a screenshot of the "Index Status" page, that would be helpful in evaluating if the machines need additional resources.
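If a screenshot is awkward, the same numbers can be computed from elasticsearch's cat API output (e.g. `curl 'localhost:9200/_cat/indices?bytes=b'`). A minimal sketch, with invented sample rows and assuming the default column order where `store.size` is the eighth column:

```python
# Sketch: average index size from `_cat/indices?bytes=b` output.
# The two sample rows below are invented for illustration; in practice,
# parse the real output saved from curl.
sample = """\
green open logstash-2016.09.14 5 1 12000000 0 32212254720 16106127360
green open logstash-2016.09.15 5 1 11500000 0 30064771072 15032385536
"""

# Assumed default columns: health status index pri rep docs.count
#                          docs.deleted store.size pri.store.size
sizes = [int(line.split()[7]) for line in sample.strip().splitlines()]
avg_gib = sum(sizes) / len(sizes) / 2**30
print(f"average index size across {len(sizes)} indices: {avg_gib:.1f} GiB")
```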
Former Nagios employee
https://www.mcapra.com/