NLS Cluster backups issue
-
gsl_ops_practice
- Posts: 151
- Joined: Thu Apr 09, 2015 9:14 pm
NLS Cluster backups issue
Hello,
We are experiencing the following situation and would like to find a fix, please:
1. 2-node NLS Cluster, version 1.4.0; backup is done to the NFS mount /nfs_nagioslog_backups, and backups are taken daily without issue.
2. The retention interval for backups is 330 days, but on /nfs_nagioslog_backups there are backups much older, almost 500 days.
3. Since backups older than 330 days were not deleted, and we ran out of unallocated disk space to add to the backup volume, I deleted older backups manually from inside the indices directory.
4. After this, NLS stopped taking backups.
5. We upgraded to 1.4.4 last week; there were some fixes for backup issues, and we hoped this would resolve our issue, but backups are still not happening.
Please advise what we can do to resolve. Attaching screenshot of the backup config.
Re: NLS Cluster backups issue
gsl_ops_practice wrote: 2. Retention interval for backups is 330 days, but on /nfs_nagioslog_backups there are backups much older, almost 500 days.
I'll address this point first. When you adjust the "Delete backups older than" setting, what you are actually controlling is how long you wish to keep snapshots. With a setting of 330 days, you are keeping 330 days' worth of elasticsearch snapshots, not individual indices. An elasticsearch snapshot is exactly that: a snapshot of the current state of elasticsearch. Snapshots contain every index currently open in elasticsearch. Therefore, if you are keeping 120 days of indices open at any given moment, and keeping 330 days' worth of snapshots, you will have an effective ~450 days' worth of retention (330 + 120).
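That arithmetic can be sketched as a quick sanity check; the 120-day open-index window and 330-day snapshot retention are the values from this thread, so substitute your own:

```shell
# Effective retention is roughly the open-index window plus snapshot retention,
# because each snapshot contains every index still open when it was taken.
open_index_days=120          # how long indices stay open (an assumed example value)
snapshot_retention_days=330  # the "Delete backups older than" setting
echo $((open_index_days + snapshot_retention_days))  # days of effective retention
```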
gsl_ops_practice wrote: Please advise what we can do to resolve.
Can you share the output of the following command, executed from the CLI of your Nagios Log Server machine (replace my_repository with the name of your repository):
Code: Select all
curator --debug snapshot --repository my_repository indices --older-than 1 --time-unit days --timestring %Y.%m.%d
Former Nagios employee
https://www.mcapra.com/
-
gsl_ops_practice
- Posts: 151
- Joined: Thu Apr 09, 2015 9:14 pm
Re: NLS Cluster backups issue
I learned something new!
120 + 330 does add up to what I was seeing before; I will have to tweak that once the backups start working properly.
I am attaching the output of the command in a text file; it is 263k. At the end of the command there was an error:
[root@nagioslog-1-nnn~]# curator --debug snapshot --repository my_repository indices --older-than 1 --time-unit days --timestring %Y.%m.%d > nagioslog_support.txt
Traceback (most recent call last):
File "/usr/bin/curator", line 11, in <module>
sys.exit(main())
File "/usr/lib/python2.6/site-packages/curator/curator.py", line 5, in main
cli( obj={ "filters": [] } )
File "/usr/lib/python2.6/site-packages/click/core.py", line 716, in __call__
return self.main(*args, **kwargs)
File "/usr/lib/python2.6/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/usr/lib/python2.6/site-packages/click/core.py", line 1060, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/lib/python2.6/site-packages/click/core.py", line 1060, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/lib/python2.6/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python2.6/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/usr/lib64/python2.6/contextlib.py", line 34, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python2.6/site-packages/click/core.py", line 86, in augment_usage_errors
yield
File "/usr/lib/python2.6/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python2.6/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/usr/lib/python2.6/site-packages/curator/cli/index_selection.py", line 167, in indices
retval = do_command(client, ctx.parent.info_name, working_list, ctx.parent.params, master_timeout)
File "/usr/lib/python2.6/site-packages/curator/cli/utils.py", line 250, in do_command
skip_repo_validation=params['skip_repo_validation'],
File "/usr/lib/python2.6/site-packages/curator/api/snapshot.py", line 72, in create_snapshot
if name in all_snaps:
TypeError: argument of type 'bool' is not iterable
Re: NLS Cluster backups issue
gsl_ops_practice wrote: I learned something new. 120 + 330 does add up to what I was seeing before, will have to tweak that once the backups start working properly.
The language used is admittedly confusing. You're not the first person to ask these sorts of questions.
Has the output you attached been altered in any way? It looks as if you ran the snapshot against my_repository. If that is your actual repository's name (it's blacked out in your original post, so I have no way of knowing), it looks as if you have corrupted snapshots:
Code: Select all
2016-12-06 18:39:01,802 WARNING elasticsearch log_request_fail:82 GET /_snapshot/my_repository/_all [status:400 request:0.444s]
2016-12-06 18:39:01,802 DEBUG elasticsearch log_request_fail:90 > None
2016-12-06 18:39:01,802 ERROR curator.api.utils get_snapshots:254 Unable to find all snapshots in repository: my_repository
Former Nagios employee
https://www.mcapra.com/
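That 400 from curator's GET can also be reproduced without curator by querying the elasticsearch snapshot API directly. A minimal sketch; localhost:9200 is the default endpoint on a Nagios Log Server node, and my_repository stands in for the real repository name:

```shell
# Build the snapshot-listing endpoint that curator calls under the hood.
repo="my_repository"   # assumption: replace with your real repository name
url="localhost:9200/_snapshot/${repo}/_all?pretty"
echo "$url"
# curl -s "$url"   # run this on a cluster node; a 400 response here confirms
#                  # elasticsearch itself cannot read one or more snapshots
```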
-
gsl_ops_practice
- Posts: 151
- Joined: Thu Apr 09, 2015 9:14 pm
Re: NLS Cluster backups issue
Hi, I did a sed replace on the name of my actual repository, nothing else was altered.
Re: NLS Cluster backups issue
In that case, I would suggest creating a new repository and attempting another snapshot using the previously mentioned command. Whether the snapshot against the new repository succeeds or fails will help identify whether it's an issue specific to the repository or whether something within curator/elasticsearch is failing.
Former Nagios employee
https://www.mcapra.com/
-
gsl_ops_practice
- Posts: 151
- Joined: Thu Apr 09, 2015 9:14 pm
Re: NLS Cluster backups issue
Having one corrupted snapshot is something I can live with. I was trying to restore data that was 5 months old to run an analysis, but that wasn't possible. Can we keep any of the archived snapshots, or are you saying we need to start fresh?
-
gsl_ops_practice
- Posts: 151
- Joined: Thu Apr 09, 2015 9:14 pm
Re: NLS Cluster backups issue
Creating a new repository is not very straightforward due to storage constraints. Can we use a new directory inside the existing NFS mount as a new repository?
Re: NLS Cluster backups issue
gsl_ops_practice wrote: can we use a new directory inside the existing NFS mount as a new repository?
That should be fine. The important thing is testing against a clean slate from the perspective of curator and elasticsearch.
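Registering a subdirectory of the existing mount as a fresh fs-type repository could look like the sketch below. The repository name and subdirectory are illustrative, and the directory must already exist and be writable by elasticsearch:

```shell
# Compose the repository-registration request body (fs type, new subdirectory).
repo="test_repository"                     # illustrative name, not from the thread
location="/nfs_nagioslog_backups/${repo}"  # new directory on the existing NFS mount
body='{"type": "fs", "settings": {"location": "'"$location"'", "compress": true}}'
echo "$body"
# curl -s -XPUT "localhost:9200/_snapshot/${repo}" -d "$body"   # run on a cluster node
```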
Former Nagios employee
https://www.mcapra.com/
-
gsl_ops_practice
- Posts: 151
- Joined: Thu Apr 09, 2015 9:14 pm
Re: NLS Cluster backups issue
Sounds good. I will set up a new repository, and we will see tomorrow whether the backup job runs normally with a clean slate.