NLS Cluster backups issue

Post by **gsl_ops_practice** » Tue Dec 06, 2016 9:31 am

Hello,

We are experiencing the following situation and would like to find a fix:

1. 2-node NLS cluster, version 1.4.0. Backups go to the NFS mount /nfs_nagioslog_backups and were taken daily without problems.
2. Retention interval for backups is 330 days, but on /nfs_nagioslog_backups there are backups much older, almost 500 days.
3. Because backups older than 330 days were not being deleted, and we had run out of unallocated disk space to add to the backup volume, I deleted the older backups manually from inside the indices directory.
4. After this, NLS stopped taking backups.
5. We upgraded to 1.4.4 last week. It included some fixes for backup issues and we hoped it would resolve ours, but backups are still not happening.

Please advise what we can do to resolve. Attaching screenshot of the backup config.

Post by **mcapra** » Tue Dec 06, 2016 11:49 am

gsl_ops_practice wrote: 2. Retention interval for backups is 330 days, but on /nfs_nagioslog_backups there are backups much older, almost 500 days.

I'll address this point first. When you adjust the "Delete backups older than" setting, what you are actually setting is how long snapshots are kept. With a setting of 330 days, you are keeping 330 days' worth of elasticsearch snapshots, not individual indices. An elasticsearch snapshot is exactly that: a snapshot of the current state of elasticsearch. Each snapshot contains every index that was open in elasticsearch when it was taken. Therefore, if you keep 120 days of indices open at any given moment and keep 330 days' worth of snapshots, your effective retention is roughly 450 days (330 + 120).
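The arithmetic above can be written out as a small sketch. The function names are hypothetical and are not part of Nagios Log Server or curator; the figures are the ones quoted in this thread. The second helper assumes you want to work backwards from a target retention to a value for the "Delete backups older than" setting.

```python
def effective_retention_days(snapshot_retention_days, open_index_days):
    """Age in days of the oldest data still held somewhere in the repository."""
    # The oldest snapshot kept is `snapshot_retention_days` old, and when it
    # was taken it already contained indices up to `open_index_days` old.
    return snapshot_retention_days + open_index_days


def snapshot_retention_for(target_days, open_index_days):
    """Snapshot retention needed so that total retention is about `target_days`."""
    # Never return a negative setting if more indices are open than the target.
    return max(0, target_days - open_index_days)


if __name__ == "__main__":
    # 330 days of snapshots on top of 120 days of open indices.
    print(effective_retention_days(330, 120))
    # Snapshot retention that would give roughly 330 days in total.
    print(snapshot_retention_for(330, 120))
```

With 120 days of open indices, a snapshot retention of about 210 days would bring the total back to roughly 330 days.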

gsl_ops_practice wrote: Please advise what we can do to resolve.

Can you share the output of the following command executed from the CLI of your Nagios Log Server machine (replace my_repository with the name of your repository):

curator --debug snapshot --repository my_repository indices --older-than 1 --time-unit days --timestring %Y.%m.%d

Post by **gsl_ops_practice** » Tue Dec 06, 2016 1:48 pm

I learned something new.

120 + 330 does add up to what I was seeing before, will have to tweak that once the backups start working properly.

I am attaching the output of the command as a text file (263k). At the end of the run there was an error:

[root@nagioslog-1-nnn~]# curator --debug snapshot --repository my_repository indices --older-than 1 --time-unit days --timestring %Y.%m.%d > nagioslog_support.txt
Traceback (most recent call last):
  File "/usr/bin/curator", line 11, in <module>
    sys.exit(main())
  File "/usr/lib/python2.6/site-packages/curator/curator.py", line 5, in main
    cli( obj={ "filters": [] } )
  File "/usr/lib/python2.6/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python2.6/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python2.6/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python2.6/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python2.6/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib64/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.6/site-packages/click/core.py", line 86, in augment_usage_errors
    yield
  File "/usr/lib/python2.6/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/curator/cli/index_selection.py", line 167, in indices
    retval = do_command(client, ctx.parent.info_name, working_list, ctx.parent.params, master_timeout)
  File "/usr/lib/python2.6/site-packages/curator/cli/utils.py", line 250, in do_command
    skip_repo_validation=params['skip_repo_validation'],
  File "/usr/lib/python2.6/site-packages/curator/api/snapshot.py", line 72, in create_snapshot
    if name in all_snaps:
TypeError: argument of type 'bool' is not iterable

Post by **mcapra** » Tue Dec 06, 2016 2:10 pm

gsl_ops_practice wrote: I learned something new. 120 + 330 does add up to what I was seeing before, will have to tweak that once the backups start working properly.

The language used is admittedly confusing. You're not the first person to ask these sorts of questions.

Has the output you attached been altered in any way? It looks as if you ran the snapshot against my_repository. If that is your actual repository's name (it's blacked out in your original post, so I have no way of knowing), it looks as if you have corrupted snapshots:

2016-12-06 18:39:01,802 WARNING            elasticsearch       log_request_fail:82   GET /_snapshot/my_repository/_all [status:400 request:0.444s]
2016-12-06 18:39:01,802 DEBUG              elasticsearch       log_request_fail:90   > None
2016-12-06 18:39:01,802 ERROR          curator.api.utils          get_snapshots:254  Unable to find all snapshots in repository: my_repository

I would suggest creating a new repository and attempting another snapshot using the previously mentioned command.
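The failing request in the log excerpt above (GET /_snapshot/my_repository/_all) can also be sent directly to elasticsearch, which sometimes gives a more readable error than the curator traceback. The following is a minimal sketch, not something taken from Nagios Log Server itself. It assumes Python 3 (the curator on the server runs under Python 2.6) and assumes elasticsearch is reachable at http://localhost:9200, which is an assumption about your setup. The function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def snapshot_listing_url(repository, base_url="http://localhost:9200"):
    # Same endpoint that appears in the curator debug log.
    # base_url is an assumption; adjust it to wherever elasticsearch listens.
    return "{0}/_snapshot/{1}/_all".format(base_url.rstrip("/"), repository)


def list_snapshots(repository, base_url="http://localhost:9200"):
    """Return (status, body) for the snapshot listing request."""
    url = snapshot_listing_url(repository, base_url)
    try:
        with urllib.request.urlopen(url) as resp:
            # Success: the body is JSON describing the snapshots.
            return resp.status, json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # An error status (such as the 400 in the log) lands here; the raw
        # body is returned as text because it may not be the expected JSON.
        return err.code, err.read().decode("utf-8", "replace")
```

Calling `list_snapshots("my_repository")` with your real repository name returns the HTTP status together with either the parsed listing or the raw error text. A connection failure is not caught here and will raise `urllib.error.URLError`.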

Post by **gsl_ops_practice** » Tue Dec 06, 2016 2:12 pm

Hi, I did a sed replace on the name of my actual repository, nothing else was altered.

Post by **mcapra** » Tue Dec 06, 2016 2:14 pm

In that case:

mcapra wrote: I would suggest creating a new repository and attempting another snapshot using the previously mentioned command.

Whether the snapshot against the new repository succeeds or fails will help identify whether the issue is specific to the repository or whether something within curator/elasticsearch is failing.

Post by **gsl_ops_practice** » Tue Dec 06, 2016 2:15 pm

Having one corrupted snapshot is something I can live with. I was trying to restore data that was 5 months old to run an analysis but that wasn't possible. Can we keep any of the archived snapshots or are you saying we need to start fresh?

Post by **gsl_ops_practice** » Tue Dec 06, 2016 2:19 pm

Creating a new repository is not very straightforward due to storage constraints. Can we use a new directory inside the existing NFS mount as a new repository?

Post by **mcapra** » Tue Dec 06, 2016 2:28 pm

gsl_ops_practice wrote:can we use a new directory inside the existing NFS mount as a new repository?

That should be fine. The important thing is testing against a clean slate from the perspective of curator and elasticsearch.
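For reference, registering a shared-filesystem repository comes down to a single PUT against the elasticsearch snapshot API. In Nagios Log Server this is normally done from the web interface, so treat the following only as a sketch of what such a request looks like. It assumes Python 3 and elasticsearch on http://localhost:9200. The repository name `new_repository` and the subdirectory under /nfs_nagioslog_backups are hypothetical. The function builds the request without sending it.

```python
import json
import urllib.request


def build_register_repo_request(name, location, base_url="http://localhost:9200"):
    """Build (but do not send) a PUT that registers an "fs" snapshot repository."""
    # "fs" is elasticsearch's shared-filesystem repository type; `location`
    # must be a directory that every node in the cluster can write to.
    body = {"type": "fs", "settings": {"location": location, "compress": True}}
    return urllib.request.Request(
        url="{0}/_snapshot/{1}".format(base_url.rstrip("/"), name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical name and path: a fresh directory inside the existing NFS mount.
request = build_register_repo_request(
    "new_repository", "/nfs_nagioslog_backups/new_repository"
)
```

Passing `request` to `urllib.request.urlopen` would send it. The directory has to exist on the NFS mount and be writable by the elasticsearch user on both nodes before the repository is registered.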

Post by **gsl_ops_practice** » Tue Dec 06, 2016 3:01 pm

Sounds good, I will set up a new repository and we will see tomorrow if the backup job runs normally with a clean slate.

Nagios Support Forum
