NLS Cluster backups issue

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

NLS Cluster backups issue

Post by gsl_ops_practice »

Hello,

We are experiencing the following situation and would like to find a fix please:

1. 2-node NLS cluster, version 1.4.0. Backups go to the NFS mount /nfs_nagioslog_backups and were taken daily without problems.
2. The retention interval for backups is 330 days, but /nfs_nagioslog_backups contains much older backups, almost 500 days old.
3. Since backups older than 330 days were not being deleted, and we ran out of unallocated disk space to add to the backup volume, I deleted older backups manually from inside the indices directory.
4. After this, NLS stopped taking backups.
5. We upgraded to 1.4.4 last week. It included some fixes for backup issues, and we hoped it would resolve ours, but backups are still not happening.

Please advise what we can do to resolve. Attaching screenshot of the backup config.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: NLS Cluster backups issue

Post by mcapra »

gsl_ops_practice wrote: 2. Retention interval for backups is 330 days, but on /nfs_nagioslog_backups there are backups much older, almost 500 days.
I'll address this point first. The "Delete backups older than" setting controls how long snapshots are kept, not how long individual indices are kept. With a setting of 330 days, you are keeping 330 days' worth of elasticsearch snapshots. An elasticsearch snapshot is exactly that: a snapshot of the current state of elasticsearch, containing every index open at the time it is taken. Therefore, if you keep 120 days of indices open at any given moment and keep 330 days' worth of snapshots, your effective retention is roughly 450 days (330 + 120).
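The arithmetic above can be written out as a small sketch. The function names are made up for illustration, and the model is deliberately simplified: it assumes every snapshot contains all indices open at the time it was taken, as described above.

```python
def effective_retention_days(snapshot_retention_days, open_index_days):
    """Approximate age of the oldest data still held in the backup repository.

    The oldest retained snapshot is snapshot_retention_days old, and it
    captured indices reaching back a further open_index_days.
    """
    return snapshot_retention_days + open_index_days


def snapshot_setting_for(target_days, open_index_days):
    """Snapshot retention to configure for a desired effective retention."""
    return max(target_days - open_index_days, 0)


# Values from this thread: 330 days of snapshots, 120 days of open indices.
print(effective_retention_days(330, 120))

# Hypothetical goal: about 330 days of effective retention with 120 days open.
print(snapshot_setting_for(330, 120))
```

Under these assumptions, an effective retention of about 330 days would mean setting "Delete backups older than" to roughly 210 days.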
gsl_ops_practice wrote:Please advise what we can do to resolve.
Can you share the output of the following command executed from the CLI of your Nagios Log Server machine (replace my_repository with the name of your repository):

Code: Select all

curator --debug snapshot --repository my_repository indices --older-than 1 --time-unit days --timestring %Y.%m.%d
Former Nagios employee
https://www.mcapra.com/
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

Re: NLS Cluster backups issue

Post by gsl_ops_practice »

I learned something new :) 120 + 330 does add up to what I was seeing before, will have to tweak that once the backups start working properly.

I am attaching output of the command in a text file, it is 263k. At the end of the command there was an error:

[root@nagioslog-1-nnn~]# curator --debug snapshot --repository my_repository indices --older-than 1 --time-unit days --timestring %Y.%m.%d > nagioslog_support.txt
Traceback (most recent call last):
  File "/usr/bin/curator", line 11, in <module>
    sys.exit(main())
  File "/usr/lib/python2.6/site-packages/curator/curator.py", line 5, in main
    cli( obj={ "filters": [] } )
  File "/usr/lib/python2.6/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python2.6/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python2.6/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python2.6/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python2.6/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib64/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.6/site-packages/click/core.py", line 86, in augment_usage_errors
    yield
  File "/usr/lib/python2.6/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/curator/cli/index_selection.py", line 167, in indices
    retval = do_command(client, ctx.parent.info_name, working_list, ctx.parent.params, master_timeout)
  File "/usr/lib/python2.6/site-packages/curator/cli/utils.py", line 250, in do_command
    skip_repo_validation=params['skip_repo_validation'],
  File "/usr/lib/python2.6/site-packages/curator/api/snapshot.py", line 72, in create_snapshot
    if name in all_snaps:
TypeError: argument of type 'bool' is not iterable
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: NLS Cluster backups issue

Post by mcapra »

gsl_ops_practice wrote:I learned something new :) 120 + 330 does add up to what I was seeing before, will have to tweak that once the backups start working properly.
The language used is admittedly confusing. You're not the first person to ask these sorts of questions :)

Has the output you attached been altered in any way? It looks as if you ran the snapshot against my_repository. If that is your actual repository's name (it's blacked out in your original post, so I have no way of knowing), it looks as if you have corrupted snapshots:

Code: Select all

2016-12-06 18:39:01,802 WARNING            elasticsearch       log_request_fail:82   GET /_snapshot/my_repository/_all [status:400 request:0.444s]
2016-12-06 18:39:01,802 DEBUG              elasticsearch       log_request_fail:90   > None
2016-12-06 18:39:01,802 ERROR          curator.api.utils          get_snapshots:254  Unable to find all snapshots in repository: my_repository
I would suggest creating a new repository and attempting another snapshot using the previously mentioned command.
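For anyone wondering how the failed snapshot listing leads to the TypeError at the end of the traceback, here is a rough sketch of the likely mechanism. It is an inference from the log and traceback, not a copy of curator's code: the stub functions and the snapshot name below are hypothetical, and the assumption is that the listing call returns a boolean instead of a list when the request fails.

```python
def list_snapshots_stub(request_ok):
    """Hypothetical stand-in for the snapshot listing call.

    Returns a list of snapshot names on success, or False on failure
    (assumed here to mirror what the log and traceback suggest).
    """
    return ["curator-20161205"] if request_ok else False


def name_taken(name, all_snaps):
    # Same membership test as the last frame of the traceback.
    return name in all_snaps


def name_taken_guarded(name, all_snaps):
    """Variant that reports the real problem instead of a TypeError."""
    if all_snaps is False:
        raise RuntimeError("could not list snapshots; check the repository")
    return name in all_snaps


try:
    name_taken("curator-20161206", list_snapshots_stub(False))
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

If that reading is right, the TypeError is only a symptom, and the thing to fix is whatever makes the GET on /_snapshot/my_repository/_all return status 400.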
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

Re: NLS Cluster backups issue

Post by gsl_ops_practice »

Hi, I did a sed replace on the name of my actual repository, nothing else was altered.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: NLS Cluster backups issue

Post by mcapra »

In that case:
mcapra wrote: I would suggest creating a new repository and attempting another snapshot using the previously mentioned command.
Whether the snapshot against the new repository succeeds or fails will help identify whether the issue is specific to the old repository or whether something within curator/elasticsearch is failing.
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

Re: NLS Cluster backups issue

Post by gsl_ops_practice »

Having one corrupted snapshot is something I can live with. I was trying to restore data that was 5 months old to run an analysis but that wasn't possible. Can we keep any of the archived snapshots or are you saying we need to start fresh?
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

Re: NLS Cluster backups issue

Post by gsl_ops_practice »

Creating a new repository is not very straightforward due to storage constraints. Can we use a new directory inside the existing NFS mount as a new repository?
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: NLS Cluster backups issue

Post by mcapra »

gsl_ops_practice wrote:can we use a new directory inside the existing NFS mount as a new repository?
That should be fine. The important thing is testing against a clean slate from the perspective of curator and elasticsearch.
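As a rough illustration of what a repository in a new subdirectory looks like at the elasticsearch level, the sketch below builds (but does not send) a request that registers a shared-filesystem repository through the snapshot API. The repository name, the subdirectory and the localhost:9200 endpoint are all assumptions made for the example. Nagios Log Server has its own interface for creating repositories, which is probably the better place to do this in practice.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"                        # assumed local elasticsearch endpoint
REPO_NAME = "nls_backup_test"                           # hypothetical repository name
REPO_PATH = "/nfs_nagioslog_backups/nls_backup_test"    # hypothetical new subdirectory


def build_register_request(es_url, repo_name, location):
    """Build a PUT /_snapshot/<name> request for an 'fs' type repository."""
    body = json.dumps({
        "type": "fs",
        "settings": {"location": location, "compress": True},
    }).encode("utf-8")
    return urllib.request.Request(
        url="%s/_snapshot/%s" % (es_url, repo_name),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_register_request(ES_URL, REPO_NAME, REPO_PATH)
print(request.get_method(), request.full_url)
# To actually register the repository: urllib.request.urlopen(request)
```

The directory has to exist and be writable by elasticsearch on every node, and depending on the elasticsearch version the location may also need to be allowed under path.repo.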
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

Re: NLS Cluster backups issue

Post by gsl_ops_practice »

Sounds good, I will set up a new repository and we will see tomorrow if the backup job runs normally with a clean slate.