NLS 1.4.3 breaks snapshot backups

nagiosnl_jorgen
Posts: 6
Joined: Tue Apr 22, 2014 8:39 am

Post by nagiosnl_jorgen »

After upgrading to Nagios Log Server 1.4.3 we discovered that the Backup & Maintenance backup jobs (curator snapshot) had apparently failed; they were not even listed in the GUI.
Time to start debugging. Start by making a copy of the curator.sh script, then modify it on all nodes (this kind of subsystem job may run on any node).

Code: Select all

cp -p /usr/local/nagioslogserver/scripts/curator.sh{,.ORG}
Add some "echo" lines to curator.sh so it looks like this:

Code: Select all

[root@nls143 ~]# cat /usr/local/nagioslogserver/scripts/curator.sh
#!/bin/sh
date=$(date)
curator "$@"
echo --- >> /tmp/curatordebug.txt
echo $date >> /tmp/curatordebug.txt
echo "$@" >> /tmp/curatordebug.txt
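A slightly more defensive variant of the same debug idea, as a sketch only: it logs the arguments before curator runs, so a run that hangs or dies is still recorded. The function name log_curator_call and the DEBUG_LOG variable are made up for this example and are not part of NLS.

```shell
# Hypothetical helper, not shipped with NLS.
# DEBUG_LOG is an assumed variable; it defaults to the file used in this post.
DEBUG_LOG="${DEBUG_LOG:-/tmp/curatordebug.txt}"

log_curator_call() {
    # Append a separator, a timestamp and the full argument list to the log.
    {
        echo "---"
        date
        printf '%s\n' "$*"
    } >> "$DEBUG_LOG"
}

# Possible use inside curator.sh (assumption):
#   log_curator_call "$@"
#   exec curator "$@"    # exec keeps curator's own exit status
```

Logging first and then using exec means the script's exit status is curator's, not that of the last echo.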
Change the contents of curator.sh on every node. Then reschedule the "backup_maintenance" job in the GUI to run it a few seconds after "now".
Observe the results in /tmp/curatordebug.txt on every node. Below you'll find what I looked for using grep.

Code: Select all

[root@nls143 ~]# grep -B1 ^snapshot /tmp/curatordebug.txt
Fri Nov 11 22:01:34 CET 2016
snapshot --repository SharedBackupRepo indices --older-than 1 --time-unit days --timestring %Y.%m.%d --ignore_unavailable
Let's try to run these arguments by hand, adding "curator" at the beginning of the command line:

Code: Select all

[root@nls143 ~]# curator snapshot --repository SharedBackupRepo indices --older-than 1 --time-unit days --timestring %Y.%m.%d --ignore_unavailable
Error: no such option: --ignore_unavailable
According to "curator snapshot --help", --ignore_unavailable is listed under "Options", so it should be valid. However, NLS puts it at the very end of the command line, after the indices command, so it appears to be parsed as an option of indices rather than as an option of snapshot. Look closely at the Usage line:

Code: Select all

[root@nls143 ~]# curator snapshot --help
Usage: curator snapshot [OPTIONS] COMMAND [ARGS]...

  Take snapshots of indices (Backup)

Options:
  --repository TEXT               Repository name.
  --name TEXT                     Override default name.
  --prefix TEXT                   Override default prefix.
  --wait_for_completion BOOLEAN   Wait for snapshot to complete before
                                  returning.  [default: True]
  --ignore_unavailable            Ignore unavailable shards/indices.
  --include_global_state BOOLEAN  Store cluster global state with snapshot.
                                  [default: True]
  --partial                       Do not fail if primary shard is unavailable.
  --request_timeout INTEGER       Allow this many seconds before the
                                  transaction times out.  [default: 21600]
  --skip-repo-validation          Skip repository access validation.
  --help                          Show this message and exit.

Commands:
  indices  Index selection.
So --ignore_unavailable has to move into the OPTIONS part of the command line, before the COMMAND (indices) and the index selection arguments that follow it.

That solves the problem:

Code: Select all

# curator snapshot --repository SharedBackupRepo --ignore_unavailable indices --older-than 1 --time-unit days --timestring %Y.%m.%d
2016-11-11 22:18:47,869 INFO      Job starting: snapshot indices
(...)
2016-11-11 22:18:50,495 INFO      Snapshot name: curator-20161111211850
2016-11-11 22:18:54,476 INFO      Snapshot curator-20161111211850 successfully completed.
2016-11-11 22:18:54,476 INFO      Job completed successfully.
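As a stopgap until a fixed release is available, the reordering shown above could be done in the wrapper script itself. The following is only a sketch under assumptions: the function name run_curator_fixed and the CURATOR_BIN variable are invented for this example, and it handles only the one misplaced flag seen in this thread.

```shell
# Hypothetical workaround, not the official fix.
# Moves a trailing --ignore_unavailable in front of the "indices" command.
run_curator_fixed() {
    # Pass 1: is --ignore_unavailable placed after "indices"?
    seen=0
    misplaced=0
    for arg in "$@"; do
        if [ "$seen" -eq 1 ] && [ "$arg" = "--ignore_unavailable" ]; then
            misplaced=1
        fi
        if [ "$arg" = "indices" ]; then
            seen=1
        fi
    done

    # Pass 2: rebuild "$@" by rotating through the original arguments.
    n=$#
    seen=0
    while [ "$n" -gt 0 ]; do
        arg=$1
        shift
        n=$((n - 1))
        if [ "$seen" -eq 0 ] && [ "$arg" = "indices" ]; then
            seen=1
            if [ "$misplaced" -eq 1 ]; then
                set -- "$@" "--ignore_unavailable"
            fi
            set -- "$@" "$arg"
        elif [ "$seen" -eq 1 ] && [ "$arg" = "--ignore_unavailable" ]; then
            :   # drop the misplaced flag; it was re-added before "indices"
        else
            set -- "$@" "$arg"
        fi
    done

    # CURATOR_BIN is an assumed override, handy for dry runs with "echo".
    "${CURATOR_BIN:-curator}" "$@"
}
```

Setting CURATOR_BIN=echo prints the rewritten command line instead of running it, which is a cheap way to check the result on each node before touching the real job.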
This info might help the Nagios developers fix the bug introduced in version 1.4.3 more quickly.

Additional tip for NLS admins: use Nagios Core/XI to check snapshot backups: https://github.com/jvandermeulen/Nagios ... _backup.sh
In this particular case an UNKNOWN state (UNKNOWN: Unable to determine result within last 25 hours: No snapshots matched provided args.) brought this to my attention. The default threshold is 24+1=25 hours. No backup results in UNKNOWN, a failed backup results in CRITICAL, and a partial backup results in WARNING.
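For anyone writing a similar check, the mapping described above could look roughly like this. It is not the plugin linked above, just a sketch: the function name is made up, and the state strings SUCCESS, PARTIAL and FAILED are assumed to be the ones reported by the Elasticsearch snapshot API. The exit codes are the standard Nagios plugin return codes.

```shell
# Hypothetical mapping from a snapshot state to a Nagios plugin result.
# Nagios return codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
snapshot_state_to_nagios() {
    case "$1" in
        SUCCESS)
            echo "OK: snapshot completed"
            return 0
            ;;
        PARTIAL)
            echo "WARNING: partial snapshot"
            return 1
            ;;
        FAILED)
            echo "CRITICAL: snapshot failed"
            return 2
            ;;
        *)
            # No snapshot found in the time window, or an unexpected state.
            echo "UNKNOWN: unable to determine result"
            return 3
            ;;
    esac
}
```

An empty argument (no snapshot matched within the threshold) falls through to UNKNOWN, which is the state that flagged this bug.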

Jørgen van der Meulen
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: NLS 1.4.3 breaks snapshot backups

Post by mcapra »

I have filed a bug report for this (ID 10109). The developers have been made aware of the issue. I sincerely apologize for the trouble this has caused and thank you for the research you've done into this issue.
Former Nagios employee
https://www.mcapra.com/
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

Re: NLS 1.4.3 breaks snapshot backups

Post by gsl_ops_practice »

Hello,

Can you please advise if this issue has been fixed in 1.4.4?

Thank you
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: NLS 1.4.3 breaks snapshot backups

Post by mcapra »

This has been reported fixed as of NLS 1.4.4:
https://assets.nagios.com/downloads/nag ... HANGES.TXT

Code: Select all

- Fixed curator script not accepting argument at the end of the command [TPS#10109] -JO
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

Re: NLS 1.4.3 breaks snapshot backups

Post by gsl_ops_practice »

Thank you for a quick reply :)
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: NLS 1.4.3 breaks snapshot backups

Post by mcapra »

Sure thing! Did you have additional questions regarding this issue or may I close this thread?
gsl_ops_practice
Posts: 151
Joined: Thu Apr 09, 2015 9:14 pm

Re: NLS 1.4.3 breaks snapshot backups

Post by gsl_ops_practice »

I have no further questions for this thread, kindly consider this resolved.