
NLS 1.4.3 breaks snapshot backups

Posted: Fri Nov 11, 2016 5:52 pm
by nagiosnl_jorgen
After upgrading to Nagios Log Server 1.4.3 we discovered that the Backup & Maintenance backup jobs (curator snapshot) were apparently failing; they were not even listed in the GUI.
Time to start debugging. First make a copy of the curator.sh script on every node, then modify it (a subsystem job like this may run on any node).

Code: Select all

cp -p /usr/local/nagioslogserver/scripts/curator.sh{,.ORG}
Add some "echo" lines to curator.sh so it looks like this:

Code: Select all

[root@nls143 ~]# cat /usr/local/nagioslogserver/scripts/curator.sh
#!/bin/sh
date=$(date)
curator "$@"
echo --- >> /tmp/curatordebug.txt
echo $date >> /tmp/curatordebug.txt
echo "$@" >> /tmp/curatordebug.txt
Change the contents of curator.sh on every node. Then reschedule the "backup_maintenance" job in the GUI to run it a few seconds after "now".
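One side effect of the wrapper as posted: its exit status is that of the final echo, not of curator. Whether NLS looks at that status is not verified here, but if it matters, a variant like the sketch below writes the log entry first and then returns curator's own status. The names log_and_run, LOGFILE and CURATOR are made up for this example.

```shell
#!/bin/sh
# Hypothetical variant of the debug wrapper (not shipped with NLS).
# It logs the arguments first, then runs curator last so that the
# caller sees curator's exit status instead of echo's.

LOGFILE="${LOGFILE:-/tmp/curatordebug.txt}"  # same debug file as in the post
CURATOR="${CURATOR:-curator}"                # assumption: curator is on PATH

log_and_run() {
    {
        echo ---
        date
        printf '%s\n' "$*"   # all arguments on one line, as in the original
    } >> "$LOGFILE"

    # Last command in the function: its status becomes the return status.
    "$CURATOR" "$@"
}
```

In curator.sh the final line would then be `log_and_run "$@"`.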
Observe the results in /tmp/curatordebug.txt on every node. Below you'll find what I looked for using grep.

Code: Select all

[root@nls143 ~]# grep -B1 ^snapshot /tmp/curatordebug.txt
Fri Nov 11 22:01:34 CET 2016
snapshot --repository SharedBackupRepo indices --older-than 1 --time-unit days --timestring %Y.%m.%d --ignore_unavailable
Let's run curator by hand with exactly these arguments:

Code: Select all

[root@nls143 ~]# curator snapshot --repository SharedBackupRepo indices --older-than 1 --time-unit days --timestring %Y.%m.%d --ignore_unavailable
Error: no such option: --ignore_unavailable
According to "curator snapshot --help", --ignore_unavailable is listed under "Options", so it should be valid! However, NLS puts it at the very end of the command, after the "indices" COMMAND, where it appears to be parsed as one of that command's ARGS rather than as an OPTION of "snapshot". Look closely at the Usage line:

Code: Select all

[root@nls143 ~]# curator snapshot --help
Usage: curator snapshot [OPTIONS] COMMAND [ARGS]...

  Take snapshots of indices (Backup)

Options:
  --repository TEXT               Repository name.
  --name TEXT                     Override default name.
  --prefix TEXT                   Override default prefix.
  --wait_for_completion BOOLEAN   Wait for snapshot to complete before
                                  returning.  [default: True]
  --ignore_unavailable            Ignore unavailable shards/indices.
  --include_global_state BOOLEAN  Store cluster global state with snapshot.
                                  [default: True]
  --partial                       Do not fail if primary shard is unavailable.
  --request_timeout INTEGER       Allow this many seconds before the
                                  transaction times out.  [default: 21600]
  --skip-repo-validation          Skip repository access validation.
  --help                          Show this message and exit.

Commands:
  indices  Index selection.
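To make the effect of the Usage line concrete, here is a toy model of a two-level command line: options for "snapshot" are only accepted before the COMMAND word, and everything after it is handed to the subcommand. This is only an illustration written for this thread, not curator's real parser; the function name toy_snapshot, the short option lists and the "Missing command" message are made up.

```shell
#!/bin/sh
# Toy model only: this is NOT curator's actual parser.
# It mimics "Usage: curator snapshot [OPTIONS] COMMAND [ARGS]...".

toy_snapshot() {
    # Level 1: options that belong to "snapshot" itself.
    while [ $# -gt 0 ]; do
        case $1 in
            --repository)              # option with a value
                shift
                if [ $# -gt 0 ]; then shift; fi ;;
            --ignore_unavailable)      # plain flag
                shift ;;
            --*)
                echo "Error: no such option: $1" >&2
                return 2 ;;
            *)
                break ;;               # first bare word is the COMMAND
        esac
    done

    if [ $# -eq 0 ]; then
        echo "Error: Missing command." >&2
        return 2
    fi
    subcommand=$1
    shift

    # Level 2: whatever is left belongs to the subcommand.
    while [ $# -gt 0 ]; do
        case $1 in
            --older-than|--time-unit|--timestring)
                shift
                if [ $# -gt 0 ]; then shift; fi ;;
            *)
                echo "Error: no such option: $1" >&2
                return 2 ;;
        esac
    done
    echo "ok: snapshot of $subcommand"
}
```

Called with --ignore_unavailable after "indices", the toy rejects the flag at the second level; with the flag before "indices" it is accepted at the first level.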
So --ignore_unavailable has to move into the OPTIONS part of the command line, right before the COMMAND (here: indices) where the index selection begins.

That solves the problem:

Code: Select all

# curator snapshot --repository SharedBackupRepo --ignore_unavailable indices --older-than 1 --time-unit days --timestring %Y.%m.%d
2016-11-11 22:18:47,869 INFO      Job starting: snapshot indices
(...)
2016-11-11 22:18:50,495 INFO      Snapshot name: curator-20161111211850
2016-11-11 22:18:54,476 INFO      Snapshot curator-20161111211850 successfully completed.
2016-11-11 22:18:54,476 INFO      Job completed successfully.
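Until an official fix is available, the same reordering could be done automatically inside curator.sh. The following is only a sketch of that idea and not the fix Nagios ships: the function name curator_reordered and the CURATOR variable are invented here, and it assumes the word "indices" only ever appears as the subcommand, never as an option value.

```shell
#!/bin/sh
# Hypothetical workaround: move --ignore_unavailable in front of the
# "indices" subcommand before calling curator.

CURATOR="${CURATOR:-curator}"   # assumption: curator is on PATH

curator_reordered() {
    flag="--ignore_unavailable"

    # First pass: is the flag present at all?
    present=0
    for a in "$@"; do
        if [ "$a" = "$flag" ]; then present=1; fi
    done

    # Second pass: rotate through the arguments, dropping the flag where
    # it stands and re-adding it right before the first "indices".
    placed=0
    n=$#
    while [ "$n" -gt 0 ]; do
        a=$1
        shift
        n=$((n - 1))
        if [ "$a" = "$flag" ]; then
            continue
        fi
        if [ "$a" = "indices" ] && [ "$present" -eq 1 ] && [ "$placed" -eq 0 ]; then
            set -- "$@" "$flag"
            placed=1
        fi
        set -- "$@" "$a"
    done

    # No "indices" word found: keep the flag rather than losing it.
    if [ "$present" -eq 1 ] && [ "$placed" -eq 0 ]; then
        set -- "$@" "$flag"
    fi

    "$CURATOR" "$@"
}
```

Setting CURATOR=echo gives a dry run that only prints the reordered argument list, which is handy for checking the result before letting it touch a real repository.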
This info might help the Nagios developers fix the bug introduced in version 1.4.3 more quickly.

Additional tip for NLS admins: use Nagios Core/XI to check snapshot backups: https://github.com/jvandermeulen/Nagios ... _backup.sh
In this particular case an UNKNOWN state (UNKNOWN: Unable to determine result within last 25 hours: No snapshots matched provided args.) brought this to my attention. The default threshold is 24+1=25 hours. No backup results in UNKNOWN, a failed backup in CRITICAL, and a partial backup in WARNING.
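For readers who have not written a Nagios check before, the mapping described above can be sketched as follows. This is not code from the linked plugin: the function name, its arguments and the message texts are made up, and it assumes the caller has already looked up the state of the newest snapshot (SUCCESS, PARTIAL or FAILED in Elasticsearch terms) and its age in whole hours. The return values are the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Hypothetical helper, NOT taken from the linked check script.
#   $1  state of the newest snapshot, empty if none was found
#   $2  age of that snapshot in whole hours
#   $3  maximum accepted age in hours (default 25 = 24+1)

nagios_status_for_snapshot() {
    state=$1
    age_hours=${2:-0}
    max_age_hours=${3:-25}

    # No snapshot at all, or only one older than the threshold.
    if [ -z "$state" ] || [ "$age_hours" -gt "$max_age_hours" ]; then
        echo "UNKNOWN: no snapshot result within last $max_age_hours hours"
        return 3
    fi

    case $state in
        SUCCESS)
            echo "OK: snapshot completed $age_hours hours ago"
            return 0 ;;
        PARTIAL)
            echo "WARNING: snapshot is only partial"
            return 1 ;;
        FAILED)
            echo "CRITICAL: snapshot failed"
            return 2 ;;
        *)
            echo "UNKNOWN: unexpected snapshot state '$state'"
            return 3 ;;
    esac
}
```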

Jørgen van der Meulen

Re: NLS 1.4.3 breaks snapshot backups

Posted: Mon Nov 14, 2016 10:11 am
by mcapra
I have filed a bug report for this (ID 10109). The developers have been made aware of the issue. I sincerely apologize for the trouble this has caused and thank you for the research you've done into this issue.

Re: NLS 1.4.3 breaks snapshot backups

Posted: Mon Nov 28, 2016 2:32 pm
by gsl_ops_practice
Hello,

Can you please advise if this issue has been fixed in 1.4.4?

Thank you

Re: NLS 1.4.3 breaks snapshot backups

Posted: Mon Nov 28, 2016 2:35 pm
by mcapra
This has been reported fixed as of NLS 1.4.4:
https://assets.nagios.com/downloads/nag ... HANGES.TXT

Code: Select all

- Fixed curator script not accepting argument at the end of the command [TPS#10109] -JO

Re: NLS 1.4.3 breaks snapshot backups

Posted: Mon Nov 28, 2016 2:38 pm
by gsl_ops_practice
Thank you for a quick reply :)

Re: NLS 1.4.3 breaks snapshot backups

Posted: Mon Nov 28, 2016 2:59 pm
by mcapra
Sure thing! Did you have additional questions regarding this issue or may I close this thread?

Re: NLS 1.4.3 breaks snapshot backups

Posted: Mon Nov 28, 2016 3:52 pm
by gsl_ops_practice
I have no further questions for this thread, kindly consider this resolved.