NLS 1.4.3 breaks snapshot backups
Posted: Fri Nov 11, 2016 5:52 pm
After upgrading to Nagios Log Server 1.4.3 we discovered that the Backup & Maintenance backup jobs (curator snapshot) were apparently failing: they were not even listed in the GUI.
Time to start debugging. Let us begin by modifying the curator.sh script on all nodes (this kind of subsystem job may run on any node), after making a backup copy first.
Add some "echo" lines to curator.sh that append a separator, the date and the arguments it was called with to /tmp/curatordebug.txt.
Change the contents of curator.sh on every node. Then reschedule the "backup_maintenance" job in the GUI to run it a few seconds after "now".
Observe the results in /tmp/curatordebug.txt on every node. Below you'll find what I looked for using grep.
Let's try to run these arguments by hand, adding "curator" at the beginning of the command line. Curator rejects them with "Error: no such option: --ignore_unavailable".
According to "curator snapshot --help", --ignore_unavailable is listed under "Options", so it should be valid! However, NLS puts it at the very end of the command, after the COMMAND (indices), so it is apparently seen as one of the ARGS of that command rather than as an OPTION of snapshot. Observe the Usage line closely: curator snapshot [OPTIONS] COMMAND [ARGS]...
So we have to move --ignore_unavailable into the OPTIONS part of the command line, right before the COMMAND (value: indices), where the index selection part begins.
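As a stopgap until an official fix is available, the reordering described above could be done in the curator.sh wrapper itself. The following is only a sketch under assumptions: the function name move_ignore_unavailable is made up for illustration, it is not part of NLS or curator, and it assumes the literal word "indices" appears only as the COMMAND. It prints the corrected argument list instead of running curator.

```shell
#!/bin/sh
# Hypothetical helper (not part of NLS or curator): if --ignore_unavailable
# appears after the "indices" COMMAND, move it in front of that COMMAND.
# Assumes the literal word "indices" only occurs as the COMMAND.
move_ignore_unavailable() {
    seen_indices=0
    misplaced=0
    # Pass 1: is the flag located after the COMMAND?
    for arg in "$@"; do
        case $arg in
            indices) seen_indices=1 ;;
            --ignore_unavailable)
                if [ "$seen_indices" -eq 1 ]; then misplaced=1; fi ;;
        esac
    done

    # Pass 2: rebuild the argument list by rotating through "$@" once.
    if [ "$misplaced" -eq 1 ]; then
        seen_indices=0
        count=$#
        while [ "$count" -gt 0 ]; do
            arg=$1
            shift
            count=$((count - 1))
            if [ "$arg" = "indices" ] && [ "$seen_indices" -eq 0 ]; then
                seen_indices=1
                set -- "$@" --ignore_unavailable indices
            elif [ "$arg" = "--ignore_unavailable" ] && [ "$seen_indices" -eq 1 ]; then
                :   # drop the misplaced flag; it was re-inserted above
            else
                set -- "$@" "$arg"
            fi
        done
    fi

    # Print the result; a real wrapper would run: curator "$@"
    printf '%s\n' "$*"
}

# The arguments NLS 1.4.3 passed in, as captured in /tmp/curatordebug.txt.
move_ignore_unavailable snapshot --repository SharedBackupRepo indices \
    --older-than 1 --time-unit days --timestring %Y.%m.%d --ignore_unavailable
```

In a wrapper, the final printf could be replaced by curator "$@" to run the corrected command directly. Arguments that already have the flag in the right place, or do not use it at all, pass through unchanged.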
That solves your problem! Look at this:
This information might help the Nagios developers fix the bug introduced in version 1.4.3 more quickly.
Additional tip for NLS admins: use Nagios Core/XI to check snapshot backups: https://github.com/jvandermeulen/Nagios ... _backup.sh
In this particular case an UNKNOWN state (UNKNOWN: Unable to determine result within last 25 hours: No snapshots matched provided args.) brought this to my attention. The default threshold is 24+1=25 hours. No backup results in UNKNOWN, a failed backup results in CRITICAL, and a partial backup results in WARNING.
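The mapping described above can be sketched as a small shell function. This is not the code of the linked plugin, only an illustration: the function name is hypothetical, the input values are assumed to be the snapshot states reported by Elasticsearch (SUCCESS, PARTIAL, FAILED), and the return values follow the standard Nagios plugin exit codes. The 25-hour age check is left out.

```shell
#!/bin/sh
# Hypothetical mapping (not taken from the linked plugin): translate the state
# of the most recent snapshot into a Nagios status label and exit code.
snapshot_state_to_nagios() {
    case $1 in
        SUCCESS) echo "OK";       return 0 ;;
        PARTIAL) echo "WARNING";  return 1 ;;
        FAILED)  echo "CRITICAL"; return 2 ;;
        *)       echo "UNKNOWN";  return 3 ;;   # no snapshot found, or any other state
    esac
}

# Show the mapping for each case; an empty state stands for "no backup found".
for state in SUCCESS PARTIAL FAILED ""; do
    rc=0
    label=$(snapshot_state_to_nagios "$state") || rc=$?
    echo "state='${state}' -> ${label} (exit code ${rc})"
done
```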
Jørgen van der Meulen
Code: Select all
cp -p /usr/local/nagioslogserver/scripts/curator.sh{,.ORG}
Code: Select all
[root@nls143 ~]# cat /usr/local/nagioslogserver/scripts/curator.sh
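#!/bin/sh
# Debug wrapper: run curator with the arguments NLS passes in, then log them.
date=$(date)
curator "$@"
rc=$?   # remember curator's exit status so the logging below does not mask it
echo --- >> /tmp/curatordebug.txt
echo "$date" >> /tmp/curatordebug.txt
echo "$@" >> /tmp/curatordebug.txt
exit $rc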
Code: Select all
[root@nls143 ~]# grep -B1 ^snapshot /tmp/curatordebug.txt
Fri Nov 11 22:01:34 CET 2016
snapshot --repository SharedBackupRepo indices --older-than 1 --time-unit days --timestring %Y.%m.%d --ignore_unavailable
Code: Select all
[root@nls143 ~]# curator snapshot --repository SharedBackupRepo indices --older-than 1 --time-unit days --timestring %Y.%m.%d --ignore_unavailable
Error: no such option: --ignore_unavailable
Code: Select all
[root@nls143 ~]# curator snapshot --help
Usage: curator snapshot [OPTIONS] COMMAND [ARGS]...

  Take snapshots of indices (Backup)

Options:
  --repository TEXT               Repository name.
  --name TEXT                     Override default name.
  --prefix TEXT                   Override default prefix.
  --wait_for_completion BOOLEAN   Wait for snapshot to complete before
                                  returning. [default: True]
  --ignore_unavailable            Ignore unavailable shards/indices.
  --include_global_state BOOLEAN  Store cluster global state with snapshot.
                                  [default: True]
  --partial                       Do not fail if primary shard is unavailable.
  --request_timeout INTEGER       Allow this many seconds before the
                                  transaction times out. [default: 21600]
  --skip-repo-validation          Skip repository access validation.
  --help                          Show this message and exit.

Commands:
  indices  Index selection.
Code: Select all
# curator snapshot --repository SharedBackupRepo --ignore_unavailable indices --older-than 1 --time-unit days --timestring %Y.%m.%d
2016-11-11 22:18:47,869 INFO Job starting: snapshot indices
(...)
2016-11-11 22:18:50,495 INFO Snapshot name: curator-20161111211850
2016-11-11 22:18:54,476 INFO Snapshot curator-20161111211850 successfully completed.
2016-11-11 22:18:54,476 INFO Job completed successfully.