Nagios Log Server - Troubleshooting Backups

Overview

This article explains how to troubleshoot backups in Nagios Log Server.

There are two ways to diagnose backup issues:

Watching the log server job logs and forcing a backup
Using the command line to execute backups

Watch Nagios Log Server Job Log

With this method, you force the backup to run through the web interface. The backup is then executed by one of the nodes in the cluster, not necessarily the node whose web interface you used. For this reason, you need to watch the job log on ALL nodes in the cluster.

Open an SSH session to each node in your cluster.

Execute the following command:

tail -f /usr/local/nagioslogserver/var/jobs.log

Once you have done this on all nodes in the cluster, open the Nagios Log Server GUI.

On the top menu bar click Admin.

Navigate to System > Command Subsystem.

Click Edit for the backups job

Click inside the Next Run Time field

In the drop down calendar that appears click Now

Click Done

Click Update

Now watch all of the SSH sessions to observe the backup process.

In the GUI, the job shows as running while the backup is in progress. It is complete when the status changes to waiting.
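
Once the job has returned to waiting, it can be convenient to collect the most recent job log lines from every node in one pass instead of scrolling through each session. The following is a minimal sketch, assuming key-based SSH access from your workstation to each node. The function names and the node names nls-01 and nls-02 are hypothetical and not part of Nagios Log Server.

```shell
# Prefix every line read from stdin with a label, e.g. the node name.
prefix_lines() {
  sed "s/^/[$1] /"
}

# Print the last 20 lines of jobs.log from each node given as an argument.
# ssh -n stops ssh from reading the terminal's stdin inside the loop.
recent_job_logs() {
  for node in "$@"; do
    ssh -n "$node" tail -n 20 /usr/local/nagioslogserver/var/jobs.log \
      | prefix_lines "$node"
  done
}

# Usage (hypothetical node names, not run here):
#   recent_job_logs nls-01 nls-02
```

Because each line carries its node name, the node that actually executed the backup is easy to spot in the combined output.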

Command Line Backup

Unlike the previous method, this one only requires an SSH session to one of the nodes in the cluster.

Open an SSH session to a node in your cluster.

Execute the following command:

curl -XGET "localhost:9200/_snapshot?pretty"

This command returns the name of the backup snapshot repository, which is needed for the backup command. In the following output, the name to use is Common_Backups:

{
  "Common_Backups" : {
    "type" : "fs",
    "settings" : {
      "compress" : "true",
      "location" : "/mnt/nagios_log_server_common_backups"
    }
  }
}
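
If you want to pick the repository name out of that output in a script, a small filter is enough. This is a sketch only: it relies on the two-space indentation of the pretty-printed output shown above and is not a general JSON parser. The function name repo_names is hypothetical.

```shell
# Print the top-level repository names from pretty-printed _snapshot output.
# Matches only lines of the form:  "Name" : {   (exactly two leading spaces).
repo_names() {
  sed -n 's/^  "\([^"]*\)" : {[[:space:]]*$/\1/p'
}

# Usage on a node (not run here):
#   curl -s -XGET "localhost:9200/_snapshot?pretty" | repo_names
```

If more than one repository is configured, each name is printed on its own line.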

The following command executes the backup:

curator snapshot --repository "Common_Backups" indices --all-indices

Here is an example of the output produced while this command executes:

2016-04-15 13:52:40,373 INFO      Job starting: snapshot indices
2016-04-15 13:52:40,373 WARNING   Overriding default connection timeout.  New timeout: 21600
2016-04-15 13:52:40,438 INFO      Matching all indices. Ignoring flags other than --exclude.
2016-04-15 13:52:40,439 INFO      Action snapshot will be performed on the following indices: [u'kibana-int', u'logstash-2015.03.23', u'logstash-2015.03.24', u'logstash-2015.03.25']
2016-04-15 13:52:44,829 INFO      Snapshot name: curator-20160415035244
2016-04-15 13:53:04,015 INFO      Snapshot curator-20160415035244 successfully completed.
2016-04-15 13:53:04,015 INFO      Job completed successfully.

Note: The duration this command will run for depends on how much data exists in your log server implementation.

If you want more detailed output, run the command with the debug log level:

curator --loglevel debug snapshot --repository "Common_Backups" indices --all-indices

To also write all of the output to a log file, use the following command:

curator --loglevel debug --logfile /tmp/test_backup.txt snapshot --repository "Common_Backups" indices --all-indices

Note: No output is displayed on the screen while this command runs because it is all written to the log file. The command has completed when you are returned to the bash prompt.
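
Because nothing is printed to the screen, you may want to check the log file once the prompt returns. The sketch below assumes the log contains the same "Job completed successfully" message and level names as the example output earlier in this section. ERROR is the standard Python logging level name and is an assumption here, since the example shows only INFO and WARNING. The function names are hypothetical.

```shell
# Succeed if the curator log file records a successful run.
backup_succeeded() {
  grep -q 'Job completed successfully' "$1"
}

# Print any WARNING or ERROR lines from the log for closer inspection.
backup_problems() {
  grep -E ' (WARNING|ERROR) ' "$1"
}

# Usage (not run here):
#   backup_succeeded /tmp/test_backup.txt && echo "backup OK"
#   backup_problems /tmp/test_backup.txt
```

The connection timeout WARNING seen in the example output will also be listed by backup_problems; it appears in a successful run, so it does not indicate a failure by itself.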

Cluster Master Node

It is important to understand which node is the cluster master. The cluster master is the node responsible for performing the actual backup.

Open a terminal session to a node in your cluster.

Execute the following command:

curl 'localhost:9200/_cat/master?v'

The output will be similar to:

id                     host                     ip         node                                 
JLpicZIOQSez77kwzJKx7g nls-c7x-x64.box293.local 10.25.5.86 4ab27926-bbb0-4a5e-bb7f-4eb9fba97643

Perform any backup testing from the node shown in this output. You should also test that every node can perform a backup when it is the master. There is no command to move the master role from one node to another; however, restarting the elasticsearch service on the current master will force another node to become the master. The command to restart the elasticsearch service is:

RHEL 7+ | CentOS 7+ | Debian | Ubuntu 16/18/20:

systemctl restart elasticsearch.service

After restarting the service, wait a minute and then run the master command again. It should now show the new master:

id                     host                     ip         node                                 
LYSbImCgT9CHl6iqss1S0g nls-r7x-x64.box293.local 10.25.5.99 4c5786bd-1382-44b6-bb67-88a9c0d3e7ea
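
When testing each node in turn, it helps to confirm from a script which host currently holds the master role. The following sketch parses the column layout shown in the output above (header line first, host in the second column). The function names are hypothetical, and it assumes the host name reported by elasticsearch matches the name you compare against.

```shell
# Print the host column of the elected master from `_cat/master?v` output.
master_host() {
  awk 'NR == 2 { print $2 }'
}

# Succeed if the host name given as $1 matches the master read from stdin.
is_master() {
  [ "$(master_host)" = "$1" ]
}

# Usage on a node (not run here):
#   curl -s 'localhost:9200/_cat/master?v' | master_host
#   curl -s 'localhost:9200/_cat/master?v' | is_master "$(hostname -f)"
```

If hostname -f does not return the same name that appears in the host column, pass the expected name explicitly instead.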

Note About Backup Repository

The following paragraph from Documentation - Managing Backups and Maintenance is important:

When you are on the Backup & Maintenance page, the table on the right labelled Repositories is where you will set the location for your Nagios Log Server backup to be stored. This location must be a shared network path writeable by the nagios user and available to ALL instances in your cluster.
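
A quick way to test the writeable part of this requirement is to create and remove a temporary file in the repository location. This is a minimal sketch; the function name repo_writable and the probe file name are hypothetical, and the check only reflects the permissions of the user who runs it.

```shell
# Succeed if the directory exists and the current user can create a file in it.
repo_writable() {
  dir=$1
  [ -d "$dir" ] || return 1
  # Create a uniquely named probe file, then remove it again.
  probe=$(mktemp "$dir/.nls_write_test.XXXXXX" 2>/dev/null) || return 1
  rm -f "$probe"
}

# Usage (not run here), with the location reported by the _snapshot command:
#   repo_writable /mnt/nagios_log_server_common_backups && echo "writeable"
```

To be meaningful, the check has to be run as the nagios user on every node in the cluster, since the repository must be available to all instances.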

Final Thoughts

For any support-related questions, please visit the Nagios Support Forums at:

http://support.nagios.com/forum/