
Snapshots stopped

Posted: Wed Jan 13, 2016 6:43 pm
by Fred Kroeger
I'm running a 2-node NLS cluster (recently upgraded to NLS 1.4.0) and noticed that the snapshots stopped being generated on 2/1/16.
In the Snapshot Table (in the Backup & Maint screen) the names of all the existing snapshots (from 7/12/15 to 2/1/16) are showing as N/A, which I'm pretty sure was not the case last year.
I did notice at the start of the year that snapshots had been created that had a date of 2014 so something happened around then I believe. I have deleted these since then.
The only thing I have done differently from the doco is that the Repository is local to each node - I haven't set up a shared filesystem (NFS) as I don't have another server with enough disk to act as the Repository.
Having said that, it had been working fine for almost a month before it stopped.
Any tips on troubleshooting the snapshots?
regards... Fred

Re: Snapshots stopped

Posted: Wed Jan 13, 2016 6:51 pm
by Box293
Fred Kroeger wrote: The only thing I have done differently from the doco is that the Repository is local to each node - I haven't set up a shared filesystem (NFS) as I don't have another server with enough disk to act as the Repository.
This needs to be a shared repository because either node can end up performing the job; I don't know the logic behind how it chooses the node. The snapshots stopping most likely coincided with a reboot of one of the nodes.

I'll get another tech to chime in on this too.

Re: Snapshots stopped

Posted: Wed Jan 13, 2016 8:24 pm
by Fred Kroeger
Thanks, Troy
Yes, I understand why the shared filesystem is required, which is why I tried it first as a local repository. It appears to choose the master node to run the snapshots on.
As it was all working, I was happy to leave it as it was, knowing that I could only restore a snapshot from a single node.
regards... Fred

Re: Snapshots stopped

Posted: Thu Jan 14, 2016 1:32 pm
by jolson
I haven't set up a shared filesystem (NFS) as I don't have another server with enough disk to act as the Repository.
The logic regarding the node that 'picks up' the backup job is random - it can be either node. It's very likely that you have had many backup jobs fail because of this, and the one node with the backup system would 'catch up' when it randomly received the backup job. I highly recommend setting up NFS (even on one of your NLS nodes) so that either node can perform the backup properly - the system can report spurious errors without proper backup mounts in place.

Now, your snapshots stopping could be related to this - or it might not be. I take it that the snapshots stopped working when you upgraded to 1.4.0? Try this:

Code:

cd /your/backup/directory
mkdir oldbackups
mv * oldbackups
Then re-run your backup job from the GUI (Administration -> Command Subsystem -> Backup/Maintenance). When the backup job runs, a random node will pick it up and attempt to run it - hopefully your master node (or if you've set up a shared repository, either node is fine).
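
One note on the `mv * oldbackups` step above: because `*` also matches the new `oldbackups` directory, `mv` will typically complain that it cannot move the directory into itself, although the other entries are still moved. A minimal sketch of the same step that skips the destination is below. The `move_old_backups` function name is hypothetical (it is not part of NLS), and the directory you pass in is whatever your backup path happens to be.

```shell
#!/bin/sh
# Hypothetical helper, not part of NLS: move every existing entry in a
# backup directory into an "oldbackups" subdirectory, skipping the
# subdirectory itself so mv never tries to move it into itself.
move_old_backups() {
    dir="$1"
    mkdir -p "$dir/oldbackups" || return 1
    for entry in "$dir"/*; do
        # Skip an unmatched glob (empty directory) and the destination.
        [ -e "$entry" ] || continue
        [ "$entry" = "$dir/oldbackups" ] && continue
        mv -- "$entry" "$dir/oldbackups/" || return 1
    done
}
```

For example, `move_old_backups /your/backup/directory` should leave only `oldbackups` at the top level of that directory. Like the original commands, it does not touch hidden (dot) files.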

Let me know if this helps - occasionally the old backup system needs to be moved to make way for the new one - but not all of the time.

Re: Snapshots stopped

Posted: Sun Jan 17, 2016 9:29 pm
by Fred Kroeger
Rather than test unsupported processes, I have now created an NFS share on one of the nodes and the other node has mounted it OK.
I moved the old backups as suggested - I can see the oldbackups folder on both nodes OK and the Snapshots table in the Backup/Maintenance screen now has no entries.
Ran the Backup Maint command, and now the table looks like the one below. What is strange is that there is an entry for 2015.01.07 - server wasn't built until December 2015 and only one index is listed as successful.
Looking at the Repository indices folder that was created, that rogue entry is shown in there now:

Code:

# ls -la repositories/indices/
total 180
drwxr-xr-x 45 nagios users 4096 Jan 18 10:00 .
drwxr-xr-x  4 nagios users 4096 Jan 18 10:09 ..
drwxr-xr-x  7 nagios users 4096 Jan 18 10:06 logstash-2015.01.07
drwxr-xr-x  7 nagios users 4096 Jan 18 10:04 logstash-2015.12.07
drwxr-xr-x  7 nagios users 4096 Jan 18 10:03 logstash-2015.12.08
drwxr-xr-x  7 nagios users 4096 Jan 18 10:04 logstash-2015.12.09
[Attachment: Snapshot.PNG]

Re: Snapshots stopped

Posted: Mon Jan 18, 2016 12:53 pm
by jolson
What is strange is that there is an entry for 2015.01.07 - server wasn't built until December 2015 and only one index is listed as successful.
Is there any chance that this index does exist on your server? Check the 'Administration -> Index Status' page and look for the index in question. If a machine that is sending logs has a date set in the past, it's possible that it is opening old indexes.
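
Since the logstash index names carry their date, one quick way to spot indices opened in the past is to filter the names against the date the server was built. The sketch below is only an illustration: the `indices_before` function name is hypothetical, and the cutoff date is something you supply yourself.

```shell
#!/bin/sh
# Hypothetical helper: print logstash-YYYY.MM.DD index names dated
# before a cutoff given as YYYY.MM.DD. Names are read from stdin, one
# per line; lines that do not match the pattern are ignored.
indices_before() {
    cutoff="$1"
    while IFS= read -r name; do
        case "$name" in
            logstash-[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9])
                day="${name#logstash-}"
                # Zero-padded YYYY.MM.DD strings sort chronologically,
                # so the earlier of the two dates sorts first.
                first=$(printf '%s\n%s\n' "$day" "$cutoff" | LC_ALL=C sort | head -n 1)
                if [ "$first" = "$day" ] && [ "$day" != "$cutoff" ]; then
                    printf '%s\n' "$name"
                fi
                ;;
        esac
    done
}
```

Something like `ls repositories/indices/ | indices_before 2015.12.01` would then list only the index directories dated before December 2015 (the cutoff here is just an example value).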

Jesse

Re: Snapshots stopped

Posted: Mon Jan 18, 2016 7:22 pm
by Fred Kroeger
Didn't think it could happen - but it got worse....
Yes, there is definitely an index for 07/01/15:
[Attachment: index.PNG]
However.... today in my Snapshot table I have two entries for every index, and last night's snapshot updated the index for 27/12/15?
[Attachment: Snapshot-2.PNG]

Re: Snapshots stopped

Posted: Tue Jan 19, 2016 10:30 am
by jolson
It would be useful to look at the index that was opened in the past so that you may see which server is causing the problem. Use a custom time filter to travel back in time:
[Attachment: 2016-01-19 09_17_42-Dashboard • Nagios Log Server.png]
Your snapshot subsystem looks fine - we have moved to an incremental backup system, so that snapshots will always be taken of every index regardless of whether or not the data has changed.

The interface is a little messy now, which will be corrected in a future version. Thanks!

Re: Snapshots stopped

Posted: Wed Jan 20, 2016 2:57 am
by Fred Kroeger
You're right about the messy listing - I now have 3 listings, each containing all the indexes.
So back to the strange index - it's a syslog from the NLS server itself. It appears that there was a 6-minute window where the logs reported the wrong date. This would have coincided with when I upgraded NLS; I'm not sure how or why the date would get changed during the upgrade, but I will delete it.
Thanks for getting me on the right track.

BTW - when I went to delete the index via the Index Status Screen - noticed a typo at the bottom of the screen - should be " indices "
[Attachment: Capture.PNG]
regards... Fred

Re: Snapshots stopped

Posted: Wed Jan 20, 2016 12:06 pm
by jolson
Good call! I'll get that typo fixed up - thanks. Is this case good to close?