Snapshots stopped
-
Fred Kroeger
- Posts: 588
- Joined: Wed Oct 19, 2011 11:36 pm
- Location: Perth, Western Australia
- Contact:
Snapshots stopped
I'm running a 2-node NLS cluster (recently upgraded to NLS 1.4.0) and noticed that the snapshots stopped being generated on 2/1/16.
In the Snapshot Table (in the Backup & Maint screen) the name of all the existing snapshots (from 7/12/15 to 2/1/16) is showing as N/A which I'm pretty sure was not the case last year.
I did notice at the start of the year that snapshots had been created that had a date of 2014 so something happened around then I believe. I have deleted these since then.
The only thing I have done differently from the doco is that the Repository is local to each node - I haven't set up a shared filesystem (NFS) as I don't have another server with enough disk to act as the Repository.
Having said that, it had been working fine for almost a month before it stopped.
Any tips on troubleshooting the snapshots?
regards... Fred
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
Re: Snapshots stopped
Fred Kroeger wrote:The only thing I have done differently from the doco is that the Repository is local to each node - I haven't setup a shared filesystem (NFS) as I don't have another server with enough disk to act as the Repository.
This needs to be a shared repository, as the node which performs the job can be either one; I don't know the logic behind how it chooses the node. The failure most likely coincided with a reboot of one of the nodes.
I'll get another tech to chime in on this too.
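One quick way to confirm whether a repository path is really shared is to look at the filesystem type behind it on each node. The following is only a sketch: the function names are made up for illustration, the list of network filesystem types is not exhaustive, and it assumes GNU coreutils `stat` is available.

```shell
# Return 0 when the given filesystem type name looks like a network share.
# The list is illustrative, not exhaustive.
is_network_fstype() {
    case "$1" in
        nfs|nfs3|nfs4|cifs|smb2|smbfs|glusterfs|fuse.glusterfs|fuse.sshfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the type of the filesystem that holds the given path (GNU stat).
repo_fstype() {
    stat -f -c %T "$1"
}

# Report on one path; run it on every node against the same repository path.
check_repo() {
    local fstype
    fstype=$(repo_fstype "$1") || return 2
    if is_network_fstype "$fstype"; then
        echo "$1: $fstype (shared)"
    else
        echo "$1: $fstype (local to this node)"
        return 1
    fi
}
```

Running `check_repo /your/backup/directory` on both nodes should report a network type on any node that mounts the share. The node that exports the share will normally see its own local filesystem type unless it also mounts the export.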
-
Fred Kroeger
Re: Snapshots stopped
Thanks Troy
Yes I understand why the shared filesystem is required, which is why I tried it first as a local repository. It appears to choose the master node to run the snapshots on.
As it was all working, I was happy to leave it as it was knowing that I could only restore a snapshot from a single node.
regards... Fred
Re: Snapshots stopped
Fred Kroeger wrote:I haven't setup a shared filesystem (NFS) as I don't have another server with enough disk to act as the Repository.
The logic regarding the node that 'picks up' the backup job is random - it can be either node. It's very likely that you have had many backup jobs fail due to this, and the one node with the backup system would 'catch up' when it randomly received the backup job. I highly recommend setting up NFS (even on one of your NLS nodes) so that either node can perform the backup properly - the system can throw spurious errors without proper backup mounts in place.
Now, your snapshots stopping could be related to this - or it might not be. I take it that the snapshots stopped working when you upgraded to 1.4.0? Try this:
Code:
cd /your/backup/directory
mkdir oldbackups
mv * oldbackups
Let me know if this helps - occasionally the old backup system needs to be moved to make way for the new one - but not all of the time.
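Note that `mv * oldbackups` also tries to move `oldbackups` into itself (which prints a harmless error) and skips hidden files. A slightly more careful variant might look like the sketch below; the function name `archive_old_backups` is hypothetical, and it assumes a `find` that supports `-mindepth` and `-maxdepth` (GNU and BSD both do).

```shell
# Move every direct child of a backup directory into an "oldbackups"
# subfolder, skipping the subfolder itself. Hidden files are moved too.
archive_old_backups() {
    local repo_dir dest
    repo_dir=$1
    dest="$repo_dir/oldbackups"
    mkdir -p "$dest" || return 1
    # -mindepth 1 / -maxdepth 1 restrict find to direct children only.
    find "$repo_dir" -mindepth 1 -maxdepth 1 ! -name oldbackups \
        -exec mv -- {} "$dest"/ \;
}
```

Usage would be `archive_old_backups /your/backup/directory`, run on the node that holds the repository.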
-
Fred Kroeger
Re: Snapshots stopped
Rather than test unsupported processes, I have now created an NFS share on one of the nodes and the other node has mounted it OK.
I moved the old backups as suggested - I can see the oldbackups folder on both nodes OK and the Snapshots table in the Backup/Maintenance screen now has no entries.
Ran the Backup Maint command and now the table looks like below. What is strange is that there is an entry for 2015.01.07 - server wasn't built until December 2015 and only one index is listed as successful.
Looking at the Repository indices folder that was created, that rogue entry is shown in there now:
Code:
# ls -la repositories/indices/
total 180
drwxr-xr-x 45 nagios users 4096 Jan 18 10:00 .
drwxr-xr-x 4 nagios users 4096 Jan 18 10:09 ..
drwxr-xr-x 7 nagios users 4096 Jan 18 10:06 logstash-2015.01.07
drwxr-xr-x 7 nagios users 4096 Jan 18 10:04 logstash-2015.12.07
drwxr-xr-x 7 nagios users 4096 Jan 18 10:03 logstash-2015.12.08
drwxr-xr-x 7 nagios users 4096 Jan 18 10:04 logstash-2015.12.09
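With 45 entries in that folder, a rogue date is easy to miss by eye. A small filter such as the one sketched below could pick out any index directory dated before the server existed. The function name `indices_before` and the cutoff `2015.12.01` are assumptions chosen for illustration (the post only says the server was built in December 2015).

```shell
# Read index names from stdin and print those named logstash-YYYY.MM.DD
# whose date is earlier than the cutoff (also given as YYYY.MM.DD).
# Zero-padded dates sort correctly as plain strings, so a string
# comparison is enough.
indices_before() {
    awk -v cutoff="$1" '
        /^logstash-[0-9][0-9][0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]$/ {
            # "logstash-" is 9 characters, so the date starts at position 10.
            # Appending "" forces a string comparison.
            if ((substr($0, 10) "") < (cutoff "")) print
        }'
}
```

For example, `ls repositories/indices/ | indices_before 2015.12.01` would print `logstash-2015.01.07` for the four entries shown above.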
Re: Snapshots stopped
Fred Kroeger wrote:What is strange is that there is an entry for 2015.01.07 - server wasn't built until December 2015 and only one index is listed as successful.
Is there any chance that this index does exist on your server? Check the 'Administration -> Index Status' page and look for the index in question. If a machine that is sending logs has a date set in the past, it's possible that it is opening old indexes.
Jesse
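To illustrate why a wrong clock opens an old index: with Logstash's usual daily naming scheme (`logstash-YYYY.MM.dd`, taken from the event's timestamp in UTC), the index name follows the event's date and not the date the event arrived. The sketch below mimics that mapping; it assumes NLS uses this default naming, relies on GNU `date -d`, and the function name and sample timestamp are made up.

```shell
# Print the daily index name an event with the given timestamp would land in.
# The date is evaluated in UTC. Requires GNU date for the -d option.
index_for_timestamp() {
    date -u -d "$1" +logstash-%Y.%m.%d
}

# Example with a hypothetical event stamped in the past:
# index_for_timestamp '2015-01-07 03:15:00 UTC'   # prints logstash-2015.01.07
```

So a single host reporting a 2015-01-07 timestamp, even for a few minutes, would be enough to create or reopen `logstash-2015.01.07`.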
-
Fred Kroeger
Re: Snapshots stopped
Didn't think it could happen - but it got worse....
Yes, there is definitely an index for 07/01/15. However, today in my Snapshot table I have two entries for every index, and last night's snapshot updated the index for 27/12/15?
Re: Snapshots stopped
It would be useful to look at the index that was opened in the past so that you may see which server is causing the problem. Use a custom time filter to travel back in time:
Your snapshot subsystem looks fine - we have moved to an incremental backup system, so that snapshots will always be taken of every index regardless of whether or not the data has changed.
The interface is a little messy now, which will be corrected in a future version. Thanks!
-
Fred Kroeger
Re: Snapshots stopped
You're right about the messy listing; I now have 3 listings, each containing all the indexes.
So back to the strange index - it's a syslog from the NLS server itself. It appears that there was a 6-minute window where the logs reported the wrong date. This would have coincided with when I upgraded NLS. I'm not sure how or why the date would get changed during the upgrade, but I will delete the index.
Thanks for getting me on the right track.
BTW - when I went to delete the index via the Index Status Screen - noticed a typo at the bottom of the screen - should be " indices "
regards... Fred
Re: Snapshots stopped
Good call! I'll get that typo fixed up - thanks. Is this case good to close?