Logstash fails just before or during scheduled snapshot

Posted: Wed Nov 21, 2018 10:17 am
by rferebee
Hello,

We've encountered an issue where logstash is failing just before or during our scheduled snapshots, and we're unable to complete snapshots as a result. In the most recent occurrence, logstash failed about 2 hours into our snapshot window, and Log Server is acting like it's still running a snapshot even though Command Subsystem shows Job Status "Waiting" with a next run time of 1:30 AM tomorrow.

Typically we schedule our snapshots to run at 22:30 every day. Recently, our snapshots seem to be running whenever they feel like it and won't complete until well into the following day. If logstash fails after our snapshot starts then the Command Subsystem will say that the snapshot started right when logstash failed.

Right now, my entire interface is locked up because I clicked on Snapshots & Maintenance, which usually only happens when a snapshot is still running. For several weeks now our snapshots have been extremely sporadic. We're at the point where we're losing data because our indexes are no longer overlapping in the snapshots.

I'm not sure what else I can look at to figure out what's going on. Thank you.

Re: Logstash fails just before or during scheduled snapshot

Posted: Wed Nov 21, 2018 10:30 am
by rferebee
I should add some information about our setup:

2 VMs running in a cluster (LS1 and LS2)

6 CPUs per VM
64GBs of RAM per VM
6TB of storage per VM

Our Log Server repository has 62TB of storage with 10TB free (we recently increased this storage space when we found out we could no longer create a snapshot).

Currently, each one of our snapshots is roughly 3TB. We snapshot 20 indexes at about 150GB per index.

Re: Logstash fails just before or during scheduled snapshot

Posted: Wed Nov 21, 2018 11:10 am
by cdienger
It sounds like you may be running into the problem described in https://support.nagios.com/kb/article/n ... g-576.html. Go through the doc and make the changes to both NLS machines.

Re: Logstash fails just before or during scheduled snapshot

Posted: Wed Nov 21, 2018 11:43 am
by rferebee
I will make the changes and update you accordingly. Thank you very much for the prompt reply!

Re: Logstash fails just before or during scheduled snapshot

Posted: Wed Nov 21, 2018 12:13 pm
by cdienger
No problem! We'll be here (except after 2 today and the rest of the week :) ) waiting for the results.

Re: Logstash fails just before or during scheduled snapshot

Posted: Mon Nov 26, 2018 10:56 am
by rferebee
Ok, so this seems to have resolved our issue.

I was wondering if you could elaborate on what those two variables control, though? It looks like we're getting snapshots, but each one takes up drastically less disk space for whatever reason. Each snapshot looks like it's only a few hundred GB since making this change.

Re: Logstash fails just before or during scheduled snapshot

Posted: Mon Nov 26, 2018 4:07 pm
by cdienger
Glad to hear it. The memory option allocates more memory to the logstash Java process (it starts logstash with the Java option "-Xmx1000m"). The log file option allows the OS to open more files at a time. It's more likely the memory option resolved the issue, and the earlier large snapshots could be the result of incomplete snapshots. These will clear out on their own over time per the maintenance settings.
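
If you want to double-check which heap limit a Java process was actually started with, here is a minimal Python sketch that pulls the -Xmx value out of a command line and converts it to bytes. The function name parse_xmx and the example command line are made up for illustration; only the "-Xmx1000m" option comes from this thread.

```python
import re

# Multipliers for the size suffixes the JVM accepts on -Xmx (no suffix means bytes).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_xmx(cmdline):
    """Return the max heap size in bytes from a JVM command line.

    Returns None if no -Xmx option is present. If -Xmx appears more
    than once, the last occurrence is used, since the JVM honours the
    last one given.
    """
    matches = re.findall(r"-Xmx(\d+)([kKmMgG]?)(?=\s|$)", cmdline)
    if not matches:
        return None
    number, suffix = matches[-1]
    return int(number) * _UNITS[suffix.lower()]


if __name__ == "__main__":
    # Hypothetical command line, for illustration only.
    example = "java -Xmx1000m -jar logstash.jar"
    print(parse_xmx(example))
```

On Linux, the string to feed in could come from /proc/<pid>/cmdline for the logstash process, with the NUL separators replaced by spaces.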

Re: Logstash fails just before or during scheduled snapshot

Posted: Tue Nov 27, 2018 10:58 am
by rferebee
Ok, thanks for the information.

It's my understanding that a snapshot is a backup of all the open indexes, up to whatever number the user specifies, in my case 20. If each of my indexes is 150GB, shouldn't my snapshot be around 3TB? Or is there some type of optimization going on, which I'm not aware of, that reduces the size of the indexes when not in use?

Re: Logstash fails just before or during scheduled snapshot

Posted: Tue Nov 27, 2018 3:02 pm
by cdienger
A snapshot is a diff from the previous snapshot.
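
To illustrate why a later snapshot can be much smaller than the indexes it covers, here is a rough Python sketch of the idea. It is a toy model, not how Log Server is implemented: the repository is just a dict of file name to size, and the names snapshot_cost, seg_a, seg_b and seg_c are made up for the example.

```python
def snapshot_cost(repository, index_files):
    """Model one incremental snapshot.

    repository  -- dict of file name -> size already stored in the repository
    index_files -- dict of file name -> size currently making up the indexes

    Returns (bytes_copied, updated_repository). Only files that are not
    already in the repository are copied.
    """
    new_files = {
        name: size for name, size in index_files.items() if name not in repository
    }
    updated = dict(repository)
    updated.update(new_files)
    return sum(new_files.values()), updated


if __name__ == "__main__":
    repo = {}
    # First snapshot: nothing is in the repository yet, so everything is copied.
    copied, repo = snapshot_cost(repo, {"seg_a": 100, "seg_b": 50})
    print(copied)
    # Second snapshot: only the file added since the first one is copied.
    copied, repo = snapshot_cost(repo, {"seg_a": 100, "seg_b": 50, "seg_c": 20})
    print(copied)
```

In this model the first snapshot costs the full size of the data, and each one after that costs only what changed since the previous snapshot.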

Re: Logstash fails just before or during scheduled snapshot

Posted: Tue Nov 27, 2018 4:11 pm
by rferebee
Oh, so it's incremental. Gotcha.