multiple snapshots_maintenance jobs running
Posted: Tue Jun 19, 2018 9:44 am
by CBoekhuis
This weekend a problem arose on our 2-node cluster: the daily snapshots_maintenance schedule triggers multiple snapshots_maintenance jobs in parallel. Obviously they are interfering with each other.
Both systems are running CentOS Linux release 7.5.1804 (Core).
NLS version 2.0.3 (last Friday, 15/06, I upgraded NLS from 2.0.2 to 2.0.3).
I've tried to resolve it by resetting all jobs, but that didn't help. I ran snapshots_maintenance today with a tail on the jobs.log on both servers, so I can provide the logs if you need them (they are a bit lengthy).
As for NLS itself, apart from the high load during snapshots_maintenance it's running fine, but this issue also has me very worried about the integrity of the snapshots.
Please let me know what info I can provide (system profile etc.) and I'll upload it to the thread.
Thanks in advance,
Hans Blom
Re: multiple snapshots_maintenance jobs running
Posted: Tue Jun 19, 2018 11:30 am
by scottwilkerson
CBoekhuis wrote:As for NLS, besides the high load during the snapshots_maintenance, it's running fine
This can be normal if there are larger indexes.
CBoekhuis wrote:but this issue also has me very worried about the sanity of the snapshots.
This actually shouldn't matter at all. The snapshots are differential, so the job could even run hourly without problems and without taking up any additional space.
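If you want to double-check the snapshot history yourself, the embedded Elasticsearch exposes it over its REST API. A minimal sketch (the endpoint `localhost:9200` and the repository name `logbackups` are placeholders; substitute whatever `GET /_snapshot` reports on your cluster):

```shell
# Endpoint of the embedded Elasticsearch; adjust host/port if yours differs.
ES="${ES:-http://localhost:9200}"

# First discover which snapshot repositories are registered on the cluster.
# ("|| true" just keeps the sketch safe to paste if the node is unreachable.)
curl -s "$ES/_snapshot?pretty" || true

# Then list every snapshot in one repository ("logbackups" is a placeholder)
# together with its state: SUCCESS, PARTIAL, or FAILED.
curl -s "$ES/_snapshot/logbackups/_all?pretty" || true
```

Any snapshot in state PARTIAL or FAILED would be a reason to worry; a list of SUCCESS entries means the history is intact.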
Re: multiple snapshots_maintenance jobs running
Posted: Tue Jun 19, 2018 12:00 pm
by CBoekhuis
Hi Scott,
Maybe I'm not being clear, but the schedule should (I assume) start 1 job consisting of a close index, a delete index, and a snapshot (etc.), not 28 jobs at the same time.
[img]snapshot.PNG[/img]
That should result in 1 snapshot, not 28 (or whatever random number) snapshots.
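One quick way to see how many snapshots are actually in flight while the schedule runs is Elasticsearch's snapshot status endpoint. A sketch, assuming the embedded Elasticsearch listens on `localhost:9200` (adjust if yours differs):

```shell
# Endpoint of the embedded Elasticsearch; adjust host/port if yours differs.
ES="${ES:-http://localhost:9200}"

# Show snapshots currently in progress across all repositories. While
# snapshots_maintenance runs there should be at most one in-flight
# snapshot listed here, not dozens.
# ("|| true" keeps the sketch safe to paste if the node is unreachable.)
curl -s "$ES/_snapshot/_status?pretty" || true
```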
Re: multiple snapshots_maintenance jobs running
Posted: Tue Jun 19, 2018 12:39 pm
by scottwilkerson
Oh yeah, that doesn't sound normal; I thought you had just ended up with 2.
You can send your files. Also, did the jobs complete, and does the Command Subsystem list show SUCCESS?
Re: multiple snapshots_maintenance jobs running
Posted: Tue Jun 19, 2018 1:42 pm
by CBoekhuis
No problem. Yes, the schedule shows as SUCCESS and the jobs do complete, but you'll see a lot of "snapshot already running" messages, and indexes that cannot be found/deleted because the first run has obviously already deleted them.
subcommand.PNG
I'll upload the other system profile in the next message due to the attachment restriction.
Heads up: colog3 is the master as far as these logfiles are concerned.
Re: multiple snapshots_maintenance jobs running
Posted: Tue Jun 19, 2018 1:44 pm
by CBoekhuis
And here's the other system-profile.
Due to the TZ difference I'm signing off for the day. Have a nice one, and thanks in advance.
Greetz....Hans
Re: multiple snapshots_maintenance jobs running
Posted: Wed Jun 20, 2018 9:26 am
by scottwilkerson
Has this happened multiple times or just this once?
I've looked over the logs and don't see anything alarming, other than the jobs log showing the jobs being started multiple times.
Re: multiple snapshots_maintenance jobs running
Posted: Thu Jun 21, 2018 2:11 am
by CBoekhuis
This started last weekend and happens every time the schedule is run.
Re: multiple snapshots_maintenance jobs running
Posted: Thu Jun 21, 2018 2:19 pm
by scottwilkerson
I am trying to replicate this, but without any success; I can't seem to find anything that would cause it to happen.
I will let you know if I have a breakthrough.
Re: multiple snapshots_maintenance jobs running
Posted: Fri Jun 22, 2018 6:55 am
by CBoekhuis
Hi Scott,
Something else came to mind. It might be totally unrelated, but I think it's worth mentioning, especially since this is such a vague situation.
I'll upload the Elasticsearch logfile from the master node from last Saturday. At the beginning of the file you'll find the following 2 entries:
Code:
[2018-06-16 12:21:37,286][INFO ][cluster.metadata ] [dd139ec4-41a3-4780-95ef-9a564fb414ef] [logstash-2018.12.20] creating index, cause [auto(bulk api)], templates [logstash], shards [5]/[1], mappings [_default_, syslog]
[2018-06-16 12:21:38,124][INFO ][cluster.metadata ] [dd139ec4-41a3-4780-95ef-9a564fb414ef] [logstash-2018.12.20] update_mapping [syslog] (dynamic)
What happened here is that someone from the network team enabled syslog on a switch that sends an incomplete/malformed date, which got parsed into a future-dated index, logstash-2018.12.20. 24 seconds later the snapshots_maintenance schedule started.
At 12:36 I started receiving messages from XI because of an unexpected load on both nodes. I instructed him to switch off syslog on the switch and I deleted the future-dated index, but looking at the logfile, the snapshot(s) hadn't finished yet.
Maybe it has nothing to do with it, but you never know.
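For what it's worth, future-dated indexes like that one can be spotted from the command line before they get swept up in a snapshot run. A sketch using the cat indices API (the endpoint is a placeholder; the index name comes from the log excerpt above, and the delete is left commented out so nothing is removed by accident):

```shell
# Endpoint of the embedded Elasticsearch; adjust host/port if yours differs.
ES="${ES:-http://localhost:9200}"

# Today's logstash index name, e.g. logstash-2018.06.22. Because the
# date is zero-padded, plain string comparison sorts chronologically,
# so any index name comparing greater than this is dated in the future.
TODAY="logstash-$(date +%Y.%m.%d)"

# List index names and flag the future-dated ones.
# ("|| true" keeps the sketch safe to paste if the node is unreachable.)
curl -s "$ES/_cat/indices?h=index" | sort | \
  awk -v t="$TODAY" '$1 > t { print "future-dated:", $1 }' || true

# Once the misbehaving log source is fixed, delete the bogus index
# (example name taken from the log excerpt above):
# curl -s -XDELETE "$ES/logstash-2018.12.20"
```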