multiple snapshots_maintenance jobs running

CBoekhuis · Post by **CBoekhuis** » Tue Jun 19, 2018 9:44 am

This weekend a problem arose on our 2 node cluster where the daily snapshots_maintenance schedule triggers multiple snapshots_maintenance jobs in parallel (?). Obviously they are biting each other.

Both systems running CentOS Linux release 7.5.1804 (Core)
NLS version 2.0.3 (last Friday, 15/06, I upgraded NLS from 2.0.2 to 2.0.3)

I've tried to resolve it by resetting all jobs, but that didn't help. I ran the snapshots_maintenance today with a tail on the jobs.log on both servers, so I can provide them if you need them (they are a bit lengthy).
As for NLS, besides the high load during the snapshots_maintenance, it's running fine, but this issue also has me very worried about the sanity of the snapshots.

Please let me know what info I can provide (system profile etc.) and I'll upload it to the thread.

Thanks in advance,
Hans Blom

scottwilkerson · Post by **scottwilkerson** » Tue Jun 19, 2018 11:30 am

CBoekhuis wrote:As for NLS, besides the high load during the snapshots_maintenance, it's running fine

This can be normal if there are larger indexes.

CBoekhuis wrote:but this issue also has me very worried about the sanity of the snapshots.

This actually shouldn't matter at all, as the snapshots are differential. It could be run hourly and be fine, without even taking up any additional space.

CBoekhuis · Post by **CBoekhuis** » Tue Jun 19, 2018 12:00 pm

Hi Scott,

Maybe I wasn't clear, but the schedule should (I assume) start 1 job consisting of a close index, delete index and snapshot (etc.), not 28 of them at the same time.
[img]snapshot.PNG[/img]
Resulting in 1 snapshot and not 28 (or whatever random number) snapshots.

scottwilkerson · Post by **scottwilkerson** » Tue Jun 19, 2018 12:39 pm

Oh ya, that doesn't sound normal, I thought you just ended up with 2.

You can send your files. Also, did they complete and does the Command Subsystem list show SUCCESS?

CBoekhuis · Post by **CBoekhuis** » Tue Jun 19, 2018 1:42 pm

No problem. Yes, the schedule will show as success and they will complete, but you'll see a lot of "snapshot already running" messages, and an index cannot be found/deleted because the first run obviously already deleted it.

subcommand.PNG

I'll upload the other system profile in the next message due to the attachment restriction.

Heads up, colog3 is the master concerning these logfiles.

CBoekhuis · Post by **CBoekhuis** » Tue Jun 19, 2018 1:44 pm

And here's the other system-profile.
Due to the time zone difference, I'm signing off for the day. Have a nice one and thanks in advance.

Greetz....Hans

scottwilkerson · Post by **scottwilkerson** » Wed Jun 20, 2018 9:26 am

Has this happened multiple times or just this one time?

I've looked over the logs and don't see anything alarming, other than that the jobs log shows it trying to run the jobs multiple times.

CBoekhuis · Post by **CBoekhuis** » Thu Jun 21, 2018 2:11 am

This started last weekend and happens every time the schedule is run.

scottwilkerson · Post by **scottwilkerson** » Thu Jun 21, 2018 2:19 pm

I am trying to replicate this but not having any success. I can't seem to find anything that would cause this to happen.

I will let you know if I have a breakthrough.

CBoekhuis · Post by **CBoekhuis** » Fri Jun 22, 2018 6:55 am

Hi Scott,

Something else came to mind. It might be totally unrelated, but I think it's worth mentioning, especially since it's such a vague situation.
I'll upload the elasticsearch logfile from the master node from last Saturday. At the beginning of the file you'll find the following 2 entries:

Code: Select all

[2018-06-16 12:21:37,286][INFO ][cluster.metadata         ] [dd139ec4-41a3-4780-95ef-9a564fb414ef] [logstash-2018.12.20] creating index, cause [auto(bulk api)], templates [logstash], shards [5]/[1], mappings [_default_, syslog]
[2018-06-16 12:21:38,124][INFO ][cluster.metadata         ] [dd139ec4-41a3-4780-95ef-9a564fb414ef] [logstash-2018.12.20] update_mapping [syslog] (dynamic)

What happened here is that someone from networks started syslog on a switch, and its incomplete/malformed date was parsed into a future date, resulting in the index logstash-2018.12.20. 24 seconds later the backup_maintenance schedule starts.
At 12:36 I started receiving messages from XI because of an unexpected load on both nodes. I instructed him to switch off the syslog on the switch and I deleted the future-dated index. But looking at the logfile, the snapshot(s) hadn't finished yet at that point.

Maybe it has nothing to do with it, but you never know.
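
For anyone who wants to check for stray future-dated indices like logstash-2018.12.20, here is a minimal Python sketch. It assumes the index names have already been collected into a plain list, and that daily indices follow the `logstash-YYYY.MM.DD` pattern seen in the log above. The function name `future_dated_indices` and its parameters are made up for illustration and are not part of NLS.

```python
from datetime import date, datetime


def future_dated_indices(index_names, today, prefix="logstash-"):
    """Return the daily indices whose date suffix lies after `today`."""
    flagged = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not a daily log index
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a YYYY.MM.DD date, so skip it
        if day > today:
            flagged.append(name)
    return flagged


# Index names as they would have looked on the day of the incident.
names = ["logstash-2018.06.15", "logstash-2018.06.16", "logstash-2018.12.20"]
print(future_dated_indices(names, date(2018, 6, 16)))
```

This only flags candidates; whether a flagged index should be deleted is still a manual decision.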
