multiple snapshots_maintenance jobs running

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

multiple snapshots_maintenance jobs running

Post by CBoekhuis »

This weekend a problem arose on our 2-node cluster: the daily snapshots_maintenance schedule triggers multiple snapshots_maintenance jobs in parallel (?). Obviously they are getting in each other's way.

Both systems are running CentOS Linux release 7.5.1804 (Core).
NLS version 2.0.3 (last Friday, 15/06, I upgraded NLS from 2.0.2 to 2.0.3).

I've tried to resolve it by resetting all jobs, but that didn't help. I ran the snapshots_maintenance today with a tail on the jobs.log on both servers, so I can provide them if you need them (they are a bit lengthy ;) ).
As for NLS, besides the high load during the snapshots_maintenance, it's running fine, but this issue also has me very worried about the sanity of the snapshots.

Please let me know what info I can provide (system profile etc.) and I'll upload it to the thread.

Thanks in advance,
Hans Blom
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: multiple snapshots_maintenance jobs running

Post by scottwilkerson »

CBoekhuis wrote: As for NLS, besides the high load during the snapshots_maintenance, it's running fine
This can be normal if there are larger indexes.
CBoekhuis wrote: but this issue also has me very worried about the sanity of the snapshots.

This shouldn't matter at all: the snapshots are incremental, so the job could even be run hourly without problems and without taking up any additional space.
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: multiple snapshots_maintenance jobs running

Post by CBoekhuis »

Hi Scott,

Maybe I wasn't clear, but the schedule should (I assume) start 1 job consisting of a close index, delete index and snapshot step (etc.), not 28 of them at the same time.
[img]snapshot.PNG[/img]
Resulting in 1 snapshot and not 28 (or whatever random number) snapshots.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: multiple snapshots_maintenance jobs running

Post by scottwilkerson »

Oh ya, that doesn't sound normal, I thought you just ended up with 2.

You can send your files. Also, did they complete and does the Command Subsystem list show SUCCESS?
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: multiple snapshots_maintenance jobs running

Post by CBoekhuis »

No problem. Yes, the schedule shows as success and the jobs complete, but you'll see a lot of "snapshot already running" messages, and indices that cannot be found/deleted because the first run obviously already deleted them.
subcommand.PNG
I'll upload the other system profile in the next message due to restrictions.

Heads up, colog3 is the master concerning these logfiles.
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: multiple snapshots_maintenance jobs running

Post by CBoekhuis »

And here's the other system-profile.
Due to the TZ difference I'm signing off for the day ;) . Have a nice one and thanks in advance.

Greetz....Hans
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: multiple snapshots_maintenance jobs running

Post by scottwilkerson »

Has this happened multiple times or just this one time?

I've looked over the logs and don't see anything alarming, other than that the jobs log shows it trying to run the jobs multiple times.
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: multiple snapshots_maintenance jobs running

Post by CBoekhuis »

This started last weekend and happens every time the schedule is run.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: multiple snapshots_maintenance jobs running

Post by scottwilkerson »

I am trying to replicate this but not having any success. I can't seem to find anything that would cause this to happen.

I will let you know if I have a breakthrough.
CBoekhuis
Posts: 234
Joined: Tue Aug 16, 2011 4:55 am

Re: multiple snapshots_maintenance jobs running

Post by CBoekhuis »

Hi Scott,

Something else came to mind. It might be totally unrelated, but I think it's worth mentioning, especially since it's such a vague situation.
I'll upload the elasticsearch logfile from the master node from last Saturday. At the beginning of the file you'll find the following 2 entries:


[2018-06-16 12:21:37,286][INFO ][cluster.metadata         ] [dd139ec4-41a3-4780-95ef-9a564fb414ef] [logstash-2018.12.20] creating index, cause [auto(bulk api)], templates [logstash], shards [5]/[1], mappings [_default_, syslog]
[2018-06-16 12:21:38,124][INFO ][cluster.metadata         ] [dd139ec4-41a3-4780-95ef-9a564fb414ef] [logstash-2018.12.20] update_mapping [syslog] (dynamic)
What happened here is that someone from the networks team started syslog on a switch that passes an incomplete/malformed date, resulting in the future-dated index logstash-2018.12.20. 24 seconds later the backup_maintenance schedule started.
At 12:36 I received messages from XI because of an unexpected load on both nodes. I instructed him to switch off syslog on the switch and I deleted the future-dated index. But looking at the logfile, the snapshot(s) hadn't finished yet.

Maybe it has nothing to do with it, but you never know.
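
For anyone who wants to check for the same kind of stray index, here is a minimal sketch of how future-dated indices could be spotted from a list of index names. It assumes the names follow the logstash-YYYY.MM.DD pattern seen in the log entries above. The function name `future_indices` and the sample list are hypothetical and only for illustration; this is not part of NLS or Elasticsearch, and how you obtain the list of index names is left open.

```python
import re
from datetime import date, datetime

# Assumed naming pattern, based on the index name in the log excerpt above.
INDEX_PATTERN = re.compile(r"^logstash-(\d{4}\.\d{2}\.\d{2})$")


def future_indices(index_names, today):
    """Return the index names whose embedded date lies after `today`."""
    flagged = []
    for name in index_names:
        match = INDEX_PATTERN.match(name)
        if not match:
            continue  # not a daily logstash index, ignore it
        try:
            index_date = datetime.strptime(match.group(1), "%Y.%m.%d").date()
        except ValueError:
            continue  # digits that do not form a real calendar date
        if index_date > today:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Hypothetical sample list; only the last name comes from the log above.
    names = ["logstash-2018.06.15", "logstash-2018.06.16", "logstash-2018.12.20"]
    print(future_indices(names, date(2018, 6, 16)))
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the check reproducible, so it can be rerun against an older list of index names.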