daily system backup getting out of control

CBoekhuis · Post by **CBoekhuis** » Thu Aug 24, 2023 8:18 am

Hi,

Hope you can help with this one: the daily system backup is getting larger by the day. At the moment each node produces an 8.5 GB tar.gz, which takes hours to complete.
The real problem is that the backups increasingly fail for all kinds of reasons, such as "Waiting for available slot." messages, or worse, when "{"acknowledged":true,"persistent":{},"transient":{"plugin":{"knapsack":{"export":{"state":"[]"}}}}}" messages appear in /tmp/backup.log.
The worst case is when it only produces a 26303-byte tar.gz that is broken and empty. At that point I have to restart elasticsearch to get it working again.

Reading the manual, I saw that the system backup also saves the audit log, alongside dashboards etc. That made me wonder: what is the retention of the audit log? Especially since we have "save user query to audit" enabled as well.
Digging through the audit log, I ended up somewhere in 2016, when we started our cluster. That might explain why the backup is so large and always growing.

The question is: is there no retention on the audit log (or on other logging saved in the system backup)? If not, how can I set a retention, or at least clear out some old data? Unless something else is going on, in which case I would appreciate some help. Having no backup is never good.

Nagios Log Server version is 2.1.15 on CentOS 7.9.

Kind Regards,
Hans Blom

swolf · Post by **swolf** » Thu Aug 24, 2023 1:23 pm

Hi @CBoekhuis, thanks for reaching out.

I agree with you that 1) there isn't a proper retention configuration for the NLS audit log, and 2) there should be. I've filed a feature request on your behalf.

To handle the immediate situation, I would determine how much data you want to keep, and find an approximate timestamp (unix epoch in milliseconds) that corresponds to the oldest record you'd like to keep.
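A minimal sketch for working out that timestamp, assuming GNU `date` (as shipped with CentOS 7). The helper name `epoch_ms` is only for illustration; it converts a UTC date into the Unix epoch in milliseconds:

```shell
# Convert a UTC date string to Unix epoch milliseconds.
# Assumes GNU date (the -d option); epoch_ms is a hypothetical helper name.
epoch_ms() {
  echo "$(( $(date -u -d "$1" +%s) * 1000 ))"
}

# Example: keep everything newer than 1 January 2023, 00:00 UTC.
epoch_ms '2023-01-01 00:00:00'   # prints 1672531200000
```

The date is interpreted as UTC because of `-u`; drop that flag if the cutoff should be in the server's local time zone.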

Once you're absolutely sure you have the right time, I would take a VM- or server-level backup/snapshot, then run this query on your terminal, replacing my timestamp with yours:

curl -XDELETE 'localhost:9200/nagioslogserver_log/_query' -d '{ "query": { "range" : { "created": { "lte": 1692900851854 } } } }'

Please be very careful, as we don't have a defined process for fixing a mistake here.
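One way to lower the risk is to count what the range query matches before deleting anything. The sketch below assumes the same host, index, and `created` field as the delete command above, and uses the Elasticsearch `_count` endpoint; the function names are hypothetical:

```shell
# Build the same range query body the delete uses, for a cutoff in epoch ms.
range_query() {
  printf '{ "query": { "range" : { "created": { "lte": %s } } } }' "$1"
}

# Count matching documents without deleting them.
# Assumes Elasticsearch is listening on localhost:9200.
count_older_than() {
  curl -s -XGET 'localhost:9200/nagioslogserver_log/_count' -d "$(range_query "$1")"
}

# Usage (needs a running Elasticsearch node):
#   count_older_than 1692900851854
```

If the reported count looks plausible for the period being removed, the same body can be passed to the delete request.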

Hopefully that helps - please let me know if you have any further questions or concerns.

-Sebastian Wolf

CBoekhuis · Post by **CBoekhuis** » Fri Aug 25, 2023 6:06 am

Hi Sebastian,

Thank you for your help! I just tested this on our test cluster and it works. Now I'll gradually reduce the audit log on our production cluster.
Looking forward to seeing this as a feature in a future release. Can you close this topic?

Greetings....Hans

CBoekhuis · Post by **CBoekhuis** » Fri Aug 25, 2023 8:12 am

I was a little too fast. It turns out that on our production cluster the big chunk of data is in the nagioslogserver_history index. It is 47.5 GB.
I take it that I can use the same command you provided, replacing nagioslogserver_log with nagioslogserver_history?

Maybe the feature request should include this index as well?
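To see which index accounts for the size, the Elasticsearch `_cat/indices` endpoint can list document counts and on-disk size per index. This is a sketch assuming the node listens on localhost:9200; the function names are hypothetical:

```shell
# Build the _cat/indices URL for an index pattern (host and port are assumptions).
cat_indices_url() {
  printf 'localhost:9200/_cat/indices/%s?v&h=index,docs.count,store.size' "$1"
}

# List document count and on-disk size for every matching index.
show_index_sizes() {
  curl -s "$(cat_indices_url "$1")"
}

# Usage (needs a running Elasticsearch node):
#   show_index_sizes 'nagioslogserver*'
```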

Thanks!

CBoekhuis · Post by **CBoekhuis** » Mon Aug 28, 2023 7:53 am

In the meantime I cleared out the entire nagioslogserver_history index.
I don't know if there's a retention set on this index, but just as with the audit log, it would be a valuable feature if this were configurable as well.

Kind Regards...Hans
