snapshots – problem of large, geo backup repo

Posted: Fri Nov 06, 2020 4:55 am
by dariusz.nalazek
Hello.

I’m running Nagios LS (cluster) in datacenter A, but the backup repo (snapshots) is in datacenter B.
Latency between them is about 8-10ms and the link has plenty of spare bandwidth.

Remote snapshots are implemented “by the book” as published by Nagios (for CIFS/NFS).

The problem shows up when the log storage becomes “a bit larger”: https://nagios_ls/nagioslogserver/admin/snapshots starts to time out with a message like “Gateway Timeout. The gateway did not receive a timely response from the upstream server or application. (504)”. Backups still work; only the administration page is unavailable.

When the admin/snapshots page is opened, Nagios LS runs the command "du -sh /path_to_network_share", and the execution time of that command causes the problem.

With local storage (SCSI/FC) the processing time is negligible, too small to worry about.
With the repo in the local datacenter there was no problem with 2TB of data.
But over a 10ms link to another DC it’s a big deal :(

1TB via CIFS takes about 1min to process, so 2TB takes ~2min and so on up to ~100min for 100TB, which means the page times out even at 2TB.
1TB via NFS takes anywhere from 40sec to 30min depending on the NFS cache: e.g. the first run of ‘du’ takes 30min, the next one less than 1min, and once the cache expires the time is very long again.
So 100TB of data could take anywhere from ~100min to ~50 hours to process, depending on protocol/settings, but in general the problem starts at ~1TB of backup repo. My target repo size is about 100TB, which is why I use that figure.
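
The estimate above is plain linear scaling, which can be written as a minimal sketch. The function name estimate_du_minutes is made up for illustration, and the per-TB rates are just the figures quoted in this post, not measurements from any other setup:

```shell
#!/bin/sh
# Rough linear estimate of a full 'du' scan time.
# $1 = repo size in TB, $2 = observed minutes per TB (whole numbers).
estimate_du_minutes() {
    size_tb=$1
    minutes_per_tb=$2
    echo $(( size_tb * minutes_per_tb ))
}

# Rates quoted in the post: ~1 min/TB over CIFS,
# up to ~30 min/TB over NFS when the cache is cold.
echo "CIFS, 100TB:     $(estimate_du_minutes 100 1) min"
echo "NFS cold, 100TB: $(estimate_du_minutes 100 30) min"
```

The second figure, 3000 minutes, is the ~50 hours mentioned above.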

Is there any workaround for this? For example, skipping the ‘du’ check, and/or running du ahead of time (e.g. once every xx hours/minutes), writing the output to a temp file, and reading the data from that file when the page loads?
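
A minimal sketch of that "pre-check and cache" idea, assuming a repo mounted at /mnt/snapshot_repository and a cache file under /var/tmp (both paths and both function names are hypothetical). Nagios LS does not read such a file on its own, so this only illustrates the mechanism:

```shell
#!/bin/sh
# Hypothetical locations; adjust to your setup.
REPO_PATH="${REPO_PATH:-/mnt/snapshot_repository}"
CACHE_FILE="${CACHE_FILE:-/var/tmp/snapshot_repo_du.cache}"

# Slow path: run the real 'du' once and store its output.
# Intended to be called from cron, not on page load.
refresh_du_cache() {
    tmp="${CACHE_FILE}.$$"
    if du -sh "$REPO_PATH" > "$tmp" 2>/dev/null; then
        # Rename into place so a reader never sees a half-written file.
        mv "$tmp" "$CACHE_FILE"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Fast path: print the last stored result without touching the share.
read_du_cache() {
    if [ -r "$CACHE_FILE" ]; then
        cat "$CACHE_FILE"
    else
        printf 'unknown\t%s\n' "$REPO_PATH"
    fi
}
```

refresh_du_cache could be run from cron every few hours, while anything that needs the size calls read_du_cache and returns immediately.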

Darek.

Re: snapshots – problem of large, geo backup repo

Posted: Fri Nov 06, 2020 4:00 pm
by cdienger
This 'du -sh' behavior is hardcoded so I'll have to ping dev to see if we can change it or provide a patch. Please ping me again on Monday if I don't follow up before then.

In the meantime try moving the /usr/bin/du binary found on the NLS system. NLS expects du to be in this directory so if we move it it may help with the timeouts.

Re: snapshots – problem of large, geo backup repo

Posted: Mon Nov 09, 2020 3:17 am
by dariusz.nalazek
The temporary workaround (removing/moving the du command) seems to work: the admin page shows up, only the storage usage info is gone (as expected).

However, I can’t make this solution permanent: a few ext and xfs checks in my Nagios XI are based on du. So when I want to open the admin page I’ll temporarily rename du ;) Good enough while waiting for a permanent solution.

Re: snapshots – problem of large, geo backup repo

Posted: Mon Nov 09, 2020 3:53 am
by dariusz.nalazek
A minor workaround, better than moving/deleting du ;)
Rename du to du2 and create a small script /usr/bin/du like the one below:

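#!/bin/sh
#
# info:
#   minor workaround for the du problem with a big geo backup repo
#   run: mv -v /usr/bin/du /usr/bin/du2
#   create the /usr/bin/du script as below
#   run: chmod +x /usr/bin/du
#   and done.

repo_path="/mnt/snapshot_repository"

# Nagios LS calls "du -sh <repo_path>", so the path is the second argument.
# Quoting "$2" keeps the test valid when fewer than two arguments are given.
if [ "$2" != "$repo_path" ]
  then
    # Any other call goes to the real du with its arguments intact.
    exec /usr/bin/du2 "$@"
  else
    echo "Do nothing - Nagios LS workaround."
fi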
Darek.
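
The script above only looks at the second argument, which matches the "du -sh <path>" call described earlier in the thread. A hedged variant, assuming the repo path could appear in any argument position or with a trailing slash, might look like the sketch below. The names targets_repo, du_wrapper and REAL_DU are made up for illustration:

```shell
#!/bin/sh
REPO_PATH="${REPO_PATH:-/mnt/snapshot_repository}"
# The renamed original binary, as in the workaround above.
REAL_DU="${REAL_DU:-/usr/bin/du2}"

# Succeeds if any argument is the repo path (one trailing slash ignored).
targets_repo() {
    for arg in "$@"; do
        [ "${arg%/}" = "$REPO_PATH" ] && return 0
    done
    return 1
}

# Skip the slow scan for the repo, pass everything else to the real du.
du_wrapper() {
    if targets_repo "$@"; then
        echo "Do nothing - Nagios LS workaround."
    else
        "$REAL_DU" "$@"
    fi
}
```

Used as the body of /usr/bin/du, the script would end with a call to du_wrapper "$@".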

Re: snapshots – problem of large, geo backup repo

Posted: Mon Nov 09, 2020 4:58 pm
by cdienger
Thanks for the follow-up and the additional workaround. I also sent you a PM with a patch. Please test and let me know if that works for you.