Nagios Support Forum

Posted: **Fri Nov 06, 2020 4:55 am**

Hello.

I'm using Nagios LS (cluster) in datacenter A, but the backup repo (snapshots) is in datacenter B.
The connection latency is about 8-10ms and the link has a lot of spare bandwidth.

The remote snapshot implementation is "by the book" as published by Nagios (for CIFS/NFS).

The problem shows up when the log storage becomes "a bit larger" and https://nagios_ls/nagioslogserver/admin/snapshots starts to time out with a message like "Gateway Timeout. The gateway did not receive a timely response from the upstream server or application. (504)". Backups still work, just the administration page is gone.

When the admin/snapshots page is opened, Nagios LS runs the command "du -sh /path_to_network_share", and the execution time of that command causes the problem.
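
One way to confirm that du is the slow step is to time it directly on the Nagios LS node with a hard cap. This is only a minimal sketch, assuming GNU coreutils `timeout` is installed; the function name `measure_du` and the 60 second default are made up for illustration and are not part of Nagios LS.

```shell
#!/bin/sh
# Hypothetical helper: time "du -sh" on a path and give up after a limit.
# Assumes GNU coreutils "timeout"; the 60s default is an arbitrary choice.
measure_du() {
    path="$1"
    limit="${2:-60}"          # seconds before du is killed
    start=$(date +%s)
    rc=0
    timeout "$limit" du -sh "$path" >/dev/null 2>&1 || rc=$?
    end=$(date +%s)
    elapsed=$((end - start))
    case "$rc" in
        0)   echo "finished in ${elapsed}s" ;;
        124) echo "timed out after ${limit}s" ;;   # 124 is timeout's own exit code
        *)   echo "du failed with exit code $rc after ${elapsed}s" ;;
    esac
}
```

Calling `measure_du /path_to_network_share 120` on the mounted share would then show whether du finishes within two minutes or has to be killed.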

With local storage (SCSI/FC) the processing time is irrelevant, just too small to worry about.
With the repo in the local datacenter, 2TB of data was no problem.
But over a 10ms link to another DC it's a big deal.

1TB via CIFS takes ~1min to process, so 2TB ~2min and so on, 100TB ~100min, so it causes a timeout even for 2TB.
1TB via NFS… execution time is from 40sec to 30min, depending on the NFS cache, so e.g. the 1st execution of 'du' takes 30min, the next one less than 1min, then when the cache expires the time is very long again.
So… 100TB of data can take from ~100min up to ~50 hours to process, depending on protocol/settings, but in general the problem starts at ~1TB of backup repo. My target repo size is about 100TB, which is where these values come from.
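
The estimates above are a straight linear extrapolation from the per-TB timings, which can be reproduced with a few lines of shell arithmetic. This is only a sketch: `estimate_minutes` is a made-up helper, and it assumes the time scales linearly with repo size. In practice du walks the directory tree and stats every file, so the file count matters as well as the byte count.

```shell
#!/bin/sh
# Hypothetical helper: estimate_minutes SIZE_TB SECONDS_PER_TB
# Prints the estimated scan time in whole minutes (rounded down),
# assuming the per-TB timing scales linearly.
estimate_minutes() {
    echo $(( ($1 * $2) / 60 ))
}

estimate_minutes 100 60      # CIFS at ~1 min per TB     -> 100
estimate_minutes 100 40      # NFS, warm cache, 40s/TB   -> 66
estimate_minutes 100 1800    # NFS, cold cache, 30min/TB -> 3000 (= 50 hours)
```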

Is there any workaround for it? Like skipping the 'du' check, and/or pre-computing du (e.g. once every xx hours/minutes), writing the output to some temp file, and then reading the data from that file when the page loads?
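
The "pre-compute and read from a file" idea could look roughly like the sketch below. It is not something Nagios LS provides: the cache location, the `DU_BIN` variable and both function names are assumptions made up for illustration.

```shell
#!/bin/sh
# Sketch of a cached du: one slow refresh (e.g. from cron), many fast reads.
# CACHE_FILE and DU_BIN are hypothetical names, not Nagios LS settings.
CACHE_FILE="${CACHE_FILE:-/var/tmp/snapshot_repo_du.cache}"
DU_BIN="${DU_BIN:-du}"

# Slow part: walk the share once and store the result.
# Writing to a temp file and renaming keeps readers from seeing a partial file.
refresh_du_cache() {
    repo="$1"
    tmp="${CACHE_FILE}.$$"
    if "$DU_BIN" -sh "$repo" > "$tmp" 2>/dev/null; then
        mv -f "$tmp" "$CACHE_FILE"
    else
        rm -f "$tmp"          # keep the previous value if du failed
        return 1
    fi
}

# Fast part: print the last stored value without touching the share.
read_du_cache() {
    [ -r "$CACHE_FILE" ] || return 1
    cat "$CACHE_FILE"
}
```

The refresh would be scheduled every few hours, while the page load only calls `read_du_cache`, so the displayed size is at most one refresh interval old.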

Darek.

Posted: **Fri Nov 06, 2020 4:00 pm**

This 'du -sh' behavior is hardcoded so I'll have to ping dev to see if we can change it or provide a patch. Please ping me again on Monday if I don't follow up before then.

In the meantime, try moving the /usr/bin/du binary found on the NLS system. NLS expects du to be in this directory, so if we move it, it may help with the timeouts.

Posted: **Mon Nov 09, 2020 3:17 am**

The temporary workaround seems to work (remove/move the du command): the admin page shows up, just the info about storage usage is gone (as expected).

However, I can't make the solution permanent, since a few ext and xfs checks in my Nagios XI are based on du. So whenever I want to open the admin page I'll temporarily rename du.

Good enough while waiting for a permanent solution.

Posted: **Mon Nov 09, 2020 3:53 am**

A minor workaround, better than moving/deleting du:

rename du to du2 and create a small script /usr/bin/du like below:

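#!/bin/sh
#
# info:
#   minor workaround for the du problem with a big geo backup repo
#   run: mv -v /usr/bin/du /usr/bin/du2
#   create the /usr/bin/du script as below
#   run: chmod +x /usr/bin/du
#   and done.

repo_path="/mnt/snapshot_repository"

# Nagios LS calls "du -sh <repo_path>", so the path arrives as the second argument.
# The quoting keeps the test valid when du is called with fewer than two arguments
# or with paths containing spaces.
if [ "$#" -ge 2 ] && [ "$2" = "$repo_path" ]; then
    echo "Do nothing - Nagios LS workaround."
else
    # Every other call goes to the real du with its arguments unchanged.
    exec /usr/bin/du2 "$@"
fi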

Darek.

Posted: **Mon Nov 09, 2020 4:58 pm**

Thanks for the follow-up and the additional workaround. I also sent you a patch by PM. Please test it and let me know if that works for you.

snapshots – problem of large, geo backup repo
