> Is there any 'special' configuration that you can think of on your machine? Any customizations? What does the load look like?
I can't think of anything really special on these systems. We do run an iptables firewall (required by policy), but it's set to log anything that gets rejected, and nothing is hitting it. I also temporarily disabled the firewall on both nodes, and the problem still occurs.
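For what it's worth, the logging side of that firewall is just the usual pattern of a LOG rule right before the final REJECT. Roughly like this (a simplified sketch, not our exact ruleset), with the packet counters confirming nothing is matching:
Code:
# Log anything about to be rejected, then reject it
iptables -A INPUT -j LOG --log-prefix "iptables-reject: "
iptables -A INPUT -j REJECT --reject-with icmp-host-prohibited

# Check the packet/byte counters on those rules; ours show zero hits
iptables -L INPUT -n -v --line-numbers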
Here's Node 1:
Code:
top - 09:57:07 up 20 days, 23:13, 2 users, load average: 0.08, 0.12, 0.09
Tasks: 269 total, 1 running, 268 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 0.9%sy, 0.1%ni, 96.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32876496k total, 18012912k used, 14863584k free, 727380k buffers
Swap: 1023996k total, 0k used, 1023996k free, 7479188k cached
[root@gtcs-nls01 1430318281]# free -m
             total       used       free     shared    buffers     cached
Mem:         32105      17590      14515          0        710       7304
-/+ buffers/cache:       9576      22529
Swap:          999          0        999
[root@gtcs-nls01 1430318281]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-root
                      3.9G  544M  3.1G  15% /
tmpfs                  16G     0   16G   0% /dev/shm
/dev/sda1             194M   95M   89M  52% /boot
/dev/mapper/vg00-home
                      2.0G  250M  1.6G  14% /home
/dev/mapper/vg00-opt  3.9G  128M  3.5G   4% /opt
/dev/mapper/vg00-tmp  2.0G  538M  1.3G  30% /tmp
/dev/mapper/vg00-usr  3.9G  2.2G  1.5G  61% /usr
/dev/mapper/vg00-var  3.9G  1.1G  2.6G  31% /var
/dev/mapper/vg01-srv 1004G  2.2G  951G   1% /srv
hpc-archive:/archive/nagioslogserver
                       20T   17T  2.8T  87% /arcfiniti/nagioslogserver
And here's Node 2:
Code:
top - 10:02:49 up 20 days, 23:18, 2 users, load average: 0.10, 0.10, 0.09
Tasks: 267 total, 1 running, 266 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 0.9%sy, 0.1%ni, 96.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32876496k total, 18008764k used, 14867732k free, 727380k buffers
Swap: 1023996k total, 0k used, 1023996k free, 7473512k cached
[root@gtcs-nls01 sysconfig]# free -m
             total       used       free     shared    buffers     cached
Mem:         32105      17587      14518          0        710       7298
-/+ buffers/cache:       9578      22527
Swap:          999          0        999
[root@gtcs-nls01 sysconfig]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-root
                      3.9G  544M  3.1G  15% /
tmpfs                  16G     0   16G   0% /dev/shm
/dev/sda1             194M   95M   89M  52% /boot
/dev/mapper/vg00-home
                      2.0G  250M  1.6G  14% /home
/dev/mapper/vg00-opt  3.9G  128M  3.5G   4% /opt
/dev/mapper/vg00-tmp  2.0G  538M  1.3G  30% /tmp
/dev/mapper/vg00-usr  3.9G  2.2G  1.5G  61% /usr
/dev/mapper/vg00-var  3.9G  1.1G  2.6G  31% /var
/dev/mapper/vg01-srv 1004G  2.2G  951G   1% /srv
hpc-archive:/archive/nagioslogserver
                       20T   17T  2.8T  87% /arcfiniti/nagioslogserver
> It's unusual for these backups to stick because typically config backups are rather small. What size are yours?
Not big:
Code:
[root@gtcs-nls01 sysconfig]# ls -l -h /store/backups/nagioslogserver/
total 55M
drwxrwxrwx 2 nagios users 4.0K Apr 29 09:37 1430318242
drwxrwxrwx 2 nagios users 4.0K Apr 29 09:38 1430318281
-rw-r--r-- 1 nagios users 5.4M Apr 28 15:16 nagioslogserver.2015-04-17.1429281426.tar.gz
-rw-r--r-- 1 nagios users 5.5M Apr 28 15:16 nagioslogserver.2015-04-18.1429367881.tar.gz
-rw-r--r-- 1 nagios users 5.8M Apr 28 15:16 nagioslogserver.2015-04-20.1429540631.tar.gz
-rw-r--r-- 1 nagios users 6.0M Apr 28 15:16 nagioslogserver.2015-04-21.1429627031.tar.gz
-rw-r--r-- 1 nagios users 6.1M Apr 28 15:16 nagioslogserver.2015-04-22.1429713431.tar.gz
-rw-r--r-- 1 nagios users 6.2M Apr 28 15:16 nagioslogserver.2015-04-23.1429799837.tar.gz
-rw-r--r-- 1 nagios users 6.4M Apr 28 15:16 nagioslogserver.2015-04-24.1429886236.tar.gz
-rw-r--r-- 1 nagios users 6.8M Apr 28 15:16 nagioslogserver.2015-04-27.1430145446.tar.gz
-rw-r--r-- 1 nagios users 7.0M Apr 28 15:16 nagioslogserver.2015-04-28.1430231842.tar.gz
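One thing I notice in that listing: the completed backups are all .tar.gz files, but there are two timestamped directories at the top, which look like the stuck runs that never got packed up. I can poke inside one to see how far it got (the 1430318281 path below is just the directory from the listing above):
Code:
# Inspect a stuck (un-tarred) backup directory to see how far it got
ls -lR /store/backups/nagioslogserver/1430318281

# Compare sizes of the two stuck runs against the finished tarballs
du -sh /store/backups/nagioslogserver/14303182*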
> Is there anything notable in your cron log?
Just a whole bunch of lines like this:
Code:
Apr 29 10:03:01 gtcs-nls01 CROND[6657]: (nagios) CMD (/usr/bin/php -q /var/www/html/nagioslogserver/www/index.php poller > /usr/local/nagioslogserver/var/poller.log 2>&1)
And some of these:
Code:
Apr 29 09:10:01 gtcs-nls01 CROND[26943]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 09:20:01 gtcs-nls01 CROND[28262]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 09:30:01 gtcs-nls01 CROND[29675]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 09:40:01 gtcs-nls01 CROND[31394]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 09:50:01 gtcs-nls01 CROND[1934]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 10:00:01 gtcs-nls01 CROND[5808]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 10:01:01 gtcs-nls01 CROND[6068]: (root) CMD (run-parts /etc/cron.hourly)
Apr 29 10:01:01 gtcs-nls01 run-parts(/etc/cron.hourly)[6068]: starting 0anacron
Apr 29 10:01:01 gtcs-nls01 run-parts(/etc/cron.hourly)[6077]: finished 0anacron
Looks pretty harmless.
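Since that cron entry redirects the poller's output to a log, I can also check there, and make sure poller runs aren't piling up on top of each other (paths taken straight from the cron line above):
Code:
# Last output from the poller job
tail -n 50 /usr/local/nagioslogserver/var/poller.log

# Any poller processes left over from earlier runs?
ps -ef | grep '[i]ndex.php poller'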
> The Elasticsearch errors are a great start, but why these backups are stuck in the first place is what I'm curious about. Is it true that your backups are still residing on the local disks of your instances?
Yes, /store/backups/nagioslogserver/ is on local disk on each node. The other kind of backup - the one that uses Elasticsearch snapshot/restore to back up the actual data - goes to a shared filesystem (that's the hpc-archive:/archive/nagioslogserver mount in the df output above). That one works fine.
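If it helps, I can confirm the snapshot side straight from the Elasticsearch API (this assumes Elasticsearch is listening on the default port 9200; repository names will vary):
Code:
# List the registered snapshot repositories
curl -s 'http://localhost:9200/_snapshot/_all?pretty'

# Show any snapshots currently in progress
curl -s 'http://localhost:9200/_snapshot/_status?pretty'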