create_backup.sh hanging

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
jvestrum
Posts: 46
Joined: Tue Mar 03, 2015 10:45 am

Re: create_backup.sh hanging

Post by jvestrum »

jolson wrote:Could you please try rebooting the servers? I think that we have all of the information we can collect regarding this issue, and a reboot may very well resolve it.
So, I rebooted both instances, one at a time, and the stuck exports _still_ persisted. But then I had another idea - all this time I've been restarting elasticsearch one instance at a time, always leaving one running. So instead I stopped elasticsearch on both instances at the same time, and then brought them back up. That did the trick - the stuck exports are gone, and create_backup.sh runs successfully.
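Roughly, the sequence that worked can be sketched like this (placeholder hostnames and an overridable RUN command so it can be dry-run; this is an illustration, not the exact script used):

```shell
# Sketch of the simultaneous restart: stop elasticsearch on every node
# first, then start them all, so no still-running node carries the stuck
# knapsack state over. RUN can be overridden (e.g. RUN=echo) for a dry run.
restart_all() {
  run=${RUN:-ssh}
  for host in "$@"; do
    $run root@"$host" 'service elasticsearch stop'
  done
  for host in "$@"; do
    $run root@"$host" 'service elasticsearch start'
  done
}

# Example: restart_all nls-node1 nls-node2
```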

I'll keep an eye on it and see if the problem comes back over time, but for now everything looks okay.
cmerchant
Posts: 546
Joined: Wed Sep 24, 2014 11:19 am

Re: create_backup.sh hanging

Post by cmerchant »

That's interesting. Keeping one instance up at a time would have preserved the stuck backup state across the restarts. I'll keep this thread open - keep us updated.
jvestrum
Posts: 46
Joined: Tue Mar 03, 2015 10:45 am

Re: create_backup.sh hanging

Post by jvestrum »

Last night's scheduled backup went off without a hitch and no stuck exports.

Code: Select all

# curl -XPOST 'http://localhost:9200/_export/state'
{"count":0,"states":[]}
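If anyone wants to script a quick check of that state endpoint, a minimal sketch (the response is hardcoded here for illustration; in practice you would capture the curl output shown above):

```shell
# Minimal check for stuck exports: pull the "count" field out of the
# _export/state response with sed and warn when it is non-zero.
# Normally you would fetch it live:
#   state=$(curl -s -XPOST 'http://localhost:9200/_export/state')
state='{"count":0,"states":[]}'
count=$(printf '%s' "$state" | sed -n 's/.*"count":\([0-9]*\).*/\1/p')
if [ "$count" -gt 0 ]; then
  echo "WARNING: $count stuck exports"
else
  echo "OK: no stuck exports"
fi
```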
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: create_backup.sh hanging

Post by jolson »

Excellent - thank you very much for reporting in. Glad to hear everything is okay.
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
jvestrum
Posts: 46
Joined: Tue Mar 03, 2015 10:45 am

Re: create_backup.sh hanging

Post by jvestrum »

jolson wrote:Excellent - thank you very much for reporting in. Glad to hear everything is okay.
I'm afraid the problem has come back.

Code: Select all

# ps -ef| grep backup
nagios    2471  2381  0 Apr19 ?        00:12:17 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios    3066  2519  0 Apr27 ?        00:00:48 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios    3110  2983  0 Apr20 ?        00:10:48 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios    8913  8790  0 Apr22 ?        00:07:54 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios    9295  9080  0 Apr24 ?        00:05:02 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios   11418 11012  0 09:37 ?        00:00:04 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios   12881 12671  0 Apr23 ?        00:06:29 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios   17436 17180  0 Apr26 ?        00:02:04 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios   22100 21777  0 Apr25 ?        00:03:33 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios   22493 22453  0 Apr18 ?        00:13:51 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios   22851 22841  0 Apr18 ?        00:13:44 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios   23500 23476  0 Apr17 ?        00:15:22 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
nagios   23678 23667  0 Apr17 ?        00:15:15 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
root     23705 16293  0 13:33 pts/1    00:00:00 grep backup
nagios   27115 27007  0 Apr21 ?        00:09:23 /bin/sh /usr/local/nagioslogserver/scripts/create_backup.sh
It was okay for a week and then another knapsack export got stuck:

Code: Select all

# curl -s -XPOST 'http://localhost:9200/_export/state'
{"count":6,"states":[{"mode":"export","started":"2015-04-17T14:37:06.884Z","path":"file:///store/backups/nagioslogserver/1429281426/kibana-int.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-04-18T14:37:07.379Z","path":"file:///store/backups/nagioslogserver/1429367827/nagioslogserver.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-04-18T14:37:07.419Z","path":"file:///store/backups/nagioslogserver/1429367827/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-04-18T14:38:02.219Z","path":"file:///store/backups/nagioslogserver/1429367882/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-04-18T14:38:02.309Z","path":"file:///store/backups/nagioslogserver/1429367882/kibana-int.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-04-26T14:37:17.622Z","path":"file:///store/backups/nagioslogserver/1430059037/kibana-int.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"}]}
If I try to abort the exports (with curl -XPOST 'localhost:9200/_export/abort'), I find this in the elasticsearch log:

Code: Select all

[2015-04-28 13:52:26,773][INFO ][KnapsackService          ] aborting 51 tasks
[2015-04-28 13:52:26,776][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@10e3e6a5 could not be cancelled
[2015-04-28 13:52:26,776][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@623b634c could not be cancelled
[2015-04-28 13:52:26,776][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@28f444f6 could not be cancelled
[2015-04-28 13:52:26,776][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@3cdb877b could not be cancelled
[2015-04-28 13:52:26,776][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@709bbb73 could not be cancelled
[2015-04-28 13:52:26,776][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@2fb4c7 could not be cancelled
[2015-04-28 13:52:26,777][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@3c0574e9 could not be cancelled
[2015-04-28 13:52:26,777][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@a3e8bc8 could not be cancelled
[2015-04-28 13:52:26,777][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@104afcb9 could not be cancelled
[2015-04-28 13:52:26,777][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@2b15de0a could not be cancelled
[2015-04-28 13:52:26,777][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@28a4789f could not be cancelled
[2015-04-28 13:52:26,777][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@45eb2391 could not be cancelled
[2015-04-28 13:52:26,777][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@506c2873 could not be cancelled
[2015-04-28 13:52:26,778][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@6cd3bf44 could not be cancelled
[2015-04-28 13:52:26,778][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@41b2412d could not be cancelled
[2015-04-28 13:52:26,778][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@1dd5130d could not be cancelled
[2015-04-28 13:52:26,778][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@fd5c9c8 could not be cancelled
[2015-04-28 13:52:26,778][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@1cb27197 could not be cancelled
[2015-04-28 13:52:26,778][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@73f7f39 could not be cancelled
[2015-04-28 13:52:26,778][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@59b97ae6 could not be cancelled
[2015-04-28 13:52:26,778][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@2033c00f could not be cancelled
[2015-04-28 13:52:26,779][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@2586294d could not be cancelled
[2015-04-28 13:52:26,779][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@b01917a could not be cancelled
[2015-04-28 13:52:26,779][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@13f5e63b could not be cancelled
[2015-04-28 13:52:26,779][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@74de35b9 could not be cancelled
[2015-04-28 13:52:26,779][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@289940a0 could not be cancelled
[2015-04-28 13:52:26,779][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@6561dd32 could not be cancelled
[2015-04-28 13:52:26,779][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@7c01319d could not be cancelled
[2015-04-28 13:52:26,780][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@32607a05 could not be cancelled
[2015-04-28 13:52:26,780][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@5beaf819 could not be cancelled
[2015-04-28 13:52:26,780][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@1f466074 could not be cancelled
[2015-04-28 13:52:26,780][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@45666fb6 could not be cancelled
[2015-04-28 13:52:26,780][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@48303952 could not be cancelled
[2015-04-28 13:52:26,780][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@56035b84 could not be cancelled
[2015-04-28 13:52:26,780][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@766d6339 could not be cancelled
[2015-04-28 13:52:26,781][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@78d6fed could not be cancelled
[2015-04-28 13:52:26,781][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@56af347a could not be cancelled
[2015-04-28 13:52:26,781][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@89e620c could not be cancelled
[2015-04-28 13:52:26,781][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@5637063f could not be cancelled
[2015-04-28 13:52:26,781][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@367b3e51 could not be cancelled
[2015-04-28 13:52:26,781][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@553853c8 could not be cancelled
[2015-04-28 13:52:26,781][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@65049b2d could not be cancelled
[2015-04-28 13:52:26,782][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@1167db2b could not be cancelled
[2015-04-28 13:52:26,782][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@3965ecfa could not be cancelled
[2015-04-28 13:52:26,782][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@52a92e86 could not be cancelled
[2015-04-28 13:52:26,782][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@612d89cf could not be cancelled
[2015-04-28 13:52:26,782][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@74b2a6e0 could not be cancelled
[2015-04-28 13:52:26,782][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@7ce9f7fa could not be cancelled
[2015-04-28 13:52:26,783][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@64977e27 could not be cancelled
[2015-04-28 13:52:26,783][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@15db6609 could not be cancelled
[2015-04-28 13:52:26,783][ERROR][KnapsackService          ] task java.util.concurrent.FutureTask@7707e414 could not be cancelled
[2015-04-28 13:52:26,783][INFO ][KnapsackService          ] shutdown
[2015-04-28 13:52:26,784][INFO ][KnapsackService          ] shutdown now
[2015-04-28 13:52:26,784][INFO ][KnapsackService          ] task queue size was 0, setting new executor
For now, I'll keep working around it by periodically restarting elasticsearch.
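The temporary workaround could even go on a schedule; a purely hypothetical /etc/cron.d fragment (times made up, and per the behavior above it would need to fire on both nodes at roughly the same time to actually clear the stuck state):

```
# /etc/cron.d/es-weekly-restart (hypothetical workaround, not a fix)
# Bounce elasticsearch every Sunday at 03:30 before stuck exports pile up.
# Must run on both nodes together; restarting one node at a time
# leaves the stuck export state intact.
30 3 * * 0 root service elasticsearch restart
```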
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: create_backup.sh hanging

Post by jolson »

jvestrum,

I appreciate you reporting back. Restarting Elasticsearch can be a pain - so I'd like to help you find the root cause of this problem.

Is there any 'special' configuration that you can think of on your machine? Any customizations? What does the load look like?

Code: Select all

top | head -n5
free -m
df -h
It's unusual for these backups to stick because typically config backups are rather small. What size are yours?

Code: Select all

ls -l -h /store/backups/nagioslogserver/
Is there anything notable in your cron log?

Code: Select all

cat /var/log/cron
The Elasticsearch errors are a great start, but why these backups are stuck in the first place is what I'm curious about. Is it true that your backups are still residing on the local disks of your instances?
jvestrum
Posts: 46
Joined: Tue Mar 03, 2015 10:45 am

Re: create_backup.sh hanging

Post by jvestrum »

> Is there any 'special' configuration that you can think of on your machine? Any customizations? What does the load look like?

I can't think of anything really special on these systems. We do run an iptables firewall (required by policy), but it's set to log anything that gets rejected, and nothing is hitting it. I also temporarily disabled the firewall on both nodes, but the problem still occurs.

Here's Node 1:

Code: Select all

top - 09:57:07 up 20 days, 23:13,  2 users,  load average: 0.08, 0.12, 0.09
Tasks: 269 total,   1 running, 268 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.8%us,  0.9%sy,  0.1%ni, 96.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32876496k total, 18012912k used, 14863584k free,   727380k buffers
Swap:  1023996k total,        0k used,  1023996k free,  7479188k cached
[root@gtcs-nls01 1430318281]# free -m
             total       used       free     shared    buffers     cached
Mem:         32105      17590      14515          0        710       7304
-/+ buffers/cache:       9576      22529
Swap:          999          0        999
[root@gtcs-nls01 1430318281]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-root
                      3.9G  544M  3.1G  15% /
tmpfs                  16G     0   16G   0% /dev/shm
/dev/sda1             194M   95M   89M  52% /boot
/dev/mapper/vg00-home
                      2.0G  250M  1.6G  14% /home
/dev/mapper/vg00-opt  3.9G  128M  3.5G   4% /opt
/dev/mapper/vg00-tmp  2.0G  538M  1.3G  30% /tmp
/dev/mapper/vg00-usr  3.9G  2.2G  1.5G  61% /usr
/dev/mapper/vg00-var  3.9G  1.1G  2.6G  31% /var
/dev/mapper/vg01-srv 1004G  2.2G  951G   1% /srv
hpc-archive:/archive/nagioslogserver
                       20T   17T  2.8T  87% /arcfiniti/nagioslogserver
And here's Node 2:

Code: Select all

top - 10:02:49 up 20 days, 23:18,  2 users,  load average: 0.10, 0.10, 0.09
Tasks: 267 total,   1 running, 266 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.8%us,  0.9%sy,  0.1%ni, 96.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32876496k total, 18008764k used, 14867732k free,   727380k buffers
Swap:  1023996k total,        0k used,  1023996k free,  7473512k cached
[root@gtcs-nls01 sysconfig]# free -m
             total       used       free     shared    buffers     cached
Mem:         32105      17587      14518          0        710       7298
-/+ buffers/cache:       9578      22527
Swap:          999          0        999
[root@gtcs-nls01 sysconfig]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-root
                      3.9G  544M  3.1G  15% /
tmpfs                  16G     0   16G   0% /dev/shm
/dev/sda1             194M   95M   89M  52% /boot
/dev/mapper/vg00-home
                      2.0G  250M  1.6G  14% /home
/dev/mapper/vg00-opt  3.9G  128M  3.5G   4% /opt
/dev/mapper/vg00-tmp  2.0G  538M  1.3G  30% /tmp
/dev/mapper/vg00-usr  3.9G  2.2G  1.5G  61% /usr
/dev/mapper/vg00-var  3.9G  1.1G  2.6G  31% /var
/dev/mapper/vg01-srv 1004G  2.2G  951G   1% /srv
hpc-archive:/archive/nagioslogserver
                       20T   17T  2.8T  87% /arcfiniti/nagioslogserver
> It's unusual for these backups to stick because typically config backups are rather small. What size are yours?

Not big:

Code: Select all

[root@gtcs-nls01 sysconfig]# ls -l -h /store/backups/nagioslogserver/
total 55M
drwxrwxrwx 2 nagios users 4.0K Apr 29 09:37 1430318242
drwxrwxrwx 2 nagios users 4.0K Apr 29 09:38 1430318281
-rw-r--r-- 1 nagios users 5.4M Apr 28 15:16 nagioslogserver.2015-04-17.1429281426.tar.gz
-rw-r--r-- 1 nagios users 5.5M Apr 28 15:16 nagioslogserver.2015-04-18.1429367881.tar.gz
-rw-r--r-- 1 nagios users 5.8M Apr 28 15:16 nagioslogserver.2015-04-20.1429540631.tar.gz
-rw-r--r-- 1 nagios users 6.0M Apr 28 15:16 nagioslogserver.2015-04-21.1429627031.tar.gz
-rw-r--r-- 1 nagios users 6.1M Apr 28 15:16 nagioslogserver.2015-04-22.1429713431.tar.gz
-rw-r--r-- 1 nagios users 6.2M Apr 28 15:16 nagioslogserver.2015-04-23.1429799837.tar.gz
-rw-r--r-- 1 nagios users 6.4M Apr 28 15:16 nagioslogserver.2015-04-24.1429886236.tar.gz
-rw-r--r-- 1 nagios users 6.8M Apr 28 15:16 nagioslogserver.2015-04-27.1430145446.tar.gz
-rw-r--r-- 1 nagios users 7.0M Apr 28 15:16 nagioslogserver.2015-04-28.1430231842.tar.gz
> Is there anything notable in your cron log?

Just a whole bunch of lines like this:

Code: Select all

Apr 29 10:03:01 gtcs-nls01 CROND[6657]: (nagios) CMD (/usr/bin/php -q /var/www/html/nagioslogserver/www/index.php poller > /usr/local/nagioslogserver/var/poller.log 2>&1)
And some of these:

Code: Select all

Apr 29 09:10:01 gtcs-nls01 CROND[26943]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 09:20:01 gtcs-nls01 CROND[28262]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 09:30:01 gtcs-nls01 CROND[29675]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 09:40:01 gtcs-nls01 CROND[31394]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 09:50:01 gtcs-nls01 CROND[1934]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 10:00:01 gtcs-nls01 CROND[5808]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 29 10:01:01 gtcs-nls01 CROND[6068]: (root) CMD (run-parts /etc/cron.hourly)
Apr 29 10:01:01 gtcs-nls01 run-parts(/etc/cron.hourly)[6068]: starting 0anacron
Apr 29 10:01:01 gtcs-nls01 run-parts(/etc/cron.hourly)[6077]: finished 0anacron
Looks pretty harmless.

> The Elasticsearch errors are a great start, but why these backups are stuck in the first place is what I'm curious about. Is it true that your backups are still residing on the local disks of your instances?

Yes, /store/backups/nagioslogserver/ is on local disk on each node. The other kind of backup - the one that uses Elasticsearch snapshot/restore to back up the actual data - goes to a shared filesystem (that's the hpc-archive:/archive/nagioslogserver in the df output above). That one works fine.
jvestrum
Posts: 46
Joined: Tue Mar 03, 2015 10:45 am

Re: create_backup.sh hanging

Post by jvestrum »

One thing I found in the elasticsearch log, although it may be completely unrelated:

Code: Select all

[2015-04-29 11:27:06,547][WARN ][common.jna               ] Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out. Increase RLIMIT_MEMLOCK (ulimit).
This is due to a line in /etc/security/limits.d/98-nagioslogserver.conf that sets memlock to 524288. I now set it to "unlimited" and that error went away when I restarted elasticsearch.
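For anyone else hitting the ENOMEM warning, the change was a one-line edit; roughly like this (the user name and the original value are from this install and may differ elsewhere):

```
# /etc/security/limits.d/98-nagioslogserver.conf
# before:  nagios  -  memlock  524288
nagios  -  memlock  unlimited
```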
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: create_backup.sh hanging

Post by jolson »

Thank you for all of this information - I do not think that your memlock settings are related to this problem, but it's not out of the question. Let me know if create_backup.sh continues to hang - it would be very interesting if it does not. In the meantime, I'm trying to come up with a troubleshooting method to dig deeper. I'll let you know what I find out.
jvestrum
Posts: 46
Joined: Tue Mar 03, 2015 10:45 am

Re: create_backup.sh hanging

Post by jvestrum »

I at least figured out how to reproduce the issue on demand. Here are the steps:

1. Stop and restart elasticsearch on both nodes to clear out the state.
2. Run create_backup.sh on Node 1. It completes in 10 seconds or so.
3. Repeat step 2 any number of times on Node 1, it completes every time.
4. Then I run create_backup.sh on Node 2 - and it hangs the first time.

Then I repeated the above, swapping Node 1 and Node 2. Same results. I can run create_backup.sh over and over again, and it works fine - so long as I only run it on one node.

It's pulling the same data anyway, so I really only need it to run on one node (the resulting tarballs get picked up by another backup product, so they're safe). If you can't reproduce this on your end, I'll just disable the job in the Command Subsystem and run it on one node via cron instead.
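A sketch of what that single-node cron job might look like (schedule, log path, and cron.d filename are all made up):

```
# /etc/cron.d/nls-config-backup (hypothetical)
# Run the config backup nightly at 02:00 on Node 1 only,
# with the Command Subsystem job disabled on both nodes.
0 2 * * * nagios /usr/local/nagioslogserver/scripts/create_backup.sh >> /var/log/create_backup.log 2>&1
```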