create_backup.sh hanging

Posted: Mon Apr 06, 2015 10:19 am
by jvestrum
Hello,

I'm seeing the create_backup.sh hanging issue others have reported. It is stuck in the loop running curl -XPOST 'http://localhost:9200/_export/state'.

The state.json contains:

Code: Select all

{"count":15,"states":[{"mode":"export","started":"2015-03-13T20:31:37.073Z","path":"file:///store/backups/nagioslogserver/1426278696/nagioslogserver.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-13T20:31:37.173Z","path":"file:///store/backups/nagioslogserver/1426278696/kibana-int.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-14T20:31:37.669Z","path":"file:///store/backups/nagioslogserver/1426365097/nagioslogserver.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-16T20:31:37.416Z","path":"file:///store/backups/nagioslogserver/1426537897/nagioslogserver.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-16T20:31:37.516Z","path":"file:///store/backups/nagioslogserver/1426537897/kibana-int.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-16T20:31:37.913Z","path":"file:///store/backups/nagioslogserver/1426537897/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-03-16T20:31:37.967Z","path":"file:///store/backups/nagioslogserver/1426537897/kibana-int.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-03-17T20:31:38.594Z","path":"file:///store/backups/nagioslogserver/1426624298/nagioslogserver.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-17T20:31:38.894Z","path":"file:///store/backups/nagioslogserver/1426624298/nagioslogserver_log.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-03-28T20:31:51.720Z","path":"file:///store/backups/nagioslogserver/1427574711/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-03-29T20:31:52.050Z","path":"file:///store/backups/nagioslogserver/1427661111/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-04-01T20:32:02.487Z","path":"file:///store/backups/nagioslogserver/1427920322/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-04-01T20:32:02.657Z","path":"file:///store/backups/nagioslogserver/1427920322/kibana-int.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-04-02T20:32:06.817Z","path":"file:///store/backups/nagioslogserver/1428006726/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-04-02T20:32:06.968Z","path":"file:///store/backups/nagioslogserver/1428006726/kibana-int.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"}]}
I have tried running "curl -XPOST 'http://localhost:9200/_export/abort'" on both instances, and I have also restarted elasticsearch on both, but the knapsack state continues to report those 15 hung exports. The paths are world-writable, and /store/backups/nagioslogserver/ is on local disk on each instance.
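
For readers comparing their own state.json, here is a rough sketch for seeing how the lingering exports split across nodes. It assumes the compact single-line JSON layout shown in the post above; the helper name `exports_per_node` is made up for illustration and is not part of Nagios Log Server or Knapsack.

```shell
#!/usr/bin/env bash
# Sketch only: count the exports listed in a Knapsack state.json per node_name.
# Assumes entries look like "node_name":"<id>" as in the state.json posted above.
exports_per_node() {
    local state_file="$1"
    grep -o '"node_name":"[^"]*"' "$state_file" \
        | cut -d'"' -f4 \
        | sort \
        | uniq -c \
        | awk '{print $2, $1}'   # print "<node_name> <number of entries>"
}

# Example (hypothetical path): exports_per_node /store/backups/nagioslogserver/1428343251/state.json
```

If every entry belongs to a node that has since been restarted, that would suggest the list is stale bookkeeping rather than exports that are still running, though the thread does not confirm how Knapsack stores this state.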

Re: create_backup.sh hanging

Posted: Mon Apr 06, 2015 10:43 am
by jolson
Could you please post the output of:

Code: Select all

ps aux
It may show if several backups are already running. If so, we could kill them and try running another manually to see if it finishes.

Re: create_backup.sh hanging

Posted: Mon Apr 06, 2015 1:23 pm
by jvestrum
jolson wrote: Could you please post the output of:

Code: Select all

ps aux
It may show if several backups are already running. If so, we could kill them and try running another manually to see if it finishes.
When I restarted elasticsearch, all the running create_backup.sh processes exited: the restart takes more than 5 seconds, so the curl in the script's "sleep 5" loop fails and the loop ends. So none were left running. I started another one manually and it's stuck in that same loop again:

Code: Select all

[nagios@host ~]$ /usr/local/nagioslogserver/scripts/create_backup.sh 
Starting Nagios Log Server Backup
---------------------------------
Backing up indexes.nagioslogserver ... kibana-int ... nagioslogserver_log ... Completed.
Waiting for backup. This may take a while.
.........................................................................................................................................................
Here's my ps aux listing:
https://gist.github.com/jvestrum/01135f1c67fb1bde183f

There's plenty of free space:

Code: Select all

[nagios@host scripts]$ df -h /store/backups/nagioslogserver/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-root
                      3.9G  489M  3.2G  14% /
Here's the backup directory, /store/backups/nagioslogserver/1428343251/. It looks like the files were successfully written, and haven't changed in 18 minutes. But the state.json still looks almost the same as I posted before, except it's up to count:16 now instead of 15.

Code: Select all

[nagios@host 1428343251]$ ls -lrt
total 4280
-rw-r--r-- 1 nagios users    6243 Apr  6 13:00 nagioslogserver.tar.gz
-rw-r--r-- 1 nagios users    2949 Apr  6 13:00 kibana-int.tar.gz
-rw-r--r-- 1 nagios users 4362246 Apr  6 13:00 nagioslogserver_log.tar.gz
-rw-r--r-- 1 nagios users    2994 Apr  6 13:18 state.json
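
The early exit described in this post (the wait loop ending while elasticsearch is down) can be reproduced in isolation. This is a minimal sketch, assuming the loop condition is `[[ $count -gt 0 ]]` as in the script posted later in this thread, and assuming a failed curl leaves state.json empty so that the jsonselect call prints nothing. `loop_would_continue` is a hypothetical helper used only for this demonstration.

```shell
#!/usr/bin/env bash
# Sketch only: mimics the wait-loop condition [[ $count -gt 0 ]].
# In bash, an empty operand in an arithmetic comparison evaluates to 0.
loop_would_continue() {
    local count="$1"
    if [[ $count -gt 0 ]]; then
        echo "continue"
    else
        echo "exit"
    fi
}

loop_would_continue 16   # prints "continue": hung exports keep the loop going
loop_would_continue 0    # prints "exit": normal completion
loop_would_continue ""   # prints "exit": empty count, e.g. after a failed curl
```

So the scripts exiting during the restart does not by itself show that the exports finished; it only shows that the count could not be read.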

Re: create_backup.sh hanging

Posted: Mon Apr 06, 2015 1:43 pm
by jolson
It's interesting that your backup is hanging. I understand that you said the directory was world-writable, but could you run a recursive permissions check and compare the results to mine? I have verified that mine is working properly.

Code: Select all

ls -d /store ; ls -lR /store
My output:
ls -d /store ; ls -lR /store
/store
/store:
total 4
drwxr-xr-x. 3 nagios nagios 4096 Feb 12 18:24 backups

/store/backups:
total 4
drwxr-xr-x. 2 nagios nagios 4096 Apr 6 14:39 nagioslogserver

/store/backups/nagioslogserver:
total 68
-rw-r--r-- 1 root root 10550 Apr 6 11:42 nagioslogserver.2015-04-06.1428334936.tar.gz
-rw-r--r-- 1 root root 10576 Apr 6 11:42 nagioslogserver.2015-04-06.1428334951.tar.gz
-rw-r--r-- 1 root root 10611 Apr 6 11:43 nagioslogserver.2015-04-06.1428334973.tar.gz
-rw-r--r-- 1 root root 29912 Apr 6 14:39 nagioslogserver.2015-04-06.1428345571.tar.gz
I also notice that the backup script is running under the nagios user. What happens if you run the script as root?

Code: Select all

bash /usr/local/nagioslogserver/scripts/create_backup.sh

Re: create_backup.sh hanging

Posted: Mon Apr 06, 2015 2:12 pm
by jvestrum
I cancelled the backup and re-ran it as root, but it's still stuck in the same place.

Elasticsearch (which is running as user nagios) is successfully writing the files, as you can see below. I just need to figure out how to tell Knapsack to clear out those 16 hanging exports, so it can break out of that wait loop.
[a5cltzz@gtcs-nls01 ~]$ ls -d /store; ls -lR /store
/store
/store:
total 4
drwxr-xr-x 3 nagios nagios 4096 Mar 12 11:13 backups

/store/backups:
total 4
drwxr-xr-x 6 nagios nagios 4096 Apr 6 13:54 nagioslogserver

/store/backups/nagioslogserver:
total 7552
drwxrwxrwx 2 nagios users 4096 Apr 6 09:49 1428331795
drwxrwxrwx 2 nagios users 4096 Apr 6 13:00 1428343251
drwxrwxrwx 2 root root 4096 Apr 6 13:54 1428346483
-rw-r--r-- 1 nagios users 3857481 Apr 6 09:27 nagioslogserver.2015-04-06.1428329678.tar.gz
-rw-r--r-- 1 nagios users 3860690 Apr 6 09:43 nagioslogserver.2015-04-06.1428330449.tar.gz

/store/backups/nagioslogserver/1428331795:
total 4256
-rw-r--r-- 1 nagios users 2949 Apr 6 09:49 kibana-int.tar.gz
-rw-r--r-- 1 nagios users 4338537 Apr 6 09:50 nagioslogserver_log.tar.gz
-rw-r--r-- 1 nagios users 7594 Apr 6 09:49 nagioslogserver.tar.gz
-rw-r--r-- 1 nagios users 2807 Apr 6 09:50 state.json

/store/backups/nagioslogserver/1428343251:
total 4280
-rw-r--r-- 1 nagios users 2949 Apr 6 13:00 kibana-int.tar.gz
-rw-r--r-- 1 nagios users 4362246 Apr 6 13:00 nagioslogserver_log.tar.gz
-rw-r--r-- 1 nagios users 6243 Apr 6 13:00 nagioslogserver.tar.gz
-rw-r--r-- 1 nagios users 2994 Apr 6 13:54 state.json

/store/backups/nagioslogserver/1428346483:
total 4284
-rw-r--r-- 1 nagios users 2954 Apr 6 13:54 kibana-int.tar.gz
-rw-r--r-- 1 nagios users 4368903 Apr 6 13:54 nagioslogserver_log.tar.gz
-rw-r--r-- 1 nagios users 6235 Apr 6 13:54 nagioslogserver.tar.gz
-rw-r--r-- 1 root root 2994 Apr 6 13:55 state.json

Re: create_backup.sh hanging

Posted: Tue Apr 07, 2015 11:23 am
by tmcdonald
Can we get your knapsack/other plugins' versions?

Code: Select all

curl 'localhost:9200/_cat/plugins?v'
For reference, this is me:

Code: Select all

knapsack-1.3.2.0-d5501ef

Re: create_backup.sh hanging

Posted: Wed Apr 08, 2015 8:53 am
by jvestrum
tmcdonald wrote: Can we get your knapsack/other plugins' versions?

Code: Select all

curl 'localhost:9200/_cat/plugins?v'
For reference, this is me:

Code: Select all

knapsack-1.3.2.0-d5501ef
Same version:

Code: Select all

name                                 component                version type url 
a6a1ee31-789f-4927-8680-25814f651b54 knapsack-1.3.2.0-d5501ef 1.3.2.0 j        
fd218450-44e4-4ed2-805a-74c1a72a2b63 knapsack-1.3.2.0-d5501ef 1.3.2.0 j        

Re: create_backup.sh hanging

Posted: Wed Apr 08, 2015 9:47 am
by jolson
I have been trying to reproduce this issue in-house with no success.

What happens if you navigate to the backup directory (containing state.json) and run the following:

Code: Select all

python -m jsonselect.__main__ .count < state.json
Does it still show a high number of export jobs?

I have found that if the backup script is interrupted, an export will hang until elasticsearch restarts - this prevents further backups from occurring.

Could you please cat your backup script so that we can ensure there aren't any differences?

Code: Select all

cat /usr/local/nagioslogserver/scripts/create_backup.sh
I am still investigating this on my end - aborts are not working as they should be. The command you mentioned in your first post:

Code: Select all

curl -XPOST 'http://localhost:9200/_export/abort'
does not abort the running exports properly. I will investigate further.
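
To check whether an abort or a restart changed anything, the count can be read before and after without the jsonselect module. A small sketch, assuming the state response starts with a top-level "count" field as in the state.json posted earlier; `state_count` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: read state JSON on stdin and print the first "count" value.
# Assumes the {"count":N,"states":[...]} layout seen earlier in this thread.
state_count() {
    grep -o '"count":[0-9]*' | head -n 1 | cut -d: -f2
}

# Example, using the state endpoint already quoted in this thread:
#   curl -s -XPOST 'http://localhost:9200/_export/state' | state_count
```

If the number is the same before and after the abort call, the abort did not clear the listed exports.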

Re: create_backup.sh hanging

Posted: Wed Apr 08, 2015 10:04 am
by jvestrum
jolson wrote: I have been trying to reproduce this issue in-house with no success.

What happens if you navigate to the backup directory (containing state.json) and run the following:

Code: Select all

python -m jsonselect.__main__ .count < state.json
Does it still show a high number of export jobs?
Yes, it's now up to 21:

Code: Select all

#     python -m jsonselect.__main__ .count < state.json
21
jolson wrote: I have found that if the backup script is interrupted, an export will hang until elasticsearch restarts - this prevents further backups from occurring.

Could you please cat your backup script so that we can ensure there aren't any differences?

Code: Select all

cat /usr/local/nagioslogserver/scripts/create_backup.sh
Yes, here it is. I haven't modified it (as far as I remember :).

Code: Select all

# cat /usr/local/nagioslogserver/scripts/create_backup.sh
#!/bin/sh
#
# Bash script for creating Nagios Log Server backups
# Copyright 2014 - Nagios Enterprises LLC
#
# These backups are used to store the main databases for Nagios Log Server including the kibana
# database, log server's internal database, and log server's internal log database
#

INDEXNAMES=( "nagioslogserver" "kibana-int" "nagioslogserver_log" )
LOGSERVER_DIR="/usr/local/nagioslogserver"
BACKUP_DIR="/store/backups/nagioslogserver"
TIMESTAMP=$(date +%s)
DATE=$(date +%F)

# Create mapping files with the index mapping data
echo "Starting Nagios Log Server Backup"
echo "---------------------------------"
mkdir -p "$BACKUP_DIR/$TIMESTAMP"
chmod 777 "$BACKUP_DIR/$TIMESTAMP"

# Create a backup of each of the indexes and store them in our temp directory
echo -n "Backing up indexes."
cd "$BACKUP_DIR/$TIMESTAMP"
for index in "${INDEXNAMES[@]}"; do
    echo -n "$index ... "
    curl -XPOST http://localhost:9200/$index/_export?path=$BACKUP_DIR/$TIMESTAMP/$index.tar.gz > /dev/null 2>&1
done
echo "Completed."

# Wait for elasticsearch export jobs to finish...
echo "Waiting for backup. This may take a while."
count=3
while [[ $count -gt 0 ]]; do
        curl -s -XPOST 'http://localhost:9200/_export/state' > state.json
        count=$(python -m jsonselect.__main__ .count < state.json)
        echo -n "."
        sleep 5
done

# Compress entire directory into a single file
rm -rf state.json
cd $BACKUP_DIR
dirname="nagioslogserver.$DATE.$TIMESTAMP"
mv $TIMESTAMP $dirname
tar czf "$BACKUP_DIR/$dirname.tar.gz" $dirname
rm -rf $dirname

echo ""
echo "Backup completed."
jolson wrote: I am still investigating this on my end - aborts are not working as they should be. The command you mentioned in your first post:

Code: Select all

curl -XPOST 'http://localhost:9200/_export/abort'
does not abort the running exports properly. I will investigate further.
Yeah, I've restarted elasticsearch on both instances, both before and after running that "abort" command, but the hung exports still don't go away. I have not yet tried rebooting the servers, but I could do that anytime.
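
The wait loop in the script above keeps going while the global count is above zero, so stale entries from old backups block every new run. One possible workaround, offered as an untested sketch and not as a fix from Nagios, is to count only the entries whose path lies under the current backup directory and to give up after a timeout. It assumes the "path":"file://<dir>/<index>.tar.gz" layout seen in state.json, and it assumes Knapsack removes an entry once that export finishes, which this thread does not confirm. `pending_for_backup`, `wait_for_backup` and the `fetch_state` argument are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: count state entries whose path is under one backup directory.
# The directory is used inside a grep pattern, so it should not contain
# characters that are special in regular expressions.
pending_for_backup() {
    local state_file="$1" backup_path="$2"
    grep -o "\"path\":\"file://${backup_path}/[^\"]*\"" "$state_file" | wc -l | tr -d ' '
}

# Poll until no entry for this backup remains (return 0) or the timeout
# expires (return 1). fetch_state is a caller-supplied command or function
# that writes the current state JSON to stdout.
wait_for_backup() {
    local backup_path="$1" timeout_secs="$2" interval="$3" fetch_state="$4"
    local waited=0 state_file
    state_file=$(mktemp)
    while :; do
        $fetch_state > "$state_file"
        if [ "$(pending_for_backup "$state_file" "$backup_path")" -eq 0 ]; then
            rm -f "$state_file"
            return 0
        fi
        if [ "$waited" -ge "$timeout_secs" ]; then
            rm -f "$state_file"
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Example, reusing the variables and endpoint from create_backup.sh:
#   fetch() { curl -s -XPOST 'http://localhost:9200/_export/state'; }
#   wait_for_backup "$BACKUP_DIR/$TIMESTAMP" 600 5 fetch || echo "Timed out waiting for export"
```

As with the original loop, an empty response (for example while elasticsearch is restarting) is treated as "nothing pending", so this sketch does not distinguish a finished export from an unreachable node.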

Re: create_backup.sh hanging

Posted: Wed Apr 08, 2015 10:07 am
by jolson
Could you please try rebooting the servers? I think that we have all of the information we can collect regarding this issue, and a reboot may very well resolve it.