create_backup.sh hanging

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
jvestrum
Posts: 46
Joined: Tue Mar 03, 2015 10:45 am

create_backup.sh hanging

Post by jvestrum »

Hello,

I'm seeing the create_backup.sh hanging issue others have reported. It is stuck in the loop running curl -XPOST 'http://localhost:9200/_export/state'.

The state.json contains:

Code: Select all

{"count":15,"states":[{"mode":"export","started":"2015-03-13T20:31:37.073Z","path":"file:///store/backups/nagioslogserver/1426278696/nagioslogserver.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-13T20:31:37.173Z","path":"file:///store/backups/nagioslogserver/1426278696/kibana-int.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-14T20:31:37.669Z","path":"file:///store/backups/nagioslogserver/1426365097/nagioslogserver.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-16T20:31:37.416Z","path":"file:///store/backups/nagioslogserver/1426537897/nagioslogserver.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-16T20:31:37.516Z","path":"file:///store/backups/nagioslogserver/1426537897/kibana-int.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-16T20:31:37.913Z","path":"file:///store/backups/nagioslogserver/1426537897/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-03-16T20:31:37.967Z","path":"file:///store/backups/nagioslogserver/1426537897/kibana-int.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-03-17T20:31:38.594Z","path":"file:///store/backups/nagioslogserver/1426624298/nagioslogserver.tar.gz","node_name":"a6a1ee31-789f-4927-8680-25814f651b54"},{"mode":"export","started":"2015-03-17T20:31:38.894Z","path":"file:///store/backups/nagioslogserver/1426624298/nagioslogserver_log.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-03-28T20:31:51.720Z","path":"file:///store/backups/nagioslogserver/1427574711/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-03-29T20:31:52.050Z","path":"file:///store/backups/nagioslogserver/1427661111/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-04-01T20:32:02.487Z","path":"file:///store/backups/nagioslogserver/1427920322/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-04-01T20:32:02.657Z","path":"file:///store/backups/nagioslogserver/1427920322/kibana-int.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-04-02T20:32:06.817Z","path":"file:///store/backups/nagioslogserver/1428006726/nagioslogserver.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"},{"mode":"export","started":"2015-04-02T20:32:06.968Z","path":"file:///store/backups/nagioslogserver/1428006726/kibana-int.tar.gz","node_name":"fd218450-44e4-4ed2-805a-74c1a72a2b63"}]}
I have tried running "curl -XPOST 'http://localhost:9200/_export/abort'" on both instances, and also restarted elasticsearch on both instances, but the knapsack state continues to report those 15 hung exports. The paths are world-writable, and /store/backups/nagioslogserver/ is local disk on each instance.
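A quick way to see which node each lingering export belongs to is to tally the node_name fields in state.json. This is only a rough sketch: the function name exports_per_node is made up for illustration, and it matches text with grep rather than parsing JSON, so it assumes the compact format shown in the post.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nagios Log Server or Knapsack).
# Prints one line per node: "<node_name> <number of exports listed>".
# Plain text matching, not a JSON parser: it relies on the compact
# "node_name":"<id>" form seen in the posted state.json.
exports_per_node() {
    local state_file="$1"
    grep -o '"node_name":"[^"]*"' "$state_file" \
        | cut -d'"' -f4 \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Example: exports_per_node state.json
```

If one node accounts for most of the entries, that node may be the better place to start looking.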
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: create_backup.sh hanging

Post by jolson »

Could you please post the output of:

Code: Select all

ps aux
It may show if several backups are already running. If so, we could kill them and try running another manually to see if it finishes.
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
jvestrum
Posts: 46
Joined: Tue Mar 03, 2015 10:45 am

Re: create_backup.sh hanging

Post by jvestrum »

jolson wrote:Could you please post the output of:

Code: Select all

ps aux
It may show if several backups are already running. If so, we could kill them and try running another manually to see if it finishes.
When I restarted elasticsearch, all the running create_backup.sh processes exited: the restart takes more than 5 seconds, so the curl fails and the script breaks out of its "sleep 5" loop. None were left running after that. I started another one manually and it's stuck in that same loop again:

Code: Select all

[nagios@host ~]$ /usr/local/nagioslogserver/scripts/create_backup.sh 
Starting Nagios Log Server Backup
---------------------------------
Backing up indexes.nagioslogserver ... kibana-int ... nagioslogserver_log ... Completed.
Waiting for backup. This may take a while.
.........................................................................................................................................................
Here's my ps aux listing:
https://gist.github.com/jvestrum/01135f1c67fb1bde183f

There's plenty of free space:

Code: Select all

[nagios@host scripts]$ df -h /store/backups/nagioslogserver/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-root
                      3.9G  489M  3.2G  14% /
Here's the backup directory, /store/backups/nagioslogserver/1428343251/. It looks like the files were successfully written, and haven't changed in 18 minutes. But the state.json still looks almost the same as I posted before, except it's up to count:16 now instead of 15.

Code: Select all

[nagios@host 1428343251]$ ls -lrt
total 4280
-rw-r--r-- 1 nagios users    6243 Apr  6 13:00 nagioslogserver.tar.gz
-rw-r--r-- 1 nagios users    2949 Apr  6 13:00 kibana-int.tar.gz
-rw-r--r-- 1 nagios users 4362246 Apr  6 13:00 nagioslogserver_log.tar.gz
-rw-r--r-- 1 nagios users    2994 Apr  6 13:18 state.json
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: create_backup.sh hanging

Post by jolson »

It's interesting that your backup is hanging. I understand that you said the directory was world-writable; can you perform a recursive permissions check and compare the results to mine? I have verified that mine is working properly.

Code: Select all

ls -d /store ; ls -lR /store
My output:
ls -d /store ; ls -lR /store
/store
/store:
total 4
drwxr-xr-x. 3 nagios nagios 4096 Feb 12 18:24 backups

/store/backups:
total 4
drwxr-xr-x. 2 nagios nagios 4096 Apr 6 14:39 nagioslogserver

/store/backups/nagioslogserver:
total 68
-rw-r--r-- 1 root root 10550 Apr 6 11:42 nagioslogserver.2015-04-06.1428334936.tar.gz
-rw-r--r-- 1 root root 10576 Apr 6 11:42 nagioslogserver.2015-04-06.1428334951.tar.gz
-rw-r--r-- 1 root root 10611 Apr 6 11:43 nagioslogserver.2015-04-06.1428334973.tar.gz
-rw-r--r-- 1 root root 29912 Apr 6 14:39 nagioslogserver.2015-04-06.1428345571.tar.gz
I also notice that the backup script is running under the nagios user. What happens if you run the script as root?

Code: Select all

bash /usr/local/nagioslogserver/scripts/create_backup.sh
jvestrum
Posts: 46
Joined: Tue Mar 03, 2015 10:45 am

Re: create_backup.sh hanging

Post by jvestrum »

I cancelled the backup and re-ran it as root, but it's still stuck in the same place.

Elasticsearch (which is running as user nagios) is successfully writing the files, as you can see below. I just need to figure out how to tell Knapsack to clear out those 16 hanging exports, so it can break out of that wait loop.
[a5cltzz@gtcs-nls01 ~]$ ls -d /store; ls -lR /store
/store
/store:
total 4
drwxr-xr-x 3 nagios nagios 4096 Mar 12 11:13 backups

/store/backups:
total 4
drwxr-xr-x 6 nagios nagios 4096 Apr 6 13:54 nagioslogserver

/store/backups/nagioslogserver:
total 7552
drwxrwxrwx 2 nagios users 4096 Apr 6 09:49 1428331795
drwxrwxrwx 2 nagios users 4096 Apr 6 13:00 1428343251
drwxrwxrwx 2 root root 4096 Apr 6 13:54 1428346483
-rw-r--r-- 1 nagios users 3857481 Apr 6 09:27 nagioslogserver.2015-04-06.1428329678.tar.gz
-rw-r--r-- 1 nagios users 3860690 Apr 6 09:43 nagioslogserver.2015-04-06.1428330449.tar.gz

/store/backups/nagioslogserver/1428331795:
total 4256
-rw-r--r-- 1 nagios users 2949 Apr 6 09:49 kibana-int.tar.gz
-rw-r--r-- 1 nagios users 4338537 Apr 6 09:50 nagioslogserver_log.tar.gz
-rw-r--r-- 1 nagios users 7594 Apr 6 09:49 nagioslogserver.tar.gz
-rw-r--r-- 1 nagios users 2807 Apr 6 09:50 state.json

/store/backups/nagioslogserver/1428343251:
total 4280
-rw-r--r-- 1 nagios users 2949 Apr 6 13:00 kibana-int.tar.gz
-rw-r--r-- 1 nagios users 4362246 Apr 6 13:00 nagioslogserver_log.tar.gz
-rw-r--r-- 1 nagios users 6243 Apr 6 13:00 nagioslogserver.tar.gz
-rw-r--r-- 1 nagios users 2994 Apr 6 13:54 state.json

/store/backups/nagioslogserver/1428346483:
total 4284
-rw-r--r-- 1 nagios users 2954 Apr 6 13:54 kibana-int.tar.gz
-rw-r--r-- 1 nagios users 4368903 Apr 6 13:54 nagioslogserver_log.tar.gz
-rw-r--r-- 1 nagios users 6235 Apr 6 13:54 nagioslogserver.tar.gz
-rw-r--r-- 1 root root 2994 Apr 6 13:55 state.json
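The listings show the archives for each run being written while state.json still lists exports. To tell whether the lingering entries belong to the run in progress or to earlier ones, the entries can be counted by backup directory. A minimal sketch, assuming the compact "path":"file://..." form from the first post; count_exports_for_run is a hypothetical name, and this is a text match rather than a JSON parser.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nagios Log Server or Knapsack).
# Counts the exports in a state file whose path lies under one backup
# directory, so stale entries from earlier runs can be told apart from
# entries created by the current run.
count_exports_for_run() {
    local state_file="$1" run_dir="$2"
    grep -o '"path":"[^"]*"' "$state_file" \
        | grep -c -F "\"path\":\"file://${run_dir}/"
}

# Example:
#   count_exports_for_run state.json /store/backups/nagioslogserver/1428346483
```

A result of 0 for the current run's directory would suggest that only old entries are keeping the total above zero. The thread does not establish whether new entries always clear, so treat this as a diagnostic rather than a fix.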
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: create_backup.sh hanging

Post by tmcdonald »

Can we get your knapsack/other plugins' versions?

Code: Select all

curl 'localhost:9200/_cat/plugins?v'
For reference, this is me:

Code: Select all

knapsack-1.3.2.0-d5501ef
Former Nagios employee
jvestrum
Posts: 46
Joined: Tue Mar 03, 2015 10:45 am

Re: create_backup.sh hanging

Post by jvestrum »

tmcdonald wrote:Can we get your knapsack/other plugins' versions?

Code: Select all

curl 'localhost:9200/_cat/plugins?v'
For reference, this is me:

Code: Select all

knapsack-1.3.2.0-d5501ef
Same version:

Code: Select all

name                                 component                version type url 
a6a1ee31-789f-4927-8680-25814f651b54 knapsack-1.3.2.0-d5501ef 1.3.2.0 j        
fd218450-44e4-4ed2-805a-74c1a72a2b63 knapsack-1.3.2.0-d5501ef 1.3.2.0 j        
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: create_backup.sh hanging

Post by jolson »

I have been trying to reproduce this issue in-house with no success.

What happens if you navigate to the backup directory (containing state.json) and run the following:

Code: Select all

python -m jsonselect.__main__ .count < state.json
Does it still show a high number of export jobs?

I have found that if the backup script is interrupted, an export will hang until elasticsearch restarts - this prevents further backups from occurring.

Could you please cat your backup script so that we can ensure there aren't any differences?

Code: Select all

cat /usr/local/nagioslogserver/scripts/create_backup.sh
I am still investigating this on my end - aborts are not working as they should be. The command you mentioned in your first post:

Code: Select all

curl -XPOST 'http://localhost:9200/_export/abort'
Does not abort the running exports properly. I will do further investigating.
jvestrum
Posts: 46
Joined: Tue Mar 03, 2015 10:45 am

Re: create_backup.sh hanging

Post by jvestrum »

jolson wrote:I have been trying to reproduce this issue in-house with no success.

What happens if you navigate to the backup directory (containing state.json) and run the following:

Code: Select all

python -m jsonselect.__main__ .count < state.json
Does it still show a high number of export jobs?
Yes, it's now up to 21:

Code: Select all

#     python -m jsonselect.__main__ .count < state.json
21
jolson wrote: I have found that if the backup script is interrupted, an export will hang until elasticsearch restarts - this prevents further backups from occurring.

Could you please cat your backup script so that we can ensure there aren't any differences?

Code: Select all

cat /usr/local/nagioslogserver/scripts/create_backup.sh
Yes, here it is. I haven't modified it (as far as I remember).

Code: Select all

# cat /usr/local/nagioslogserver/scripts/create_backup.sh
#!/bin/sh
#
# Bash script for creating Nagios Log Server backups
# Copyright 2014 - Nagios Enterprises LLC
#
# These backups are used to store the main databases for Nagios Log Server including the kibana
# database, log server's internal database, and log server's internal log database
#

INDEXNAMES=( "nagioslogserver" "kibana-int" "nagioslogserver_log" )
LOGSERVER_DIR="/usr/local/nagioslogserver"
BACKUP_DIR="/store/backups/nagioslogserver"
TIMESTAMP=$(date +%s)
DATE=$(date +%F)

# Create mapping files with the index mapping data
echo "Starting Nagios Log Server Backup"
echo "---------------------------------"
mkdir -p "$BACKUP_DIR/$TIMESTAMP"
chmod 777 "$BACKUP_DIR/$TIMESTAMP"

# Create a backup of each of the indexes and store them in our temp directory
echo -n "Backing up indexes."
cd "$BACKUP_DIR/$TIMESTAMP"
for index in "${INDEXNAMES[@]}"; do
    echo -n "$index ... "
    curl -XPOST http://localhost:9200/$index/_export?path=$BACKUP_DIR/$TIMESTAMP/$index.tar.gz > /dev/null 2>&1
done
echo "Completed."

# Wait for elasticsearch export jobs to finish...
echo "Waiting for backup. This may take a while."
count=3
while [[ $count -gt 0 ]]; do
        curl -s -XPOST 'http://localhost:9200/_export/state' > state.json
        count=$(python -m jsonselect.__main__ .count < state.json)
        echo -n "."
        sleep 5
done

# Compress entire directory into a single file
rm -rf state.json
cd $BACKUP_DIR
dirname="nagioslogserver.$DATE.$TIMESTAMP"
mv $TIMESTAMP $dirname
tar czf "$BACKUP_DIR/$dirname.tar.gz" $dirname
rm -rf $dirname

echo ""
echo "Backup completed."
jolson wrote: I am still investigating this on my end - aborts are not working as they should be. The command you mentioned in your first post:

Code: Select all

curl -XPOST 'http://localhost:9200/_export/abort'
Does not abort the running exports properly. I will do further investigating.
Yeah, I've restarted elasticsearch, on both instances, both before and after running that "abort" command, but they still don't go away. I have not yet tried rebooting the servers, but I could do that anytime.
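The wait loop in the posted script polls until the reported count reaches 0 and has no upper bound, which is why lingering entries keep it running indefinitely. A bounded version might look like the sketch below. It is an illustration only, not the shipped script: wait_for_exports and pending_exports are hypothetical names, the 600-second budget is an arbitrary assumption, and the 5-second interval is taken from the script's own "sleep 5". It targets bash, since the posted script already relies on bash features such as arrays and [[ ]] despite its #!/bin/sh line.

```shell
#!/bin/bash
# Hypothetical sketch of a bounded wait loop.
# poll_cmd: name of a command or function that prints the pending count.
# Returns 0 when the count reaches 0, 1 on timeout, and 2 if the poll
# output is not a number (for example while elasticsearch is restarting).
wait_for_exports() {
    local poll_cmd="$1" max_wait="${2:-600}" interval="${3:-5}"
    local waited=0 count
    while [ "$waited" -lt "$max_wait" ]; do
        count=$("$poll_cmd") || count=""
        case "$count" in
            ''|*[!0-9]*) return 2 ;;
        esac
        if [ "$count" -eq 0 ]; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}

# Possible poll function, reusing the two commands from the posted script:
# pending_exports() {
#     curl -s -XPOST 'http://localhost:9200/_export/state' \
#         | python -m jsonselect.__main__ .count
# }
# wait_for_exports pending_exports 600 5 || echo "Gave up waiting for exports."
```

On a timeout the caller could still decide whether to archive the directory, since the listings in this thread show the .tar.gz files being written even while the count stays above zero.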
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: create_backup.sh hanging

Post by jolson »

Could you please try rebooting the servers? I think that we have all of the information we can collect regarding this issue, and a reboot may very well resolve it.