How to stop a currently running snapshot?

Posted: Tue Oct 22, 2019 10:02 am
by rferebee
Good morning, I have an issue I'm trying to resolve, but I can't find an answer anywhere.

There seems to be a bit of a snag in my snapshot process. Two snapshots from earlier in the week are running simultaneously. How can I stop them? They show as finished in the Command Subsystem, but as IN_PROGRESS on the Snapshots & Maintenance page.

Code: Select all

{
    "snapshot" : "curator-20191021053102",
    "version_id" : 1070699,
    "version" : "1.7.6",
    "indices" : [ "logstash-2019.09.30", "logstash-2019.10.01", "logstash-2019.10.02", "logstash-2019.10.03", "logstash-2019.10.04", "logstash-2019.10.05", "logstash-2019.10.06", "logstash-2019.10.07", "logstash-2019.10.08", "logstash-2019.10.09", "logstash-2019.10.10", "logstash-2019.10.11", "logstash-2019.10.12", "logstash-2019.10.13", "logstash-2019.10.14", "logstash-2019.10.15", "logstash-2019.10.16", "logstash-2019.10.17", "logstash-2019.10.18", "logstash-2019.10.19", "logstash-2019.10.20" ],
    "state" : "IN_PROGRESS",
    "start_time" : "2019-10-21T05:30:56.978Z",
    "start_time_in_millis" : 1571635856978,
    "failures" : [ ],
    "shards" : {
      "total" : 0,
      "failed" : 0,
      "successful" : 0
    }
  }, {
    "snapshot" : "curator-20191022053043",
    "version_id" : 1070699,
    "version" : "1.7.6",
    "indices" : [ "logstash-2019.10.01", "logstash-2019.10.02", "logstash-2019.10.03", "logstash-2019.10.04", "logstash-2019.10.05", "logstash-2019.10.06", "logstash-2019.10.07", "logstash-2019.10.08", "logstash-2019.10.09", "logstash-2019.10.10", "logstash-2019.10.11", "logstash-2019.10.12", "logstash-2019.10.13", "logstash-2019.10.14", "logstash-2019.10.15", "logstash-2019.10.16", "logstash-2019.10.17", "logstash-2019.10.18", "logstash-2019.10.19", "logstash-2019.10.20", "logstash-2019.10.21" ],
    "state" : "IN_PROGRESS",
    "start_time" : "2019-10-22T05:30:44.476Z",
    "start_time_in_millis" : 1571722244476,
    "failures" : [ ],
    "shards" : {
      "total" : 0,
      "failed" : 0,
      "successful" : 0
    }
  } ]
}
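
For reference, output like the above can be pulled from the Elasticsearch snapshot API. The repository name below is a placeholder, not something from this thread, so substitute whatever the first command returns:

```shell
# List the snapshot repositories registered with the cluster.
curl -s 'localhost:9200/_snapshot?pretty'

# Then list every snapshot in one of them. "<repo_name>" is a
# placeholder -- substitute a repository name from the command above.
curl -s 'localhost:9200/_snapshot/<repo_name>/_all?pretty'
```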
It's causing all sorts of issues right now. My cluster status is RED and the whole environment is running very slowly.

Re: How to stop a currently running snapshot?

Posted: Tue Oct 22, 2019 10:31 am
by scottwilkerson
Generally the way to fix this would be to restart elasticsearch on each node, one by one. However, with the cluster status being red, I believe you may have a different issue. Can you show the output of the following:

Code: Select all

curl localhost:9200/_cluster/health?pretty
Also, run the following on each instance and show the output of each:

Code: Select all

df -h

Re: How to stop a currently running snapshot?

Posted: Tue Oct 22, 2019 10:36 am
by rferebee
Here you go:

Code: Select all

root@nagioslscc2:/root> curl localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "e4f9550c-f37c-417f-9cdc-283429a2a0a1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 122,
  "active_shards" : 205,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 43,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

root@nagioslscc2:/root> df -h
Filesystem                               Size  Used Avail Use% Mounted on
devtmpfs                                  32G     0   32G   0% /dev
tmpfs                                     32G     0   32G   0% /dev/shm
tmpfs                                     32G   50M   32G   1% /run
tmpfs                                     32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/centos_nagioslssc2temp-root   50G  6.0G   45G  12% /
/dev/sda1                               1014M  216M  799M  22% /boot
/dev/mapper/nagiosvg-nagioslog           6.8T  4.6T  2.0T  71% /usr/local/nagioslogserver
/dev/mapper/centos_nagioslssc2temp-home   39G   33M   39G   1% /home
//10.128.207.113/NLSREPCC                204T  108T   96T  54% /nlsrepcc
doanfs001:/admin                          79G   39G   41G  49% /admin
tmpfs                                    6.3G     0  6.3G   0% /run/user/6603
tmpfs                                    6.3G     0  6.3G   0% /run/user/996

Code: Select all

root@nagioslscc1:/root>df -h
Filesystem                      Size  Used Avail Use% Mounted on
devtmpfs                         32G     0   32G   0% /dev
tmpfs                            32G     0   32G   0% /dev/shm
tmpfs                            32G   61M   32G   1% /run
tmpfs                            32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/centos7-root        1.9G  743M  1.1G  42% /
/dev/mapper/centos7-usr         8.8G  5.0G  3.4G  60% /usr
/dev/sda1                       477M  193M  255M  43% /boot
/dev/mapper/centos7-simpana     4.3G   18M  4.0G   1% /opt/sw
/dev/mapper/centos7-home        4.8G   33M  4.5G   1% /home
/dev/mapper/centos7-tmp         4.8G   21M  4.6G   1% /tmp
/dev/mapper/centos7-var         3.9G  999M  2.7G  28% /var
/dev/mapper/nagiosvg-nagioslog  6.8T  3.6T  2.9T  56% /usr/local/nagioslogserver
//10.128.207.113/NLSREPCC       204T  108T   96T  54% /nlsrepcc
tmpfs                           6.3G     0  6.3G   0% /run/user/6603
tmpfs                           6.3G     0  6.3G   0% /run/user/987

Code: Select all

root@nagioslscc3:/root>df -h
Filesystem                     Size  Used Avail Use% Mounted on
devtmpfs                        32G     0   32G   0% /dev
tmpfs                           32G     0   32G   0% /dev/shm
tmpfs                           32G   53M   32G   1% /run
tmpfs                           32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/centos7-root       1.9G  756M 1015M  43% /
/dev/mapper/centos7-usr        8.8G  5.1G  3.3G  62% /usr
/dev/sda1                      477M  350M   98M  79% /boot
/dev/mapper/centos7-simpana    4.3G   18M  4.0G   1% /opt/sw
/dev/mapper/centos7-home       4.8G   33M  4.5G   1% /home
/dev/mapper/centos7-var        3.9G  804M  2.9G  22% /var
/dev/mapper/centos7-tmp        4.8G   21M  4.6G   1% /tmp
/dev/mapper/nagiosvg-nagioslv  6.7T  2.0T  4.5T  31% /usr/local/nagioslogserver
//10.128.207.113/NLSREPCC      204T  108T   96T  54% /nlsrepcc
tmpfs                          6.4G     0  6.4G   0% /run/user/3033
tmpfs                          6.4G     0  6.4G   0% /run/user/3032

Re: How to stop a currently running snapshot?

Posted: Tue Oct 22, 2019 10:42 am
by scottwilkerson
It currently shows the cluster is trying to bring the shards back online:

Code: Select all

  "initializing_shards" : 6,
  "unassigned_shards" : 43,
I would wait a bit and run that again to see if the unassigned_shards count drops. Otherwise, I would restart elasticsearch one server at a time.
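
One way to keep an eye on it is a simple watch loop against the same health endpoint used above (the 30-second interval is arbitrary):

```shell
# Re-check cluster health every 30 seconds, printing just the status
# and the shard-recovery counters. Adjust host/port to your setup.
watch -n 30 "curl -s 'localhost:9200/_cluster/health?pretty' \
  | grep -E 'status|initializing_shards|unassigned_shards'"
```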

Re: How to stop a currently running snapshot?

Posted: Tue Oct 22, 2019 10:45 am
by rferebee
Ok, I'll keep an eye on it.

How about stopping the stale snapshots? Is there any way to just stop each curator job from the CLI?

Re: How to stop a currently running snapshot?

Posted: Tue Oct 22, 2019 10:55 am
by scottwilkerson
You could kill the curator job, but the snapshot will still be listed as IN_PROGRESS until it either finishes or elasticsearch is restarted on all instances
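
A rough sketch of what killing the job would look like from the CLI. The repository name is a placeholder, and note this adds something not covered above: in Elasticsearch 1.x the snapshot delete API is documented to abort an in-progress snapshot, although the abort can take a while and the state may still read IN_PROGRESS until everything settles or the nodes are restarted:

```shell
# Stop the curator process if it is still running.
pkill -f curator

# Ask elasticsearch to abort/delete the stale snapshots. "<repo_name>"
# is a placeholder for your actual snapshot repository name.
curl -XDELETE 'localhost:9200/_snapshot/<repo_name>/curator-20191021053102'
curl -XDELETE 'localhost:9200/_snapshot/<repo_name>/curator-20191022053043'
```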

Re: How to stop a currently running snapshot?

Posted: Tue Oct 22, 2019 11:02 am
by rferebee
Ok, I'll just let it ride out.

I'll delay my next snapshot to hopefully give these two time to finish whatever the heck they're doing.

Re: How to stop a currently running snapshot?

Posted: Tue Oct 22, 2019 11:04 am
by scottwilkerson
rferebee wrote:Ok, I'll just let it ride out.

I'll delay my next snapshot to hopefully give these two time to finish whatever the heck they're doing.
sounds good

Re: How to stop a currently running snapshot?

Posted: Wed Oct 23, 2019 9:16 am
by rferebee
I think I may have some corrupt shards. After letting it go overnight without running a snapshot, there are still 10 unassigned shards, with 0 relocating and 0 initializing.
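
To see exactly which shards are stuck, the _cat API (available in 1.7) can list them per index; this is a sketch against the same localhost endpoint used earlier:

```shell
# List only the unassigned shards: index name, shard number, and
# whether each is a primary (p) or replica (r).
curl -s 'localhost:9200/_cat/shards?v' | grep UNASSIGNED
```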

Re: How to stop a currently running snapshot?

Posted: Wed Oct 23, 2019 9:23 am
by rferebee
Also, the two snapshots I told you about that were still "IN_PROGRESS" are still running.