Repository full

Post by **teirekos** » Thu Sep 01, 2016 3:09 am

Hello,
I have a 3-node NLS cluster running the latest version.
The other day we had an incident in our environment, and the logs on some servers grew much faster than anticipated, so my repository eventually reached 100% usage.

As a result, all snapshots disappeared from the GUI (image attached).
Since I was able to, I first extended the file system, which brought usage down to 91%.
Then, under Backup & Maintenance, I changed the settings so that some snapshots would be deleted and ran the jobs again (see log output below).

---------------

Code: Select all

tail: /usr/local/nagioslogserver/var/jobs.log: file truncated
Running command do_maintenance with args ' ' for job id: backup_maintenance
2016-09-01 09:39:01,864 INFO      Job starting: optimize indices
2016-09-01 09:39:01,864 WARNING   Overriding default connection timeout.  New timeout: 21600
2016-09-01 09:39:01,933 INFO      Action optimize will be performed on the following indices: [u'logstash-2016.08.12', u'logstash-2016.08.13', u'logstash-2016.08.14', u'logstash-2016.08.15', u'logstash-2016.08.16', u'logstash-2016.08.17', u'logstash-2016.08.18', u'logstash-2016.08.19', u'logstash-2016.08.20', u'logstash-2016.08.21', u'logstash-2016.08.22', u'logstash-2016.08.23', u'logstash-2016.08.24', u'logstash-2016.08.25', u'logstash-2016.08.26', u'logstash-2016.08.27', u'logstash-2016.08.28', u'logstash-2016.08.29', u'logstash-2016.08.30']
2016-09-01 09:39:02,326 INFO      Job completed successfully.
ine 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python2.6/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib64/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.6/site-packages/click/core.py", line 86, in augment_usage_errors
    yield
  File "/usr/lib/python2.6/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/curator/cli/index_selection.py", line 167, in indices
    retval = do_command(client, ctx.parent.info_name, working_list, ctx.parent.params, master_timeout)
  File "/usr/lib/python2.6/site-packages/curator/cli/utils.py", line 250, in do_command
    skip_repo_validation=params['skip_repo_validation'],
  File "/usr/lib/python2.6/site-packages/curator/api/snapshot.py", line 72, in create_snapshot
    if name in all_snaps:
TypeError: argument of type 'bool' is not iterable
2016-09-01 09:39:02,424 INFO      Job starting: snapshot indices
2016-09-01 09:39:02,424 WARNING   Overriding default connection timeout.  New timeout: 21600
2016-09-01 09:39:02,436 INFO      Action snapshot will be performed on the following indices: [u'logstash-2016.08.12', u'logstash-2016.08.13', u'logstash-2016.08.14', u'logstash-2016.08.15', u'logstash-2016.08.16', u'logstash-2016.08.17', u'logstash-2016.08.18', u'logstash-2016.08.19', u'logstash-2016.08.20', u'logstash-2016.08.21', u'logstash-2016.08.22', u'logstash-2016.08.23', u'logstash-2016.08.24', u'logstash-2016.08.25', u'logstash-2016.08.26', u'logstash-2016.08.27', u'logstash-2016.08.28', u'logstash-2016.08.29', u'logstash-2016.08.30', u'logstash-2016.08.31']
2016-09-01 09:39:03,140 INFO      Snapshot name: curator-20160901063903
2016-09-01 09:39:03,297 ERROR     Unable to find all snapshots in repository: NLSSnaps
2016-09-01 09:39:03,756 INFO      Job starting: delete snapshots
2016-09-01 09:39:03,917 ERROR     Unable to find all snapshots in repository: NLSSnaps
2016-09-01 09:39:03,917 ERROR     No snapshots found in Elasticsearch.
No snapshots found in Elasticsearch.
2016-09-01 09:39:04,004 INFO      Job starting: delete indices
2016-09-01 09:39:04,016 INFO      Pruning Kibana-related indices to prevent accidental deletion.
2016-09-01 09:39:04,016 INFO      Action delete will be performed on the following indices: [u'logstash-2016.08.12', u'logstash-2016.08.13', u'logstash-2016.08.14', u'logstash-2016.08.15', u'logstash-2016.08.16', u'logstash-2016.08.17']
2016-09-01 09:39:04,017 INFO      Deleting indices as a batch operation:
2016-09-01 09:39:04,017 INFO      ---deleting index logstash-2016.08.12
2016-09-01 09:39:04,017 INFO      ---deleting index logstash-2016.08.13
2016-09-01 09:39:04,018 INFO      ---deleting index logstash-2016.08.14
2016-09-01 09:39:04,018 INFO      ---deleting index logstash-2016.08.15
2016-09-01 09:39:04,018 INFO      ---deleting index logstash-2016.08.16
2016-09-01 09:39:04,018 INFO      ---deleting index logstash-2016.08.17
2016-09-01 09:39:07,130 INFO      Job completed successfully.
tail: /usr/local/nagioslogserver/var/jobs.log: file truncated
Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS
Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS
Running command run_alerts with args ' ' for job id: run_all_alerts
SUCCESS

---------------------

I was also getting the message:
"[2016-09-01 09:22:37,824][INFO ][cluster.routing.allocation.decider] [845bc07c-ed91-4920-8e23-747c9cc699f5] low disk watermark [85%] exceeded on [lXjC93b5QMm2hsgX9odwZA][845bc07c-ed91-4920-8e23-747c9cc699f5] free: 11.5gb[11.7%], replicas will not be assigned to this node"

I was not aware that there was a threshold of 85%, so I had to delete some indices to restore cluster health.
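
For anyone reading along, the arithmetic behind that message can be sketched in a few lines of Python. This is only an illustration, not part of NLS or Elasticsearch: the function name `parse_watermark` and the regular expression are assumptions made for this example, and the log line is the one quoted above. It reads the watermark and the free-space percentage from the message and derives the used percentage as 100 minus free.

```python
import re

# The log line quoted in the post above, split only for readability.
LINE = (
    "[2016-09-01 09:22:37,824][INFO ][cluster.routing.allocation.decider] "
    "[845bc07c-ed91-4920-8e23-747c9cc699f5] low disk watermark [85%] exceeded on "
    "[lXjC93b5QMm2hsgX9odwZA][845bc07c-ed91-4920-8e23-747c9cc699f5] "
    "free: 11.5gb[11.7%], replicas will not be assigned to this node"
)

# Hypothetical pattern written for this message format only.
PATTERN = re.compile(
    r"low disk watermark \[(?P<mark>[\d.]+)%\].*free: [^\[]+\[(?P<free>[\d.]+)%\]"
)


def parse_watermark(line):
    """Return (watermark_percent, used_percent), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    watermark = float(match.group("mark"))
    used = 100.0 - float(match.group("free"))  # used space is what the watermark limits
    return watermark, used


if __name__ == "__main__":
    watermark, used = parse_watermark(LINE)
    print("watermark: %.1f%%, used: %.1f%%, exceeded: %s"
          % (watermark, used, used > watermark))
```

With 11.7% free, roughly 88.3% of the disk is in use, which is above the 85% low watermark, so the node stops receiving replica shards until space is freed.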

Anyway, the Health Status is now green, but I still cannot see my snapshots.
From the CLI, the repository directory looks like this:

----------------------

Code: Select all

total 208
drwx------   2 root   root    4096 Sep 18  2015 lost+found
-rw-r--r--   1 nagios users     22 May  1 17:04 tests-kZZz9XEMSSCVjbBJDvEDXw-mfeDlC7lScWV9EXNnkLcdQ
-rw-r--r--   1 nagios users     22 May  1 17:04 tests-kZZz9XEMSSCVjbBJDvEDXw-agVsrjUBQvSfPLFQeY2p1A
-rw-r--r--   1 nagios users     22 May 10 17:24 tests-XZeYQA6NQmGZsZZj8djxbQ-siq0YOCBQ4yCWb2epxpTqA
-rw-r--r--   1 nagios users     22 May 10 17:24 tests-XZeYQA6NQmGZsZZj8djxbQ-jGfhi6kTQV-7fA4ps7nVQw
-rw-r--r--   1 nagios users     22 Jul 29 17:26 tests-HzSOPt_iQJKjiUvdH5qD6g-P9hEYLDnQxqlb97xb2P9Qw
-rw-r--r--   1 nagios users    443 Aug 11 17:26 metadata-curator-20160811142626
-rw-r--r--   1 nagios users    506 Aug 11 17:27 snapshot-curator-20160811142626
-rw-r--r--   1 nagios users    443 Aug 12 17:25 metadata-curator-20160812142557
-rw-r--r--   1 nagios users    312 Aug 12 17:28 snapshot-curator-20160812142557
-rw-r--r--   1 nagios users    443 Aug 13 17:25 metadata-curator-20160813142543
-rw-r--r--   1 nagios users    313 Aug 13 17:28 snapshot-curator-20160813142543
-rw-r--r--   1 nagios users    443 Aug 14 17:25 metadata-curator-20160814142552
-rw-r--r--   1 nagios users    605 Aug 14 17:27 snapshot-curator-20160814142552
-rw-r--r--   1 nagios users    443 Aug 15 17:25 metadata-curator-20160815142529
-rw-r--r--   1 nagios users    311 Aug 15 17:26 snapshot-curator-20160815142529
-rw-r--r--   1 nagios users    443 Aug 16 17:25 metadata-curator-20160816142524
-rw-r--r--   1 nagios users    312 Aug 16 17:26 snapshot-curator-20160816142524
-rw-r--r--   1 nagios users    443 Aug 17 17:25 metadata-curator-20160817142536
-rw-r--r--   1 nagios users    314 Aug 17 17:27 snapshot-curator-20160817142536
-rw-r--r--   1 nagios users    443 Aug 18 17:26 metadata-curator-20160818142620
-rw-r--r--   1 nagios users    312 Aug 18 17:27 snapshot-curator-20160818142620
-rw-r--r--   1 nagios users    447 Aug 19 17:27 metadata-curator-20160819142704
-rw-r--r--   1 nagios users    312 Aug 19 17:28 snapshot-curator-20160819142704
-rw-r--r--   1 nagios users    443 Aug 20 17:26 metadata-curator-20160820142630
-rw-r--r--   1 nagios users    587 Aug 20 17:27 snapshot-curator-20160820142630
-rw-r--r--   1 nagios users    447 Aug 21 17:26 metadata-curator-20160821142617
-rw-r--r--   1 nagios users    307 Aug 21 17:27 snapshot-curator-20160821142617
-rw-r--r--   1 nagios users    443 Aug 22 17:25 metadata-curator-20160822142539
-rw-r--r--   1 nagios users    625 Aug 22 17:27 snapshot-curator-20160822142539
-rw-r--r--   1 nagios users    443 Aug 23 19:31 metadata-curator-20160823163132
-rw-r--r--   1 nagios users    311 Aug 23 19:40 snapshot-curator-20160823163132
-rw-r--r--   1 nagios users    443 Aug 24 19:31 metadata-curator-20160824163158
-rw-r--r--   1 nagios users    312 Aug 24 19:36 snapshot-curator-20160824163158
-rw-r--r--   1 nagios users    443 Aug 25 19:32 metadata-curator-20160825163230
-rw-r--r--   1 nagios users    320 Aug 25 19:36 snapshot-curator-20160825163230
-rw-r--r--   1 nagios users    443 Aug 26 19:33 metadata-curator-20160826163330
-rw-r--r--   1 nagios users    324 Aug 26 19:38 snapshot-curator-20160826163330
-rw-r--r--   1 nagios users    443 Aug 27 19:33 metadata-curator-20160827163351
-rw-r--r--   1 nagios users    328 Aug 27 19:36 snapshot-curator-20160827163351
-rw-r--r--   1 nagios users    443 Aug 28 19:34 metadata-curator-20160828163410
-rw-r--r--   1 nagios users    550 Aug 28 19:35 snapshot-curator-20160828163410
-rw-r--r--   1 nagios users    443 Aug 29 19:32 metadata-curator-20160829163222
-rw-r--r--   1 nagios users    333 Aug 29 19:33 snapshot-curator-20160829163222
-rw-r--r--   1 nagios users    443 Aug 30 19:31 metadata-curator-20160830163152
-rw-r--r--   1 nagios users    334 Aug 30 19:40 snapshot-curator-20160830163152
-rw-r--r--   1 nagios users    443 Aug 31 19:35 metadata-curator-20160831163558
drwxr-xr-x 167 nagios nagios 20480 Aug 31 19:35 indices
-rw-r--r--   1 nagios users      0 Aug 31 19:42 snapshot-curator-20160831163558
-rw-r--r--   1 nagios users      0 Aug 31 19:42 index
-rw-r--r--   1 nagios users      0 Sep  1 08:39 tests-svR0E6cGTkSd_nec4Hj-7g-master

----------------------------------

I guess the problem is that the index file is empty? Why did this happen? How can I see the existing repository in the GUI again?

Thanks a lot.
BR,
Kostas

Post by **mcapra** » Thu Sep 01, 2016 9:22 am

Let's start by removing what is obviously a failed snapshot. Please remove the following files from the repository:

Code: Select all

snapshot-curator-20160831163558
metadata-curator-20160831163558

That by itself should restore your backups in the GUI.
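
If it helps to spot this kind of leftover in the future, here is a minimal Python sketch. It assumes the repository is a plain directory laid out like the listing earlier in the thread, where each `snapshot-<name>` file has a matching `metadata-<name>` file. The function name `find_failed_snapshots` and the path in the usage line are hypothetical, and the script only reports candidates; it does not delete anything.

```python
import os


def find_failed_snapshots(repo_dir):
    """List zero-byte snapshot-* files together with their metadata-* partner.

    Returns a list of (snapshot_file, metadata_file_or_None) name pairs.
    """
    pairs = []
    for name in sorted(os.listdir(repo_dir)):
        if not name.startswith("snapshot-"):
            continue
        path = os.path.join(repo_dir, name)
        # A zero-byte snapshot file suggests the snapshot never finished writing.
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            meta = "metadata-" + name[len("snapshot-"):]
            has_meta = os.path.isfile(os.path.join(repo_dir, meta))
            pairs.append((name, meta if has_meta else None))
    return pairs


if __name__ == "__main__":
    # Hypothetical mount point; substitute the real repository location.
    for snapshot_file, metadata_file in find_failed_snapshots("/path/to/repository"):
        print(snapshot_file, metadata_file)
```

Run against the listing above, this would be expected to flag only the pair from Aug 31, since that is the only `snapshot-*` file with a size of 0.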

Next, I would like to see some debug output from the snapshot process. Please share the output of the following command (it might be quite long, so redirect it to a file):

Code: Select all

curator --dry-run --debug snapshot --repository NLSSnaps indices --older-than 1 --time-unit days --timestring %Y.%m.%d
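
One possible way to capture that output in a file is a small Python wrapper, sketched below. The curator arguments are exactly those in the command above; the helper names `build_dry_run_command` and `run_to_file` and the output file name are assumptions for this example. A plain shell redirect would work equally well.

```python
import subprocess


def build_dry_run_command(repository="NLSSnaps", older_than=1):
    """Build the argument list for the dry-run command quoted above."""
    return [
        "curator", "--dry-run", "--debug",
        "snapshot", "--repository", repository,
        "indices", "--older-than", str(older_than),
        "--time-unit", "days", "--timestring", "%Y.%m.%d",
    ]


def run_to_file(command, output_path):
    """Run the command, writing stdout and stderr to output_path; return the exit code."""
    with open(output_path, "w") as out:
        return subprocess.call(command, stdout=out, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    # Hypothetical output file name.
    exit_code = run_to_file(build_dry_run_command(), "curator_debug.txt")
    print("curator exited with", exit_code)
```

Passing the arguments as a list avoids shell quoting issues with the `%Y.%m.%d` timestring.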

Post by **teirekos** » Fri Sep 02, 2016 9:15 am

Hello,
After deleting the failed snapshot files, the backups are visible in the GUI again.
The debug output is attached as requested.

Post by **mcapra** » Fri Sep 02, 2016 10:21 am

Everything looks fairly normal from that output. I would monitor your backups for a few days to ensure they are working as intended.

Post by **teirekos** » Wed Sep 07, 2016 12:43 am

Thanks for your help. Please close this thread.

Nagios Support Forum
