Nagios Support Forum

Posted: **Fri Aug 07, 2015 10:56 am**

[root@logging nagioslogserver]# ./restorenagioslogserver.sh
Restoring nagioslogserver ...
[root@logging nagioslogserver]# cat state.json
{"count":0,"states":[]}

Not sure why, but it isn't working for me. Maybe it is possible to disable the auto-generation of the "nagioslogserver" index?

If not, maybe it's time to start from scratch.

Posted: **Fri Aug 07, 2015 11:18 am**

I found a configuration setting that will allow us to disable automatic regeneration of the 'nagioslogserver' index.

Run the following command on *all* of your nodes:

Code: Select all

echo "action.auto_create_index: false" >> /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
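Note that `>>` appends a new line every time the command is run, so repeating it leaves duplicate `action.auto_create_index` entries in elasticsearch.yml. A minimal sketch of an append that only happens once, assuming the setting is written exactly as above; the function name `ensure_setting` is hypothetical and not part of Nagios Log Server:

```shell
# Hypothetical helper: append a line to a config file only if that exact
# line is not already present in the file.
# Usage: ensure_setting <line> <file>
ensure_setting() {
    line="$1"
    file="$2"
    # -q: quiet, -x: match the whole line, -F: treat the line as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

It could be called as `ensure_setting "action.auto_create_index: false" /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml` on each node.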

After the above has been run, try stopping elasticsearch on each node:

Code: Select all

service elasticsearch stop

Once elasticsearch has been fully stopped, restart it:

Code: Select all

service elasticsearch start
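Elasticsearch can take some time after `service elasticsearch start` before it accepts connections on port 9200, so a request sent immediately may fail with a connection error. A rough sketch of a retry loop, assuming the default `localhost:9200` address used elsewhere in this thread; `wait_until` is a hypothetical helper name, and the attempt count and delay are arbitrary:

```shell
# Hypothetical helper: run a command repeatedly until it succeeds.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
    attempts="$1"
    delay="$2"
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Discard the command's output; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_until 30 2 curl -s -f http://localhost:9200/` would retry for up to about a minute before giving up.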

Delete the problem index:

Code: Select all

curl -XDELETE "http://localhost:9200/nagioslogserver/"

Ensure that it stays deleted:

Code: Select all

curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'
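An empty result from the command above means the index is no longer listed. Since the concern is that the index gets recreated on its own, it may help to repeat the check a few times. A sketch, where `index_listed` and `stays_deleted` are hypothetical names and the number of checks and the delay are up to you:

```shell
# Reads cluster health output on stdin; succeeds if 'nagioslogserver'
# (but not 'nagioslogserver_log') appears in it.
index_listed() {
    grep 'nagioslogserver' | grep -v '_log' | grep -q .
}

# Repeats the check; fails as soon as the index reappears.
# Note: if Elasticsearch is not reachable, curl prints nothing and the
# index is reported as absent, so confirm the service is up first.
# Usage: stays_deleted <checks> <delay_seconds>
stays_deleted() {
    n=0
    while [ "$n" -lt "$1" ]; do
        if curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | index_listed; then
            return 1
        fi
        n=$((n + 1))
        sleep "$2"
    done
    return 0
}
```

For example, `stays_deleted 6 10 && echo "still deleted"` checks six times, ten seconds apart.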

Once you are sure that it is staying deleted, run our script and wait a couple of minutes:

Code: Select all

./restorenagioslogserver.sh

If all things go well, you should be up and running again.

Posted: **Fri Aug 07, 2015 12:20 pm**

Code: Select all

[root@logging nagioslogserver]# echo "action.auto_create_index: false" >> /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
[root@logging nagioslogserver]# service elasticsearch stop
Stopping elasticsearch:                                    [  OK  ]
[root@logging nagioslogserver]# service elasticsearch start
Starting elasticsearch:                                    [  OK  ]
[root@logging nagioslogserver]# curl -XDELETE "http://localhost:9200/nagioslogserver/"
curl: (7) couldn't connect to host
[root@logging nagioslogserver]# curl -XDELETE "http://localhost:9200/nagioslogserver/"
{"acknowledged":true}
[root@logging nagioslogserver]# curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'
[root@logging nagioslogserver]# cd /usr/local/sbin
[root@logging sbin]# ./restorenagioslogserver.sh
Restoring nagioslogserver ... [root@logging sbin]#
[root@logging sbin]#
[root@logging sbin]# curl "http://localhost:9200/nagioslogserver/user/_search?pretty"
{
  "error" : "SearchPhaseExecutionException[Failed to execute phase [query_fetch], all shards failed]",
  "status" : 503
}
[root@logging sbin]# curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'
    "nagioslogserver" : {

I tried this a few times with no luck. I tried creating the "someuser" user and still get the UnavailableShardsException. So strange.

Posted: **Fri Aug 07, 2015 1:03 pm**

Is it possible that the backup that we're restoring from is corrupt?

Let's check on the status of the restored index:

Code: Select all

curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' -A10

Is the health status of the 'nagioslogserver' index still red? Mine took a couple of minutes to spin up properly - but if the index is still in a red state after the restore, it would be worth trying to restore from a different backup to see if that makes a difference. To do that, you can edit my script and point it at a different /store/backups/nagioslogserver folder that you untar.
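To read off just the status of one index rather than scanning the output by eye, something like the following may work. It is a sketch that assumes the pretty-printed layout seen earlier in this thread, where each index appears as `"nagioslogserver" : {` with a `"status"` field within the next few lines; `index_status` is a hypothetical helper name:

```shell
# Reads '_cluster/health?level=indices&pretty' output on stdin and prints
# the status (green, yellow or red) of the named index.
# Assumes the index's "status" field is within 10 lines of its name.
# Usage: ... | index_status <index_name>
index_status() {
    # The closing quote in the pattern keeps 'nagioslogserver' from also
    # matching 'nagioslogserver_log'.
    grep -A10 "\"$1\" :" | grep -m1 '"status"' | sed 's/.*"status" : "\([a-z]*\)".*/\1/'
}
```

For example: `curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | index_status nagioslogserver`.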

I expect the 'UnavailableShardsException' is occurring due to corruption of the 'nagioslogserver' index. Let me know if there are any other backups you can try restoring from - I fear that running out of disk space may have permanently affected the system.

Another thought that I have is that you have many indices in a corrupt state currently - since the system is unrecoverable at this point, we could try deleting *all* of the red indices and trying to restore from your backup 'nagioslogserver' index afterward. Does that make sense?

Below is a list of all of your corrupt indices:

Code: Select all

logstash-2015.07.24
logstash-2015.07.25
logstash-2015.07.26
logstash-2015.07.27
logstash-2015.07.28
logstash-2015.07.29
logstash-2015.07.30
logstash-2015.07.31
logstash-2015.08.01
logstash-2015.08.02
logstash-2015.08.03
logstash-2015.08.04
logstash-2015.08.05
logstash-2015.08.06

You are free to run the delete command against all of those indices if you aren't concerned about the data in any of them. It's possible that elasticsearch isn't allocating the 'nagioslogserver' index properly due to all of the above indices.
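Running the delete against that many indices by hand is tedious, so a loop may help. The sketch below reuses the same DELETE URL form as the earlier command; `delete_indices` and the `DRY_RUN` variable are hypothetical names. Because deletion cannot be undone, it only prints the commands unless `DRY_RUN=0` is set:

```shell
# Hypothetical helper: delete each named index, or only print the command
# when DRY_RUN is 1 (the default here, since deletion is irreversible).
# Usage: delete_indices <index> [<index> ...]
delete_indices() {
    for idx in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "curl -XDELETE \"http://localhost:9200/${idx}/\""
        else
            curl -XDELETE "http://localhost:9200/${idx}/"
            # curl does not print a trailing newline after the response
            echo
        fi
    done
}
```

For example, `delete_indices logstash-2015.07.24 logstash-2015.07.25` prints the two commands for review, and `DRY_RUN=0 delete_indices logstash-2015.07.24 logstash-2015.07.25` actually runs them. The remaining names from the list above can be added in the same way.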

Let me know what you think.

Jesse

Posted: **Fri Aug 07, 2015 2:59 pm**

Hi Jesse,

I'm assuming you are correct and the backups are probably corrupt.

You have been really great in supporting this issue, and I have no doubt that it would have worked if I had a healthy backup.

I can't spend any more time trying to troubleshoot this issue. I copied the inputs/outputs/filters .conf files and I'm going to start over.

Thanks again,

Kyle

Posted: **Mon Aug 10, 2015 9:17 am**

Kyle,

That sounds like a plan. Let me know if you need any assistance along the way. I'll lock this thread.

Jesse

Nagios Log Server removed Nagiosadmin + shardexception error
