Re: Nagios Log Server removed Nagiosadmin + shardexception e
Posted: Fri Aug 07, 2015 10:56 am
by bostoneng
Code:
[root@logging nagioslogserver]# ./restorenagioslogserver.sh
Restoring nagioslogserver ...
[root@logging nagioslogserver]# cat state.json
{"count":0,"states":[]}
Not sure why, but it isn't working for me. Maybe it is possible to disable the auto-generation of the "nagioslogserver" index?
If not, maybe it's time to start from scratch.
Re: Nagios Log Server removed Nagiosadmin + shardexception e
Posted: Fri Aug 07, 2015 11:18 am
by jolson
I found a configuration setting that will allow us to disable automatic regeneration of the 'nagioslogserver' index.
Run the following command on *all* of your nodes:
Code:
echo "action.auto_create_index: false" >> /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
After the above has been run, try stopping elasticsearch on each node (service elasticsearch stop).
Once elasticsearch has been fully stopped, restart it (service elasticsearch start).
Delete the problem index:
Code:
curl -XDELETE "http://localhost:9200/nagioslogserver/"
Ensure that it stays deleted:
Code:
curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'
Once you are sure that it is staying deleted, run our script (restorenagioslogserver.sh) and wait a couple of minutes.
If all things go well, you should be up and running again.
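For reference, the steps above can be sketched as one script for a single node. This is only a rough outline under a few assumptions: the script path /usr/local/sbin/restorenagioslogserver.sh may differ on your system, the helper names (run, disable_auto_create, wait_for_es, restore_steps) are made up for illustration, and the wait loop is an addition because elasticsearch usually needs a moment after starting before it accepts connections. In a multi-node cluster, the config change and the stop would still need to happen on every node before any node is started again. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the single-node procedure. DRY_RUN=1 (the default here) only
# prints each step; set DRY_RUN=0 and run as root to execute them.
DRY_RUN="${DRY_RUN:-1}"
ES_YML="${ES_YML:-/usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Append the setting only if the file does not already set
# action.auto_create_index, so reruns do not duplicate the line.
disable_auto_create() {
    grep -q '^action\.auto_create_index:' "$1" 2>/dev/null ||
        echo 'action.auto_create_index: false' >> "$1"
}

# Poll until elasticsearch answers on its HTTP port, or give up after $1 tries.
wait_for_es() {
    tries="${1:-30}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        curl -s -o /dev/null "http://localhost:9200/" && return 0
        i=$((i + 1))
        sleep 2
    done
    return 1
}

restore_steps() {
    run disable_auto_create "$ES_YML"
    run service elasticsearch stop
    run service elasticsearch start
    run wait_for_es
    run curl -XDELETE "http://localhost:9200/nagioslogserver/"
    # Assumed location of the restore script; adjust if yours lives elsewhere.
    run /usr/local/sbin/restorenagioslogserver.sh
}

restore_steps
```

Checking that the index stays deleted (the grep command above) is left as a manual step between the delete and the restore.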
Re: Nagios Log Server removed Nagiosadmin + shardexception e
Posted: Fri Aug 07, 2015 12:20 pm
by bostoneng
Code:
[root@logging nagioslogserver]# echo "action.auto_create_index: false" >> /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
[root@logging nagioslogserver]# service elasticsearch stop
Stopping elasticsearch: [ OK ]
[root@logging nagioslogserver]# service elasticsearch start
Starting elasticsearch: [ OK ]
[root@logging nagioslogserver]# curl -XDELETE "http://localhost:9200/nagioslogserver/"
curl: (7) couldn't connect to host
[root@logging nagioslogserver]# curl -XDELETE "http://localhost:9200/nagioslogserver/"
{"acknowledged":true}
[root@logging nagioslogserver]# curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'
[root@logging nagioslogserver]# cd /usr/local/sbin
[root@logging sbin]# ./restorenagioslogserver.sh
Restoring nagioslogserver ... [root@logging sbin]#
[root@logging sbin]#
[root@logging sbin]# curl "http://localhost:9200/nagioslogserver/user/_search?pretty"
{
"error" : "SearchPhaseExecutionException[Failed to execute phase [query_fetch], all shards failed]",
"status" : 503
}
[root@logging sbin]# curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'
"nagioslogserver" : {
I tried this a few times with no luck. I tried creating the "someuser" user and still got the UnavailableShardsException. So strange.
Re: Nagios Log Server removed Nagiosadmin + shardexception e
Posted: Fri Aug 07, 2015 1:03 pm
by jolson
Is it possible that the backup that we're restoring from is corrupt?
Let's check on the status of the restored index:
Code:
curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' -A10
Is the health status of the 'nagioslogserver' index still red? Mine took a couple of minutes to spin up properly - but if the index is still red after the restore, it would be worth restoring from a different backup to see if that makes a difference. To do that, you can edit my script so that it pulls from a different /store/backups/nagioslogserver folder that you untar.
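If you would rather see just the status than read through the grep output, something like the following might help. It is a minimal sketch, assuming the per-index form of the cluster health endpoint (_cluster/health/<index>) on the default local port; the function names extract_status and index_status are made up for illustration.

```shell
#!/bin/sh
# Print the first "status" value found in a cluster health response on stdin.
extract_status() {
    sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p' | head -n 1
}

# Query health for one index (defaults to nagioslogserver) and print its status.
index_status() {
    curl -s "http://localhost:9200/_cluster/health/${1:-nagioslogserver}?pretty" | extract_status
}
```

Running index_status nagioslogserver every so often after the restore should show whether the index moves from red to yellow or green.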
I expect the 'UnavailableShardsException' is occurring due to corruption of the 'nagioslogserver' index. Let me know if there are any other backups you can try restoring from - I fear that running out of disk space may have permanently damaged the system.
Another thought: you currently have many indices in a corrupt state. Since the system is unrecoverable at this point, we could try deleting *all* of the red indices and then restoring from your backup 'nagioslogserver' index. Does that make sense?
Below is a list of all of your corrupt indices:
Code:
logstash-2015.07.24
logstash-2015.07.25
logstash-2015.07.26
logstash-2015.07.27
logstash-2015.07.28
logstash-2015.07.29
logstash-2015.07.30
logstash-2015.07.31
logstash-2015.08.01
logstash-2015.08.02
logstash-2015.08.03
logstash-2015.08.04
logstash-2015.08.05
logstash-2015.08.06
You are free to run the delete command against all of those indices if you aren't concerned about the data in any of them. It's possible that elasticsearch isn't allocating the 'nagioslogserver' index properly due to all of the above indices.
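To save some typing, the delete commands for the indices listed above can be generated in a loop. This is just a sketch: the variable and function names are made up, and it only prints the curl commands so that you can review them first. Deleting an index removes its data permanently.

```shell
#!/bin/sh
# Indices listed above as corrupt.
RED_INDICES="logstash-2015.07.24 logstash-2015.07.25 logstash-2015.07.26
logstash-2015.07.27 logstash-2015.07.28 logstash-2015.07.29
logstash-2015.07.30 logstash-2015.07.31 logstash-2015.08.01
logstash-2015.08.02 logstash-2015.08.03 logstash-2015.08.04
logstash-2015.08.05 logstash-2015.08.06"

# Print one DELETE command per index; nothing is executed here.
delete_commands() {
    for idx in $RED_INDICES; do
        echo "curl -XDELETE \"http://localhost:9200/$idx/\""
    done
}

delete_commands
```

Once the printed list looks right, piping the output to sh (delete_commands | sh) would run the deletes.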
Let me know what you think.
Jesse
Re: Nagios Log Server removed Nagiosadmin + shardexception e
Posted: Fri Aug 07, 2015 2:59 pm
by bostoneng
Hi Jesse,
I'm assuming you are correct and the backups are probably corrupt.
You have been really great in supporting this issue, and I have no doubt that it would have worked if I had a healthy backup.
I can't spend any more time trying to troubleshoot this issue. I copied the inputs/outputs/filters .conf files and I'm going to start over.
Thanks again,
Kyle
Re: Nagios Log Server removed Nagiosadmin + shardexception e
Posted: Mon Aug 10, 2015 9:17 am
by jolson
Kyle,
That sounds like a plan. Let me know if you need any assistance along the way. I'll lock this thread.
Jesse