Nagios Log Server removed Nagiosadmin + shardexception error
Hello,
I am experiencing the same problem as in https://support.nagios.com/forum/viewto ... 38&t=31022
yet the resolution there does not fix my problem. Should we re-open that topic or continue with this new one?
Anyway, just to catch you up: the "nagiosadmin" user has disappeared from the system, and I can no longer log into the web interface.
I have followed the instructions to use the reset password script, but that doesn't fix the issue. I have also tried using the curl command to add the user, yet it fails with a shardIt error:
curl -XPUT 'http://localhost:9200/nagioslogserver/user/2' -d '{"username":"someuser","password":"c678bcf3b5138b9263a95c44d28097f22c2e02877193d2c25313478821d45c19","auth_type":"admin","email":"[email protected]","language":"default","apiaccess":"1","apikey":"1396e08757545557073844695e5b64caa0bd3ad3","created":"2015-01-23 10:00:00","created_by":0,"default_dashboard":"/dashboard/elasticsearch/default"}'
ERROR after timeout:
{"error":"UnavailableShardsException[[nagioslogserver][0] [2] shardIt, [0] active : Timeout waiting for [1m], request: index {[nagioslogserver][user][2], source[{\"username\":\"someuser\",\"password\":\"c678bcf3b5138b9263a95c44d28097f22c2e02877193d2c25313478821d45c19\",\"auth_type\":\"admin\",\"email\":\"[email protected]\",\"language\":\"default\",\"apiaccess\":\"1\",\"apikey\":\"1396e08757545557073844695e5b64caa0bd3ad3\",\"created\":\"2015-01-23 10:00:00\",\"created_by\":0,\"default_dashboard\":\"/dashboard/elasticsearch/default\"}]}]","status":503}
Is there a problem with the Elasticsearch index?
I am using a single instance, source install, CentOS 6.6 host.
Please help - this was working great.
Re: Nagios Log Server removed Nagiosadmin + shardexception e
What happened prior to the user disappearing on your system? Did you perform an upgrade or any configuration changes?
Let's take a look at your existing indices:
Code: Select all
curl 'localhost:9200/_cluster/health?level=indices&pretty'
Re: Nagios Log Server removed Nagiosadmin + shardexception e
I don't believe there was any upgrade or config changes when this was first noticed. However, this happened around the same time we ran out of space on the root partition of the server. I shut it down, extended the partition properly, and still could not log in.
I'm attaching the index output from your query to this post
Re: Nagios Log Server removed Nagiosadmin + shardexception e
Many of your indices are red, which means that they either cannot be assigned or are suffering from corruption. One such index is your 'nagioslogserver' index - which is used to preserve user login data.
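For anyone following along, the red indices can be picked out of the cluster health output with a small filter. This is only a sketch: the `red_indices` name is made up for illustration, and it assumes the pretty-printed layout returned by `_cluster/health?level=indices&pretty`, where each index name line is followed by its "status" line.

```shell
#!/bin/sh
# red_indices: hypothetical helper (not part of Nagios Log Server).
# Reads pretty-printed cluster health JSON on stdin and prints the name
# of every index whose "status" is "red".
red_indices() {
  awk -F'"' '
    # A line such as   "nagioslogserver" : {   opens a named object.
    /^[ \t]*"[^"]+"[ \t]*:[ \t]*[{]/ { name = $2; next }
    # A red status line belongs to the most recently opened object.
    # The cluster-level status comes before any named object and is skipped.
    /"status"[ \t]*:[ \t]*"red"/ { if (name != "") print name }
  '
}

# Usage (assumes Elasticsearch listens on localhost:9200, as in this thread):
#   curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | red_indices
```

If 'nagioslogserver' shows up in that list, logins will keep failing until the index is healthy again.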
I would like you to run the following commands to restore the 'nagioslogserver' index to a good working state.
Ensure that you have proper backups in place:
Code: Select all
ls /store/backups/nagioslogserver
If you have no backups, do not continue.
Delete the 'nagioslogserver' index:
Code: Select all
curl -XDELETE "http://localhost:9200/nagioslogserver/"
Change to the backups directory:
Code: Select all
cd /store/backups/nagioslogserver
Untar one of your backups:
Code: Select all
tar zxvf nagioslogserver.2015-xx-xx.xxxxxxxx.tar.gz
Note: xx-xx.xxxxxxxx stands for the date on which the backup was taken.
Within the untarred directory will be several restore files - make note of the 'nagioslogserver.tar.gz' file.
Once you untar the backup, import the new 'nagioslogserver' index using your backup as a target:
Code: Select all
curl -XPOST "http://localhost:9200/nagioslogserver/_import?path=/store/backups/nagioslogserver/nagioslogserver.2015-08-05.1438811717/nagioslogserver.tar.gz"
Note: Replace /store/backups/nagioslogserver/nagioslogserver.2015-08-05.1438811717/nagioslogserver.tar.gz with the full path of your 'nagioslogserver.tar.gz' restore file.
After the backup posts, you should be able to log in.
Re: Nagios Log Server removed Nagiosadmin + shardexception e
I followed your instructions and posted several of the backups that I had, but after each one I still could not log into the system. I even tried ones that weren't in tar.gz form yet, from a time well before things went bad... still not working.
Here is an example of the output that would show up when I would post (took about 10 seconds or so):
[root@logging nagioslogserver]# curl -XPOST "http://localhost:9200/nagioslogserver/_ ... ver.tar.gz"
{"running":true,"state":{"mode":"import","started":"2015-08-06T17:19:23.059Z","path":"file:///store/backups/nagioslogserver/nagioslogserver.2015-07-26.1437928341/nagioslogserver.tar.gz","node_name":"26b83599-b0fc-46ad-b2ee-0e1d769292d1"}
Search for user:
[root@logging nagioslogserver]# curl "http://localhost:9200/nagioslogserver/u ... rch?pretty"
{
"error" : "SearchPhaseExecutionException[Failed to execute phase [query_fetch], all shards failed]",
"status" : 503
Attaching indices output again to this post
Re: Nagios Log Server removed Nagiosadmin + shardexception e
I think that your restoration attempts are getting stuck in the 'running' state:
{"running":true,"state"
You can check the number of running states with the following command:
Code: Select all
curl -XPOST 'http://localhost:9200/_export/state'
If any states are stuck running, you can stop the elasticsearch process to kill them. I recommend stopping elasticsearch if that is the case:
Code: Select all
service elasticsearch stop
Please note that elasticsearch will need to be shut down on all of your instances for the states to be cleared.
After elasticsearch has been fully shut down, start it back up:
Code: Select all
service elasticsearch start
After elasticsearch has started back up, verify that there aren't any hanging states:
Code: Select all
curl -XPOST 'http://localhost:9200/_export/state'
At this point, try the deletion and restoration of your index as per above again. Let me know if that helps. Thanks!
Jesse
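The state check above can be wrapped in a small polling loop. This is a rough sketch under a few assumptions: the function names are hypothetical, the response is assumed to carry a numeric "count" field, and the 5 second interval and default of 12 attempts are arbitrary choices.

```shell
#!/bin/sh
# running_state_count: hypothetical helper. Reads the JSON returned by the
# _export/state call on stdin and prints the value of its "count" field.
running_state_count() {
  sed -n 's/.*"count"[ ]*:[ ]*\([0-9][0-9]*\).*/\1/p'
}

# wait_until_idle: poll the state endpoint until it reports zero running
# states, giving up after $1 attempts (default 12).
# Assumes Elasticsearch on localhost:9200, as in this thread.
wait_until_idle() {
  tries=${1:-12}
  while [ "$tries" -gt 0 ]; do
    count=$(curl -s -XPOST 'http://localhost:9200/_export/state' | running_state_count)
    if [ "$count" = "0" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 5
  done
  return 1
}
```

Running `wait_until_idle` before retrying the delete and restore would confirm that nothing is still hanging; a non-zero exit means states were still reported (or the endpoint could not be reached) after all attempts.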
Re: Nagios Log Server removed Nagiosadmin + shardexception e
From what I can tell - there isn't anything stuck, here is the execution and output:
curl -XPOST 'http://localhost:9200/_export/state'
{"count":0,"states":[]}
I'll tinker with elasticsearch running/not running with the restoration, but I have restarted that process several times while troubleshooting this myself, as well as restarting the VM. This is only a single instance.
I executed the restoration and immediately after did your curl xpost for the state and it still shows up as a count of 0.
Starting to think this is hosed for good.
Re: Nagios Log Server removed Nagiosadmin + shardexception e
Just to make sure - you are removing the current index before you restore it, correct?
Code: Select all
curl -XDELETE "http://localhost:9200/nagioslogserver/"
After you run the above removal command, verify that the index has gone away using the following command:
Code: Select all
curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'
If the above command displays nothing, the index has been removed properly. Verify that the index is completely deleted before attempting to restore it. After you run the restore command, check for the index once again:
Code: Select all
curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' -A10 | grep -v '_log'
The output you're looking for is something like the following:
"nagioslogserver" : {
"status" : "yellow",
"number_of_shards" : 1,
"number_of_replicas" : 1,
"active_primary_shards" : 1,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1
Note that the 'nagioslogserver' index is in a bad state of health right now - this is the index required to log into NLS. Running out of disk space must have corrupted the index, and I'm hoping that through the configuration backups we'll be able to recover it to a good working state.
Re: Nagios Log Server removed Nagiosadmin + shardexception e
Code: Select all
[root@logging ~]# curl -XDELETE "http://localhost:9200/nagioslogserver/"
{"acknowledged":true}
[root@logging ~]# curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'
"nagioslogserver" : {
Re: Nagios Log Server removed Nagiosadmin + shardexception e
Quote:
I have been using the XDELETE to remove the indices, however, this output leads me to believe that command isn't working for me since it is showing up with "nagioslogserver", correct?
The problem is that elasticsearch wants to generate a blank 'nagioslogserver' index if we don't do the delete/restore quickly enough.
I have generated the following script - please give it a run. It should delete your 'nagioslogserver' index and immediately replace it with the backup you have specified.
To run the script, simply place it somewhere on one of your Nagios Log Server nodes, and be sure to set the execute permissions appropriately:
Code: Select all
chmod +x restorenagioslogserver.sh
Code: Select all
ls -l /store/backups/nagioslogserver/nagioslogserver.2015-07-26.1437928341/nagioslogserver.tar.gz
Code: Select all
./restorenagioslogserver.sh
Best,
Jesse
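Since the attached script is not visible in this thread, here is a minimal sketch of the delete-then-import idea described above. It is not the attached script: the function names and the `ES_URL` variable are assumptions made for illustration, and only the DELETE and `_import` calls already shown in this thread are used.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the restorenagioslogserver.sh attachment.
# ES_URL is an assumed variable; it defaults to the address used in this thread.
ES_URL=${ES_URL:-http://localhost:9200}

# import_url: build the _import URL for a given nagioslogserver.tar.gz path.
import_url() {
  printf '%s/nagioslogserver/_import?path=%s\n' "$ES_URL" "$1"
}

# restore_index: delete the index and start the import straight away, so that
# elasticsearch has as little time as possible to recreate a blank index.
restore_index() {
  archive=$1
  # Refuse to delete anything when the backup file is missing.
  if [ ! -f "$archive" ]; then
    echo "backup not found: $archive" >&2
    return 1
  fi
  # The import request is sent as soon as the delete request returns.
  # Note: curl exits 0 on HTTP errors here, so check the printed responses.
  curl -s -XDELETE "$ES_URL/nagioslogserver/" &&
    curl -s -XPOST "$(import_url "$archive")"
}

# Usage:
#   restore_index /store/backups/nagioslogserver/nagioslogserver.2015-07-26.1437928341/nagioslogserver.tar.gz
```

Checking for the archive first keeps the sketch from deleting the index when there is nothing to restore from, in line with the earlier "if you have no backups, do not continue" advice.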