Nagios Support Forum • Nagios Log Server removed Nagiosadmin + shardexception error

Nagios Log Server removed Nagiosadmin + shardexception error

Posted: Wed Aug 05, 2015 3:42 pm

by bostoneng

Hello,

I am experiencing the same problem as in https://support.nagios.com/forum/viewto ... 38&t=31022

However, the resolution there does not fix my problem. Should we re-open that topic or continue with this new one?

Anyway just to catch you up, the "nagiosadmin" user has disappeared from the system. I can no longer log into the web interface.

I have followed the instructions to use the reset password script, but that doesn't fix the issue. I have also tried using the curl command to add the user, but it fails with a shardIt error:

curl -XPUT 'http://localhost:9200/nagioslogserver/user/2' -d '{"username":"someuser","password":"c678bcf3b5138b9263a95c44d28097f22c2e02877193d2c25313478821d45c19","auth_type":"admin","email":"[email protected]","language":"default","apiaccess":"1","apikey":"1396e08757545557073844695e5b64caa0bd3ad3","created":"2015-01-23 10:00:00","created_by":0,"default_dashboard":"/dashboard/elasticsearch/default"}'

ERROR after timeout:
{"error":"UnavailableShardsException[[nagioslogserver][0] [2] shardIt, [0] active : Timeout waiting for [1m], request: index {[nagioslogserver][user][2], source[{\"username\":\"someuser\",\"password\":\"c678bcf3b5138b9263a95c44d28097f22c2e02877193d2c25313478821d45c19\",\"auth_type\":\"admin\",\"email\":\"[email protected]\",\"language\":\"default\",\"apiaccess\":\"1\",\"apikey\":\"1396e08757545557073844695e5b64caa0bd3ad3\",\"created\":\"2015-01-23 10:00:00\",\"created_by\":0,\"default_dashboard\":\"/dashboard/elasticsearch/default\"}]}]","status":503}[root@logging etc]#

Is there a problem with the Elasticsearch index?
I am using a single instance, source install, CentOS 6.6 host.

Please help - this was working great.

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Posted: Wed Aug 05, 2015 3:48 pm

by jolson

What happened prior to the user disappearing on your system? Did you perform an upgrade or any configuration changes?

Let's take a look at your existing indices:

Code:

curl 'localhost:9200/_cluster/health?level=indices&pretty'
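
If the full listing is long, a small filter can narrow it down to the indices that are not green. This is only a sketch: the function name non_green_indices is made up for illustration, and it assumes the pretty-printed layout of this request, where each index has a "name" : { line followed by its "status" line.

```shell
# Hypothetical helper: print "<index> <status>" for every index that is not green.
# Reads the pretty-printed output of _cluster/health?level=indices on stdin.
non_green_indices() {
  awk -F'"' '
    # A line such as    "nagioslogserver" : {   opens an object: remember its key.
    $3 ~ /^ : [{]/ { name = $2; next }
    # A status line inside a named object: report it unless it is green.
    $2 == "status" && name != "" && $4 != "green" { print name, $4 }
  '
}

# Usage (needs a reachable Elasticsearch node):
#   curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | non_green_indices
```

The cluster-wide "status" line at the top of the output is skipped because no index name has been seen yet at that point.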

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Posted: Thu Aug 06, 2015 8:40 am

by bostoneng

I don't believe there was any upgrade or config changes when this was first noticed. However, this happened around the same time we ran out of space on the root partition of the server. I shut it down, extended the partition properly, and still could not log in.

I'm attaching the index output from your query to this post.

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Posted: Thu Aug 06, 2015 9:32 am

by jolson

Many of your indices are red, which means that they either cannot be assigned or are suffering from corruption. One such index is your 'nagioslogserver' index - which is used to preserve user login data.

I would like you to run the following commands to restore the 'nagioslogserver' index to a good working state.

Ensure that you have proper backups in place:

Code:

ls /store/backups/nagioslogserver

If you have no backups, do not continue.

Delete the 'nagioslogserver' index:

Code:

curl -XDELETE "http://localhost:9200/nagioslogserver/"

Change to the backups directory:

Code:

cd /store/backups/nagioslogserver

Untar one of your backups:

Code:

tar zxvf nagioslogserver.2015-xx-xx.xxxxxxxx.tar.gz

Note: 2015-xx-xx.xxxxxxxx stands for the date and Unix timestamp at which the backup was taken.
Within the untarred directory there will be several restore files. Make note of the 'nagioslogserver.tar.gz' file.

Once you have untarred the backup, import the 'nagioslogserver' index using your backup as the source:

Code:

curl -XPOST "http://localhost:9200/nagioslogserver/_import?path=/store/backups/nagioslogserver/nagioslogserver.2015-08-05.1438811717/nagioslogserver.tar.gz"

Note: Replace /store/backups/nagioslogserver/nagioslogserver.2015-08-05.1438811717/nagioslogserver.tar.gz with the full path of your 'nagioslogserver.tar.gz' restore file.
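
To avoid typing the path by hand, the newest untarred backup can be looked up first. The helper below is a sketch, not part of Nagios Log Server: the name latest_backup_tarball is made up, and it assumes the directory naming shown above (nagioslogserver.<date>.<timestamp>), which sorts chronologically by name.

```shell
# Hypothetical helper: print the path of the newest nagioslogserver.tar.gz restore file.
# $1: backups directory (defaults to the path used in this thread).
latest_backup_tarball() {
  backup_dir=${1:-/store/backups/nagioslogserver}
  # Directory names embed the date and a timestamp, so sorting by name
  # puts the most recent backup last.
  for f in "$backup_dir"/nagioslogserver.*/nagioslogserver.tar.gz; do
    [ -f "$f" ] && printf '%s\n' "$f"
  done | sort | tail -n 1
}

# Usage:
#   backup=$(latest_backup_tarball)
#   curl -XPOST "http://localhost:9200/nagioslogserver/_import?path=$backup"
```

The function prints nothing when no untarred backup is found, so check that the result is non-empty before posting it.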

After the import completes, you should be able to log in.

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Posted: Thu Aug 06, 2015 12:59 pm

by bostoneng

I followed your instructions and posted several of the backups that I had, but I still could not log into the system after any of them. I even tried backups that weren't in tar.gz form yet, from well before things went bad... still not working.

Here is an example of the output I got when posting a backup (it took about 10 seconds or so):
[root@logging nagioslogserver]# curl -XPOST "http://localhost:9200/nagioslogserver/_ ... ver.tar.gz"
{"running":true,"state":{"mode":"import","started":"2015-08-06T17:19:23.059Z","path":"file:///store/backups/nagioslogserver/nagioslogserver.2015-07-26.1437928341/nagioslogserver.tar.gz","node_name":"26b83599-b0fc-46ad-b2ee-0e1d769292d1"}

Search for user:
[root@logging nagioslogserver]# curl "http://localhost:9200/nagioslogserver/u ... rch?pretty"
{
"error" : "SearchPhaseExecutionException[Failed to execute phase [query_fetch], all shards failed]",
"status" : 503

Attaching the indices output again to this post.

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Posted: Thu Aug 06, 2015 1:31 pm

by jolson

I think that your restoration attempts are getting stuck in the 'running' state.

{"running":true,"state"

You can check the number of running states with the following command:

Code:

curl -XPOST 'http://localhost:9200/_export/state'

If any states are stuck running, you can stop the elasticsearch process to kill them. I recommend stopping elasticsearch if that is the case:

Code:

service elasticsearch stop

Please note that elasticsearch will need to be shut down on all of your instances for the states to be cleared.

After elasticsearch has been fully shut down, start it back up:

Code:

service elasticsearch start

After elasticsearch has started back up, verify that there aren't any hanging states:

Code:

curl -XPOST 'http://localhost:9200/_export/state'
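
If you want to check this from a script, the count can be pulled out of the response. The helper below is a rough sketch: the name running_state_count is made up, and it assumes the compact response form {"count":N,"states":[...]} that appears elsewhere in this thread.

```shell
# Hypothetical helper: print the value of the "count" field from the JSON on stdin.
# Assumes a single-line response of the form {"count":N,"states":[...]}.
running_state_count() {
  sed -n 's/^.*"count":\([0-9][0-9]*\).*$/\1/p'
}

# Usage (needs a reachable Elasticsearch node):
#   curl -s -XPOST 'http://localhost:9200/_export/state' | running_state_count
```

A result of 0 means that no export or import state is reported as running.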

At this point, try the deletion and restoration of your index as per above again. Let me know if that helps. Thanks!

Jesse

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Posted: Thu Aug 06, 2015 3:52 pm

by bostoneng

From what I can tell, nothing is stuck. Here is the command and its output:

curl -XPOST 'http://localhost:9200/_export/state'
{"count":0,"states":[]}

I'll tinker with elasticsearch running/not running during the restoration, but I have already restarted that process several times while troubleshooting this myself, and I have restarted the VM as well. This is only a single instance.

I executed the restoration and immediately afterwards ran your curl -XPOST command for the state, and it still shows a count of 0.

Starting to think this is hosed for good.

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Posted: Thu Aug 06, 2015 4:31 pm

by jolson

Just to make sure - you are removing the current index before you restore it, correct?

Code:

curl -XDELETE "http://localhost:9200/nagioslogserver/"

After you run the above removal command, verify that the index has gone away using the following command:

Code:

curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'

If the above command displays nothing, the index has been removed properly. Verify that the index is completely deleted before attempting to restore it. After you run the restore command, check for the index once again:

Code:

curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep -A10 'nagioslogserver' | grep -v '_log'

The output you're looking for is something like the following:

    "nagioslogserver" : {
      "status" : "yellow",
      "number_of_shards" : 1,
      "number_of_replicas" : 1,
      "active_primary_shards" : 1,
      "active_shards" : 1,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 1

Note that the 'nagioslogserver' index is in a bad state of health right now - this is the index required to log into NLS. Running out of disk space must have corrupted the index, and I'm hoping that through the configuration backups we'll be able to recover it to a good working state.
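
To read the status of just that one index without scanning the listing by eye, something like the following could be used. It is a sketch under assumptions: index_status is a made-up name, and the parsing relies on the pretty-printed layout shown above. It matches the index name exactly, so a similarly named index containing '_log' is not picked up by mistake.

```shell
# Hypothetical helper: print the status (green, yellow or red) of one index.
# $1: exact index name. Reads pretty-printed _cluster/health?level=indices output on stdin.
index_status() {
  awk -F'"' -v idx="$1" '
    # Remember the key of the object that was opened most recently.
    $3 ~ /^ : [{]/ { current = $2; next }
    # Print the first status line that belongs to the requested index.
    $2 == "status" && current == idx && !seen { print $4; seen = 1 }
  '
}

# Usage (needs a reachable Elasticsearch node):
#   curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | index_status nagioslogserver
```

On a single instance with one replica configured, yellow is the best status to expect, because the replica shard has no second node to be assigned to.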

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Posted: Fri Aug 07, 2015 7:29 am

by bostoneng

Code:

[root@logging ~]# curl -XDELETE "http://localhost:9200/nagioslogserver/"
{"acknowledged":true}
[root@logging ~]# curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'
    "nagioslogserver" : {

I have been using the XDELETE to remove the indices; however, this output leads me to believe that the command isn't working for me, since "nagioslogserver" still shows up, correct? Is there any other way to try to remove this?

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Posted: Fri Aug 07, 2015 9:57 am

by jolson

I have been using the XDELETE to remove the indices; however, this output leads me to believe that the command isn't working for me, since "nagioslogserver" still shows up, correct?

The problem is that elasticsearch wants to generate a blank 'nagioslogserver' index if we don't do the delete/restore quickly enough.

I have generated the following script - please give it a run. It should delete your 'nagioslogserver' index and immediately replace it with the backup you have specified.

To run the script, simply place it somewhere on one of your Nagios Log Server nodes, and be sure to set the execute permissions appropriately:

Code:

chmod +x restorenagioslogserver.sh

Be sure that your backup file exists:

Code:

ls -l /store/backups/nagioslogserver/nagioslogserver.2015-07-26.1437928341/nagioslogserver.tar.gz

Run the script:

Code:

./restorenagioslogserver.sh

If the script fails the first time and you still cannot log in, run it again. I had to run it twice before it worked for me.
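
The script itself is not reproduced in this text. As a rough idea of the approach only, a delete-then-import script could look like the sketch below. It is assembled from the two curl commands given earlier in this thread and is not the original file; the function name restore_nagioslogserver and the ES_URL variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the script from the forum post.
# ES_URL is an assumed variable; the default matches the commands in this thread.
ES_URL=${ES_URL:-http://localhost:9200}

# Delete the 'nagioslogserver' index and immediately import it from a backup.
# $1: full path to a nagioslogserver.tar.gz restore file.
restore_nagioslogserver() {
  backup=$1
  if [ ! -f "$backup" ]; then
    echo "Backup file not found: $backup" >&2
    return 1
  fi
  # Run both requests back to back so that Elasticsearch has as little time
  # as possible to recreate a blank index in between.
  curl -s -XDELETE "$ES_URL/nagioslogserver/" &&
    curl -s -XPOST "$ES_URL/nagioslogserver/_import?path=$backup"
}

# Example call (path taken from this thread):
#   restore_nagioslogserver /store/backups/nagioslogserver/nagioslogserver.2015-07-26.1437928341/nagioslogserver.tar.gz
```

Chaining the two requests with && means the import is only attempted when the delete request itself succeeded.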

Best,

Jesse