Nagios Log Server removed Nagiosadmin + shardexception error

bostoneng
Posts: 8
Joined: Wed Aug 05, 2015 12:28 pm

Nagios Log Server removed Nagiosadmin + shardexception error

Post by bostoneng »

Hello,

I am experiencing the same problem as in this topic: https://support.nagios.com/forum/viewto ... 38&t=31022

However, the resolution there does not fix my problem. Should we re-open that topic or continue with this new one?

Anyway, just to catch you up: the "nagiosadmin" user has disappeared from the system, and I can no longer log into the web interface.

I followed the instructions for the reset password script, but that did not fix the issue. I also tried using the curl command to add the user, but it fails with a shardIt error:

curl -XPUT 'http://localhost:9200/nagioslogserver/user/2' -d '{"username":"someuser","password":"c678bcf3b5138b9263a95c44d28097f22c2e02877193d2c25313478821d45c19","auth_type":"admin","email":"[email protected]","language":"default","apiaccess":"1","apikey":"1396e08757545557073844695e5b64caa0bd3ad3","created":"2015-01-23 10:00:00","created_by":0,"default_dashboard":"/dashboard/elasticsearch/default"}'

Error after the timeout:
{"error":"UnavailableShardsException[[nagioslogserver][0] [2] shardIt, [0] active : Timeout waiting for [1m], request: index {[nagioslogserver][user][2], source[{\"username\":\"someuser\",\"password\":\"c678bcf3b5138b9263a95c44d28097f22c2e02877193d2c25313478821d45c19\",\"auth_type\":\"admin\",\"email\":\"[email protected]\",\"language\":\"default\",\"apiaccess\":\"1\",\"apikey\":\"1396e08757545557073844695e5b64caa0bd3ad3\",\"created\":\"2015-01-23 10:00:00\",\"created_by\":0,\"default_dashboard\":\"/dashboard/elasticsearch/default\"}]}]","status":503}
[root@logging etc]#

Is there a problem with the Elasticsearch index?
I am using a single instance, source install, CentOS 6.6 host.

Please help - this was working great.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Post by jolson »

What happened prior to the user disappearing on your system? Did you perform an upgrade or any configuration changes?

Let's take a look at your existing indices:

Code: Select all

curl 'localhost:9200/_cluster/health?level=indices&pretty'
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
bostoneng
Posts: 8
Joined: Wed Aug 05, 2015 12:28 pm

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Post by bostoneng »

I don't believe there were any upgrades or config changes when this was first noticed. However, it happened around the same time we ran out of space on the root partition of the server. I shut the server down and extended the partition properly, but still could not log in.

I'm attaching the index output from your query to this post
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Post by jolson »

Many of your indices are red, which means that they either cannot be assigned or are suffering from corruption. One such index is your 'nagioslogserver' index - which is used to preserve user login data.
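To list only the red indices from the health output, something like the following could be used. This is a minimal sketch that assumes the pretty-printed layout returned by `_cluster/health?level=indices&pretty` (one key per line); the helper name `red_indices` is made up for illustration and is not part of Nagios Log Server.

```shell
# Hypothetical helper: print the names of indices whose status is "red".
# Reads the pretty-printed output of
#   curl -s 'localhost:9200/_cluster/health?level=indices&pretty'
# on stdin. Assumes one key per line, as in the "pretty" output.
red_indices() {
  awk '
    # A line like   "nagioslogserver" : {   opens a named object.
    /^[ \t]*"[^"]+"[ \t]*:[ \t]*[{]/ {
      name = $0
      sub(/^[ \t]*"/, "", name)   # drop leading whitespace and quote
      sub(/".*$/, "", name)       # drop everything after the name
      next
    }
    # The first "status" line after it belongs to that object.
    /"status"[ \t]*:/ {
      if ($0 ~ /"red"/ && name != "" && name != "indices") print name
      name = ""
    }
  '
}
```

Usage: `curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | red_indices`. The cluster-wide status line at the top is skipped because no index name has been seen yet at that point.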

I would like you to run the following commands to restore the 'nagioslogserver' index to a good working state.

Ensure that you have proper backups in place:

Code: Select all

ls /store/backups/nagioslogserver
If you have no backups, do not continue.
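The "no backups, do not continue" rule can be made into an explicit guard. The sketch below is an illustration only: `latest_backup` is a hypothetical helper, and it assumes the archives follow the `nagioslogserver.<date>.<number>.tar.gz` naming seen in this thread, so that sorting by name also sorts by date.

```shell
# Hypothetical helper: print the newest backup archive in a directory,
# or return non-zero if there is none. Relies on the date-stamped file
# names sorting chronologically (an assumption about the naming scheme).
latest_backup() {
  dir=${1:-/store/backups/nagioslogserver}
  latest=""
  for f in "$dir"/nagioslogserver.*.tar.gz; do
    [ -e "$f" ] || continue   # an unmatched glob stays literal; skip it
    latest=$f                 # globs expand in sorted order, so last wins
  done
  [ -n "$latest" ] || return 1
  printf '%s\n' "$latest"
}
```

Usage: `latest_backup || echo "no backups found, stopping here"`. A non-zero return is the signal not to go on to the delete step.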

Delete the 'nagioslogserver' index:

Code: Select all

curl -XDELETE "http://localhost:9200/nagioslogserver/"
Change to the backups directory:

Code: Select all

cd /store/backups/nagioslogserver
Untar one of your backups:

Code: Select all

tar zxvf nagioslogserver.2015-xx-xx.xxxxxxxx.tar.gz
Note: xx-xx.xxxxxxxx will be a date during which a backup was taken.
Within the untarred directory will be several restore files - make note of the 'nagioslogserver.tar.gz' file.

Once you untar the backup, import the 'nagioslogserver' index, using the restore file from your backup as the source:

Code: Select all

curl -XPOST "http://localhost:9200/nagioslogserver/_import?path=/store/backups/nagioslogserver/nagioslogserver.2015-08-05.1438811717/nagioslogserver.tar.gz"
Note: Replace /store/backups/nagioslogserver/nagioslogserver.2015-08-05.1438811717/nagioslogserver.tar.gz with the full path of your 'nagioslogserver.tar.gz' restore file.
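Because the import needs the full path of the restore file, it may help to check the path before posting it. The following is a rough sketch: `import_url` is a hypothetical helper, the `_import` endpoint and `path=` parameter are copied from the command above, and the two checks are only an assumption about what is worth verifying.

```shell
# Hypothetical helper: validate a restore file and print the import URL.
# The endpoint and query parameter come from the command in this post;
# nothing here talks to Elasticsearch.
import_url() {
  archive=$1
  case $archive in
    /*) ;;                                        # absolute path: fine
    *) echo "need a full path: $archive" >&2; return 1 ;;
  esac
  if [ ! -f "$archive" ]; then
    echo "no such file: $archive" >&2
    return 1
  fi
  printf 'http://localhost:9200/nagioslogserver/_import?path=%s\n' "$archive"
}
```

Usage: `url=$(import_url /store/backups/nagioslogserver/nagioslogserver.2015-08-05.1438811717/nagioslogserver.tar.gz) && curl -XPOST "$url"`. If the file is missing, nothing is posted.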

After the import completes, you should be able to log in.
bostoneng
Posts: 8
Joined: Wed Aug 05, 2015 12:28 pm

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Post by bostoneng »

I followed your instructions and imported several of the backups I had, but after each one I still could not log into the system. I even tried backups that weren't in tar.gz form yet, from a time well before things went bad... still not working.

Here is an example of the output that would show up when I would post (took about 10 seconds or so):
[root@logging nagioslogserver]# curl -XPOST "http://localhost:9200/nagioslogserver/_ ... ver.tar.gz"
{"running":true,"state":{"mode":"import","started":"2015-08-06T17:19:23.059Z","path":"file:///store/backups/nagioslogserver/nagioslogserver.2015-07-26.1437928341/nagioslogserver.tar.gz","node_name":"26b83599-b0fc-46ad-b2ee-0e1d769292d1"}

Search for user:
[root@logging nagioslogserver]# curl "http://localhost:9200/nagioslogserver/u ... rch?pretty"
{
"error" : "SearchPhaseExecutionException[Failed to execute phase [query_fetch], all shards failed]",
"status" : 503


Attaching indices output again to this post
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Post by jolson »

I think that your restoration attempts are getting stuck in the 'running' state, judging by this part of your output:
{"running":true,"state"
You can check the number of running states with the following command:

Code: Select all

curl -XPOST 'http://localhost:9200/_export/state'
If any states are stuck running, you can stop the elasticsearch process to kill them. I recommend stopping elasticsearch if that is the case:

Code: Select all

service elasticsearch stop
Please note that elasticsearch will need to be shut down on all of your instances for the states to be cleared.

After elasticsearch has been fully shut down, start it back up:

Code: Select all

service elasticsearch start
After elasticsearch has started back up, verify that there aren't any hanging states:

Code: Select all

curl -XPOST 'http://localhost:9200/_export/state'
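If the check needs to be scripted, the number of running states can be pulled out of the response. This is a minimal sketch that assumes the response is a single JSON line carrying a numeric `count` field; `running_state_count` is a hypothetical name.

```shell
# Hypothetical helper: extract the numeric "count" field from the
# _export/state response given on stdin. Prints nothing if no such
# field is present.
running_state_count() {
  sed -n 's/.*"count"[ ]*:[ ]*\([0-9][0-9]*\).*/\1/p'
}
```

Usage: `curl -s -XPOST 'http://localhost:9200/_export/state' | running_state_count`. A result of 0 would mean no states are hanging.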
At this point, try the deletion and restoration of your index as per above again. Let me know if that helps. Thanks!


Jesse
bostoneng
Posts: 8
Joined: Wed Aug 05, 2015 12:28 pm

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Post by bostoneng »

From what I can tell - there isn't anything stuck, here is the execution and output:

curl -XPOST 'http://localhost:9200/_export/state'
{"count":0,"states":[]}

I'll tinker with elasticsearch running/not running during the restoration, but I have already restarted that process several times while troubleshooting this myself, as well as restarting the VM. This is only a single instance.

I executed the restoration and immediately afterwards ran your curl -XPOST for the state, and it still shows a count of 0.

Starting to think this is hosed for good.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Post by jolson »

Just to make sure - you are removing the current index before you restore it, correct?

Code: Select all

curl -XDELETE "http://localhost:9200/nagioslogserver/"
After you run the above removal command, verify that the index has gone away using the following command:

Code: Select all

curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'
If the above command displays nothing, the index has been removed properly. Verify that the index is completely deleted before attempting to restore it. After you run the restore command, check for the index once again:

Code: Select all

curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' -A10 | grep -v '_log'
The output you're looking for is something like the following:
"nagioslogserver" : {
"status" : "yellow",
"number_of_shards" : 1,
"number_of_replicas" : 1,
"active_primary_shards" : 1,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1
Note that the 'nagioslogserver' index is in a bad state of health right now - this is the index required to log into NLS. Running out of disk space must have corrupted the index, and I'm hoping that through the configuration backups we'll be able to recover it to a good working state.
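As an alternative to the grep pipeline, the status of a single index can be read out directly. The sketch below assumes the pretty-printed layout shown above (`"name" : {` followed by a `"status"` line); `index_status` is a hypothetical helper. On a single instance, yellow is the best this index can reach while number_of_replicas is 1, since the replica has no second node to be assigned to.

```shell
# Hypothetical helper: print the status of one index from the
# pretty-printed cluster health output on stdin. Matching on the quoted
# name means "nagioslogserver" does not also match longer index names.
index_status() {
  awk -v idx="$1" '
    index($0, "\"" idx "\" : {") { found = 1; next }   # index header line
    found && /"status"/ {
      split($0, parts, "\"")   # parts[4] holds the value after "status"
      print parts[4]
      exit
    }
  '
}
```

Usage: `curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | index_status nagioslogserver`. Empty output means the index is not listed at all.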
bostoneng
Posts: 8
Joined: Wed Aug 05, 2015 12:28 pm

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Post by bostoneng »

Code: Select all

[root@logging ~]# curl -XDELETE "http://localhost:9200/nagioslogserver/"
{"acknowledged":true}
[root@logging ~]# curl -s 'localhost:9200/_cluster/health?level=indices&pretty' | grep 'nagioslogserver' | grep -v '_log'
    "nagioslogserver" : {

I have been using the XDELETE to remove the indices, however, this output leads me to believe that command isn't working for me since it is showing up with "nagioslogserver", correct? Is there any other way to try to remove this?
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Nagios Log Server removed Nagiosadmin + shardexception e

Post by jolson »

bostoneng wrote: I have been using the XDELETE to remove the indices, however, this output leads me to believe that command isn't working for me since it is showing up with "nagioslogserver", correct?
The problem is that elasticsearch wants to generate a blank 'nagioslogserver' index if we don't do the delete/restore quickly enough.

I have generated the following script - please give it a run. It should delete your 'nagioslogserver' index and immediately replace it with the backup you have specified.

To run the script, simply place it somewhere on one of your Nagios Log Server nodes, and be sure to set the execute permissions appropriately:

Code: Select all

chmod +x restorenagioslogserver.sh
Be sure that your backup file exists:

Code: Select all

ls -l /store/backups/nagioslogserver/nagioslogserver.2015-07-26.1437928341/nagioslogserver.tar.gz
Run the script:

Code: Select all

./restorenagioslogserver.sh
If the script fails the first time and you still cannot log in, run it again. I had to run it twice before it worked for me.
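The attached script is not reproduced in this thread. Purely as an illustration of the delete-then-restore-immediately idea, and NOT the contents of restorenagioslogserver.sh, a sketch could look like the following. The function name and the `CURL` and `ES` variables are hypothetical; the two requests are the ones used earlier in this topic.

```shell
# NOT the attached restorenagioslogserver.sh; a hypothetical sketch of the
# same idea. CURL and ES can be overridden, e.g. to dry-run the sequence.
CURL=${CURL:-curl}
ES=${ES:-http://localhost:9200}

# Delete the index and import the backup straight away, leaving as little
# time as possible for a blank index to be created in between.
delete_and_import() {
  archive=$1
  # Refuse to delete anything unless the restore file is really there.
  [ -f "$archive" ] || { echo "missing backup: $archive" >&2; return 1; }
  $CURL -s -XDELETE "$ES/nagioslogserver/" &&
    $CURL -s -XPOST "$ES/nagioslogserver/_import?path=$archive"
}
```

Usage: `delete_and_import /store/backups/nagioslogserver/nagioslogserver.2015-07-26.1437928341/nagioslogserver.tar.gz`. Note that `curl -s` exits with 0 even when Elasticsearch answers with an error, so the responses still need to be read by eye.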

Best,


Jesse