Log Server in Red and out of sync

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Locked
OptimusB
Posts: 146
Joined: Mon Oct 27, 2014 10:08 pm
Location: Canada
Contact:

Log Server in Red and out of sync

Post by OptimusB »

We ran out of space on our log server and on the repository. It was noticed a couple of days later, and once the space issue was sorted out, one of the nodes started to grow in size again even though I had temporarily stopped the logstash service. When I reviewed the Instance status, the storage figures on the nodes indicated that something is wrong with the sync between the nodes. Below are some screenshots. I also noticed a repeating log entry in the elasticsearch logs. Any ideas how to get this back to a green state?

Code: Select all

[WARN ][cluster.action.shard     ] [16fcc224-849a-405f-bfaf-8321387b7294] [logstash-2015.03.13][3] received shard failed for [logstash-2015.03.13][3], node[M6qlZK3JSqKraQsHgQXXCw], [P], s[INITIALIZING], indexUUID [WNCQBWB2T021lNSG-DTWiQ], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[logstash-2015.03.13][3] failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[logstash-2015.03.13][3] shard allocated for local recovery (post api), should exist, but doesn't, current files: ...
(I edited the hostnames in the screenshots)
instance.jpg
cluster.JPG
The unassigned shards count was over 80 yesterday. I left it overnight and it is showing 4 now.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Log Server in Red and out of sync

Post by jolson »

From: http://www.elastic.co/guide/en/elastics ... ealth.html
On the shard level, a red status indicates that the specific shard is not allocated in the cluster, yellow means that the primary shard is allocated but replicas are not, and green means that all shards are allocated. The index level status is controlled by the worst shard status. The cluster status is controlled by the worst index status.
Since this is a cluster health alert, it is being triggered by the worst index - which in turn is triggered by the worst shard. We need to figure out which shard(s) are causing this issue.
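The roll-up rule quoted above (index status is the worst shard status, cluster status is the worst index status) can be sketched in a few lines of Python. This is only an illustration of the rule, not Elasticsearch's own code; the function names and the choice to treat an empty input as green are assumptions made for the example.

```python
# Severity order implied by the quoted documentation: green < yellow < red.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def worst_status(statuses):
    """Return the most severe status in an iterable.

    Returning 'green' for an empty iterable is an assumption of this sketch.
    """
    return max(statuses, key=SEVERITY.__getitem__, default="green")


def index_status(shard_statuses):
    """An index is as healthy as its worst shard."""
    return worst_status(shard_statuses)


def cluster_status(indices):
    """A cluster is as healthy as its worst index.

    `indices` maps an index name to a list of its shard statuses.
    """
    return worst_status(index_status(shards) for shards in indices.values())
```

A single red shard in any index is therefore enough to turn the whole cluster red, which is why the next step is to locate the individual shards.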

You've provided some log output, which is a great starting point. To see which shards are in the 'red' health status, let's run the following command:

Code: Select all

curl -XGET 'http://localhost:9200/_cluster/health/*?level=shards'
The output is unformatted JSON and hard to read, but you will be able to find any shards with a status of 'red' and identify which index they belong to. Please post the results for the bad shards here, and we'll come up with a plan to get your cluster green again.
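If picking through the raw output by eye is painful, a small script can do the filtering. The following is a minimal sketch, assuming the response has already been parsed into a Python dict (for example with `json.loads`) and that it has the usual shape of a shard-level health response: a top-level `indices` object whose entries each carry a `shards` object keyed by shard number. The function name `find_red_shards` is made up for this example.

```python
def find_red_shards(health):
    """Return (index_name, shard_id) pairs whose shard-level status is 'red'.

    `health` is the parsed JSON body of a cluster health request made
    with level=shards.
    """
    red = []
    for index_name, index in sorted(health.get("indices", {}).items()):
        shards = index.get("shards", {})
        # Shard ids arrive as strings ("0", "1", ...); sort them numerically.
        for shard_id in sorted(shards, key=int):
            if shards[shard_id].get("status") == "red":
                red.append((index_name, int(shard_id)))
    return red
```

Printing the returned pairs gives a short list of the indices and shard numbers that need attention, instead of one long line of JSON.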
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
OptimusB
Posts: 146
Joined: Mon Oct 27, 2014 10:08 pm
Location: Canada
Contact:

Re: Log Server in Red and out of sync

Post by OptimusB »

Think I found the "red", please see below.

Code: Select all

"logstash-2015.03.13":{"status":"red","number_of_shards":5,"number_of_replicas":1,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":4,"unassigned_shards":4,"shards":{"0":{"status":"green","primary_active":true,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0},"1":{"status":"red","primary_active":false,"active_shards":0,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":1},"2":{"status":"red","primary_active":false,"active_shards":0,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":1},"3":{"status":"red","primary_active":false,"active_shards":0,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":1},"4":{"status":"red","primary_active":false,"active_shards":0,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":1}}},
Thanks!

FYI - The available storage on node 1 is steadily going down.... not sure what it is doing right now.

(If it makes it easier, I think we can lose the 2015.03.13 index if that's the cause)
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Log Server in Red and out of sync

Post by jolson »

According to: http://www.elastic.co/guide/en/elastics ... ealth.html
initializing_shards is a count of shards that are being freshly created. For example, when you first create an index, the shards will all briefly reside in initializing state. This is typically a transient event, and shards shouldn’t linger in initializing too long. You may also see initializing shards when a node is first restarted: as shards are loaded from disk, they start as initializing.
It looks like some of your shards are still in the initialization phase, and I am unsure whether or not they will finish.

If you are fine with losing the information in 'logstash-2015.03.13', I recommend dropping the index, as it appears to contain the shards in question. Back up the index first, of course, and then drop it.
Back up the index:

Code: Select all

curator snapshot --most-recent 1 --prefix logstash-2015.03.13 --repository REPOSITORY_NAME
Remove the index:

Code: Select all

curl -XDELETE 'http://localhost:9200/logstash-2015.03.13'
After the removal, let me know how your health displays. Thank you!

We could also try assigning the shards to a node, but I would want to do more research about the implications of that first.

Best,


Jesse
OptimusB
Posts: 146
Joined: Mon Oct 27, 2014 10:08 pm
Location: Canada
Contact:

Re: Log Server in Red and out of sync

Post by OptimusB »

Thanks. I think it was definitely stuck. Although it would be nice to know how to handle this error in the future, since we are able to close and remove the index, I will attempt this and report back regarding the health status.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Log Server in Red and out of sync

Post by jolson »

If you have a good backup of the index, you could also restore it once the deletion completes, with no loss of information.

I recommend monitoring your disk usage with a monitoring product. (I hear Nagios XI is pretty great :D) The disk should never be allowed to fill up, or issues like this will happen - we are fortunate here in that it doesn't look like it did much damage.
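As a rough illustration of the kind of check meant here, the sketch below reads the usage of a filesystem and maps it to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The 80% and 90% thresholds, the default path, and the function names are assumptions for the example; in practice an existing disk check plugin would normally be used rather than a hand-written one.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_usage(used_percent, warn=80.0, crit=90.0):
    """Map a used-space percentage to a Nagios state (thresholds are assumed)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (state, used_percent) for the filesystem containing `path`.

    The default path is a placeholder; point it at the volume that holds
    the Elasticsearch data.
    """
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    return classify_usage(used_percent, warn, crit), used_percent
```

Alerting well before 100% leaves time to trim old indices or add storage before shards start failing to recover.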
OptimusB
Posts: 146
Joined: Mon Oct 27, 2014 10:08 pm
Location: Canada
Contact:

Re: Log Server in Red and out of sync

Post by OptimusB »

I was not able to get a good backup running. It basically got stuck. I closed the index and the status went green. However, looking at the instance status, there's a big discrepancy in storage between the nodes. Shouldn't these match?
storage.jpg
green.JPG
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Log Server in Red and out of sync

Post by jolson »

I am glad to hear that the health is back to green.

Is it possible that the shards from logstash-2015.03.13 were 16GB in size? The shards were unassigned from their node, which could explain the space gap. I expect the space to balance out over time - please let us know right away if it does not.

Is there any chance that you have a working backup of logstash-2015.03.13 that you could try restoring as mentioned above?
OptimusB
Posts: 146
Joined: Mon Oct 27, 2014 10:08 pm
Location: Canada
Contact:

Re: Log Server in Red and out of sync

Post by OptimusB »

Thanks. I checked this morning and it looks like it is balancing out a bit (39.5GB vs 30.3GB). Not 100% yet, but much better than yesterday. I lowered our retention and backup settings to 10 days for this cluster, and will keep an eye on it.
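To make the effect of a 10-day retention concrete, here is a small sketch that picks out the daily indices that would fall outside the window. It assumes the `logstash-YYYY.MM.DD` naming seen in this thread and treats anything strictly older than the cutoff date as expired; it is not how Nagios Log Server itself implements retention, and the function name is made up for the example.

```python
from datetime import datetime, timedelta


def indices_past_retention(index_names, retention_days, today):
    """Return the daily logstash indices older than the retention window.

    `today` is a datetime.date. Names that do not match the
    logstash-YYYY.MM.DD pattern are ignored.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        try:
            day = datetime.strptime(name, "logstash-%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily logstash index; leave it alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)
```

With a shorter window, fewer daily indices are kept on disk, which is the main lever for keeping storage growth in check on a small cluster.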

Unfortunately we don't have a good backup of 03.13, but I am OK with losing those logs. Like you mentioned, this really should've been monitored, and we do monitor with our Nagios XI implementation ( :D it is a great product). However, since this cluster is still semi-prod, I hadn't enabled the alarms on it yet. Lesson learned.

Your help and direction are much appreciated.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Log Server in Red and out of sync

Post by jolson »

Not a problem :D I'm glad we got this taken care of. I'll lock this thread - if you have any further questions feel free to open a new one. Thanks!