Log Server in Red and out of sync

OptimusB · Post by **OptimusB** » Tue Mar 17, 2015 10:40 am

We ran out of space on our log server and on the repository. We noticed it a couple of days later, and once the space issue was sorted, one of the nodes started to grow in size again even though I've temporarily stopped the logstash service. Looking at the Instance status, the storage on the nodes indicates that something is wrong with the sync between the nodes. Below are some screenshots. I also noticed the same log entries repeating in the elasticsearch logs. Any ideas how to get this back to a green state?

Code: Select all

[WARN ][cluster.action.shard     ] [16fcc224-849a-405f-bfaf-8321387b7294] [logstash-2015.03.13][3] received shard failed for [logstash-2015.03.13][3], node[M6qlZK3JSqKraQsHgQXXCw], [P], s[INITIALIZING], indexUUID [WNCQBWB2T021lNSG-DTWiQ], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[logstash-2015.03.13][3] failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[logstash-2015.03.13][3] shard allocated for local recovery (post api), should exist, but doesn't, current files: ...

(I edited the hostnames in the screenshots)

instance.jpg

cluster.JPG

The unassigned shards count was over 80 yesterday. I've left it overnight and it is showing 4 now.....

jolson · Post by **jolson** » Tue Mar 17, 2015 11:07 am

From: http://www.elastic.co/guide/en/elastics ... ealth.html

On the shard level, a red status indicates that the specific shard is not allocated in the cluster, yellow means that the primary shard is allocated but replicas are not, and green means that all shards are allocated. The index level status is controlled by the worst shard status. The cluster status is controlled by the worst index status.

Since this is a cluster health alert, it is being triggered by the worst index - which in turn is triggered by the worst shard. We need to figure out which shard(s) are causing this issue.
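The roll-up rule quoted above can be sketched in a few lines of Python. This is only an illustration of the "worst status wins" logic; the names `SEVERITY` and `worst_status` are made up for this example and are not part of Elasticsearch.

```python
# Illustrative sketch only: these names are hypothetical, not an Elasticsearch API.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def worst_status(statuses):
    """Return the most severe of a collection of 'green'/'yellow'/'red' values."""
    return max(statuses, key=SEVERITY.__getitem__, default="green")

# Shard statuses roll up to the index status...
index_status = worst_status(["green", "red", "red", "red", "red"])
# ...and index statuses roll up to the cluster status.
cluster_status = worst_status(["green", "yellow", index_status])
print(index_status, cluster_status)
```

A single red shard is therefore enough to turn the whole cluster red, which is why the next step is to find the individual shards.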

You've provided some log output, which is a great starting point. To see which shards are in the 'red' health status, let's run the following command:

Code: Select all

curl -XGET 'http://localhost:9200/_cluster/health/*?level=shards'

The results are going to be sloppy, but you will be able to find any shard with a status of 'red' and identify where they reside. Please post the results of the bad shards here, and we'll come up with a plan on how to get your cluster green again.
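If the raw output is too hard to read, a small script can pick out the red shards. The sketch below assumes the response has a top-level "indices" object whose entries each carry a "shards" object keyed by shard number; the function name `find_red_shards` and the sample data are hypothetical and only meant to show the idea.

```python
import json

def find_red_shards(health):
    """Return sorted (index, shard_id) pairs whose shard status is 'red'.

    `health` is assumed to be the parsed JSON of a shard-level
    cluster health response.
    """
    red = []
    for index_name, index_info in health.get("indices", {}).items():
        for shard_id, shard_info in index_info.get("shards", {}).items():
            if shard_info.get("status") == "red":
                red.append((index_name, int(shard_id)))
    return sorted(red)

# Hypothetical, trimmed-down sample shaped like the health response.
sample = json.loads("""
{
  "status": "red",
  "indices": {
    "logstash-2015.03.12": {
      "status": "green",
      "shards": {"0": {"status": "green"}}
    },
    "logstash-2015.03.13": {
      "status": "red",
      "shards": {
        "0": {"status": "green"},
        "1": {"status": "red"},
        "2": {"status": "red"}
      }
    }
  }
}
""")

for index_name, shard_id in find_red_shards(sample):
    print(index_name, shard_id)
```

To use it on a live cluster, the output of the curl command could be saved to a file and loaded with `json.load` instead of the inline sample.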

OptimusB · Post by **OptimusB** » Tue Mar 17, 2015 11:49 am

Think I found the "red", please see below.

Code: Select all

"logstash-2015.03.13":{"status":"red","number_of_shards":5,"number_of_replicas":1,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":4,"unassigned_shards":4,"shards":{"0":{"status":"green","primary_active":true,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0},"1":{"status":"red","primary_active":false,"active_shards":0,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":1},"2":{"status":"red","primary_active":false,"active_shards":0,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":1},"3":{"status":"red","primary_active":false,"active_shards":0,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":1},"4":{"status":"red","primary_active":false,"active_shards":0,"relocating_shards":0,"initializing_shards":1,"unassigned_shards":1}}},

Thanks!

FYI - The available storage on node 1 is steadily going down.... not sure what it is doing right now.

(If it makes it easier, I think we can lose the 2015.03.13 index if that's the cause)

jolson · Post by **jolson** » Tue Mar 17, 2015 12:31 pm

According to: http://www.elastic.co/guide/en/elastics ... ealth.html

initializing_shards is a count of shards that are being freshly created. For example, when you first create an index, the shards will all briefly reside in initializing state. This is typically a transient event, and shards shouldn’t linger in initializing too long. You may also see initializing shards when a node is first restarted: as shards are loaded from disk, they start as initializing.

It looks like some of your shards are still in the initialization phase, and I am unsure whether or not they will finish.

If you are fine with losing the information in 'logstash-2015.03.13', I recommend dropping the index, as it appears to contain the shards in question. I recommend backing up the index first, of course, and then dropping it.

Back up the index:

Code: Select all

curator snapshot --most-recent 1 --prefix logstash-2015.03.13 --repository REPOSITORY_NAME

Removing the index:

Code: Select all

curl -XDELETE 'http://localhost:9200/logstash-2015.03.13'

After the removal, let me know how your health displays. Thank you!

We could also try assigning the shards to a node, but I would want to do more research about the implications of that first.

Best,

Jesse

OptimusB · Post by **OptimusB** » Tue Mar 17, 2015 1:03 pm

Thanks. I think it was definitely stuck. Although it would be nice to know how to handle this error in the future, since we are able to close and remove the index, I will attempt this and report back regarding the health status.

jolson · Post by **jolson** » Tue Mar 17, 2015 1:06 pm

If you have a good backup of the index, you could also restore it once the deletion is complete, so that no information is lost.

I recommend monitoring your disk usage with a monitoring product (I hear Nagios XI is pretty great). The disk should never be allowed to fill up or issues like this will happen. We are fortunate here in that it doesn't look like it did much damage.
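As a rough illustration of the kind of check meant here, the sketch below classifies disk usage against warning and critical thresholds, using the conventional Nagios plugin status codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The function names, the 80%/90% thresholds, and the path "/" are assumptions for the example; this is not the check that ships with Nagios XI.

```python
import shutil

# Conventional Nagios plugin status codes.
OK, WARNING, CRITICAL = 0, 1, 2

def classify_usage(used_pct, warn=80.0, crit=90.0):
    """Map a used-space percentage to a status code (thresholds are assumptions)."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK

def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status_code, message) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    code = classify_usage(used_pct, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[code]
    return code, f"DISK {label} - {path} is {used_pct:.1f}% used"

code, message = check_disk("/")
print(message)
```

A real plugin would also exit with the status code so the monitoring system can raise the alert. On a log server the path would be wherever the Elasticsearch data and the backup repository live.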

OptimusB · Post by **OptimusB** » Tue Mar 17, 2015 8:38 pm

I was not able to get a good backup running. It basically got stuck. I closed the index and the status went green. However, looking at the instance status, there's a big discrepancy in storage. These should match?

storage.jpg

green.JPG

jolson · Post by **jolson** » Wed Mar 18, 2015 9:14 am

I am glad to hear that the health is back to green.

Is it possible that the shards from logstash-2015.03.13 were 16GB in size? The shards were unassigned from their node, which could explain the space gap. I expect the space to balance out over time. Please let us know right away if it does not.

Is there any chance that you have a working backup of logstash-2015.03.13 that you could try restoring as mentioned above?

OptimusB · Post by **OptimusB** » Wed Mar 18, 2015 10:00 am

Thanks. I checked this morning and it looks like it is balancing out a bit (39.5GB vs 30.3GB). Not 100% yet, but much better than yesterday. I lowered our retention and backup settings to 10 days for this cluster, and will keep an eye on it.

Unfortunately we don't have a good backup of 03.13, but I am OK with losing those logs. Like you mentioned, this really should've been monitored, which we can do in our Nagios XI implementation (it is a great product). However, since this was still semi-prod, I hadn't enabled the alarms on it yet. Lesson learned.

Your help and direction are much appreciated.

jolson · Post by **jolson** » Wed Mar 18, 2015 10:04 am

Not a problem

I'm glad we got this taken care of. I'll lock this thread - if you have any further questions feel free to open a new one. Thanks!

Nagios Support Forum
