red cluster health on working cluster

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Locked
benhank
Posts: 1264
Joined: Tue Apr 12, 2011 12:29 pm

red cluster health on working cluster

Post by benhank »

I have 2 instances. One is the official VM. The second is a manual install on a clean machine. Both were installed on CentOS 6 (clean, no updates to the OS) with the previous version, then upgraded to the latest version of NLS.
My cluster seems to be working. By that I mean our environment is set up to send logs to server P01, and the data is being replicated to T01, which is working. However, when I look at the cluster health, it's red.
I have run:

Code: Select all

service httpd stop
service logstash stop
service elasticsearch restart
service logstash start
service httpd start
I then went into

Code: Select all

/var/log/elasticsearch
and here is the logfile from my secondary machine:
c9dc126e-346d-4bfa-a30e-14b849c50ab5.log
and the primary:
c9dc126e-346d-4bfa-a30e-14b849c50ab5.log
Thanks in advance!
Proudly running:
NagiosXI 5.4.12 2 node Prod Env 2500 hosts, 13,000 services
Nagiosxi 5.5.7(test env) 2500 hosts, 13,000 services
Nagios Logserver 2 node Prod Env 500 objects sending
Nagios Network Analyser
Nagios Fusion
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: red cluster health on working cluster

Post by jolson »

This is normally an indication of failing indices. Let's take a look at your shard health. The output isn't pretty, but it gives us a good idea of what's going on:

Code: Select all

curl -XGET 'http://localhost:9200/_cluster/health/*?level=shards'
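If the raw output is hard to read, a small script can pick out the shards that are not green. This is only a sketch: it assumes the response has the usual `indices`, `shards`, `status` nesting that `level=shards` returns, and the sample data in it is made up for illustration, not taken from this cluster.

```python
import json

def find_unhealthy(health):
    """Return (index, shard, status) tuples for every shard that is not green."""
    problems = []
    for index_name, index in sorted(health.get("indices", {}).items()):
        for shard_id, shard in sorted(index.get("shards", {}).items()):
            status = shard.get("status", "unknown")
            if status != "green":
                problems.append((index_name, shard_id, status))
    return problems

# Made-up, trimmed example of a level=shards response (not real output).
sample = json.loads("""
{
  "status": "red",
  "indices": {
    "logstash-2015.01.01": {
      "status": "green",
      "shards": {"0": {"status": "green", "primary_active": true}}
    },
    "logstash-2015.01.02": {
      "status": "red",
      "shards": {
        "0": {"status": "green", "primary_active": true},
        "1": {"status": "red", "primary_active": false}
      }
    }
  }
}
""")

for index_name, shard_id, status in find_unhealthy(sample):
    print("%s shard %s is %s" % (index_name, shard_id, status))
```

To run it against a live node, save the curl output to a file and load that with `json.load` in place of the sample.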
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.

Re: red cluster health on working cluster

Post by benhank »

here we go:
Document.rtf

Re: red cluster health on working cluster

Post by jolson »

It looks like you have a bad index by the name of logstash-2015.05.05. The course of action here will be to remove that index and see if your cluster health recovers. Keep in mind that all log data from that day will be lost - you can always restore if you have a backup present.

Code: Select all

curl -XDELETE 'http://localhost:9200/logstash-2015.05.05/'
How is your cluster health after that removal?

Re: red cluster health on working cluster

Post by benhank »

Lean and green, my man, thanks! Any clue as to how that happened?

Better question:
I only deleted that on one machine. Shouldn't the other one detect that it was deleted and rebuild it?

Re: red cluster health on working cluster

Post by jolson »

I only deleted that on one machine. Shouldn't the other one detect that it was deleted and rebuild it?
Nope - the delete removed the whole index, which contains both the primary and the replica shards, so all of the data is now gone, unfortunately. The delete index API that we used acts on the whole cluster, not just the node you ran it on.

There are many reasons why an index might become corrupt. Some of the more common ones are disk space filling up or shards being unable to initialize properly (for whatever reason). Typically bad shards indicate data loss, so somehow data was likely lost on the server - the culprit is hard to pin down. I would get a backup schedule in place so that you have a way to recover your information if this were to happen again. Protect those bits! :)
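As a rough illustration of what a scheduled backup could look like at the Elasticsearch level, here is a sketch that registers a shared-filesystem snapshot repository and takes a dated snapshot. The repository name `nls_backup` and the path `/mnt/es_backup` are hypothetical placeholders; the path has to be reachable from every node, and depending on the Elasticsearch version it may also need to be allowed in `elasticsearch.yml`. Nagios Log Server's own backup settings may well be the better route, so treat this as a picture of the underlying API rather than the recommended procedure.

```python
import datetime
import json
import urllib.request

ES_URL = "http://localhost:9200"  # same address used elsewhere in this thread
REPO = "nls_backup"               # hypothetical repository name
REPO_PATH = "/mnt/es_backup"      # hypothetical path, shared by all nodes

def snapshot_url(day, base=ES_URL, repo=REPO):
    """Build the URL for a snapshot named after the given date."""
    name = "snapshot-" + day.strftime("%Y.%m.%d")
    return "%s/_snapshot/%s/%s?wait_for_completion=true" % (base, repo, name)

def put(url, body=None):
    """Send a PUT request, optionally with a JSON body, and decode the reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method="PUT")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def take_snapshot(day):
    """Register the fs repository (safe to repeat), then snapshot all indices."""
    put("%s/_snapshot/%s" % (ES_URL, REPO),
        {"type": "fs", "settings": {"location": REPO_PATH}})
    return put(snapshot_url(day))
```

Calling `take_snapshot(datetime.date.today())` from a daily cron job would be one way to get a schedule going; nothing in the sketch touches the cluster until that function is called.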

Re: red cluster health on working cluster

Post by benhank »

Thanks for the help and info! All set, my man!

Re: red cluster health on working cluster

Post by jolson »

Glad I could help - I'll close this out. Please feel free to open an additional thread if you have further questions or issues. Thanks!