
4 node cluster problem

Posted: Wed Sep 04, 2019 10:30 am
by SteveBeauchemin
While my back was turned for a few days, one of my systems decided to drop from the cluster.

I have since stopped logstash and elasticsearch, let it settle, and restarted them on the node. It now thinks it is a one node cluster. I have been waiting for the 4 systems to start chatting again but they are not. The remaining 3 nodes are happy and think they are a 3 node cluster.

So, how do I make the 4th node rejoin? It has been offline long enough for the other 3 systems to clear up and become green.

Is this a delete and add thing?

Please advise. (more info below)

Thanks

Steve B

Firewall ports are open

Code:

firewall-cmd --list-all
  ports: 80/tcp 443/tcp 3515/tcp 2056/tcp 2057/tcp 
         9300-9400/tcp 514/tcp 514/udp 1514/tcp 1514/udp 
         5544/tcp 5544/udp 3306/tcp 4444/tcp 4567/tcp 
         4567/udp 4568/tcp 5142/tcp

3-node cluster health status:

Code:

curl http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "79e8bf76-674f-4ecd-8741-27a3587a3f39",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 245,
  "active_shards" : 490,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

1-node cluster health status:

Code:

curl http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "79e8bf76-674f-4ecd-8741-27a3587a3f39",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 130,
  "active_shards" : 130,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 380,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
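
The two health outputs above differ most visibly in "number_of_nodes" (3 on the healthy side, 1 on the isolated node). As a rough sketch, that field can be pulled out of the same curl output so the check is easy to repeat on each machine. The helper names node_count and cluster_has_nodes are made up for illustration and are not part of Nagios Log Server or Elasticsearch.

```shell
# Hypothetical helper: print the value of "number_of_nodes" from
# `_cluster/health?pretty` output supplied on stdin.
node_count() {
  sed -n 's/.*"number_of_nodes"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeed only when the health output on stdin reports the expected node count.
cluster_has_nodes() {
  expected=$1
  [ "$(node_count)" = "$expected" ]
}

# Example, run on any node (left commented out so sourcing has no side effects):
# curl -s 'http://localhost:9200/_cluster/health?pretty' | cluster_has_nodes 4 \
#   || echo "this node does not see all 4 cluster members"
```

Matching on the quoted key keeps "number_of_data_nodes" from being picked up by mistake.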

Re: 4 node cluster problem

Posted: Wed Sep 04, 2019 10:49 am
by cdienger
Run this on the one node cluster:

Code:

curl 'localhost:9200/_cat/shards?pretty'

A bad index could be preventing the node from being added back.

Is there anything in the /var/log/elasticsearch/<cluster_uuid>.log files?

Check that all the machines can communicate with each other over port 9300:

Code:

telnet nls1 9300
telnet nls2 9300
telnet nls3 9300
telnet nls4 9300

Also check /usr/local/nagioslogserver/var/cluster_hosts on all machines. It should contain the IPs or hostnames of every machine expected to be in the cluster. Likewise, /usr/local/nagioslogserver/var/cluster_uuid should contain the same UUID on all machines.
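
Both checks above (port 9300 reachability and matching cluster_hosts / cluster_uuid files) can be scripted. The following is only a sketch: the hostnames nls1 to nls4 and the file paths are the ones used in this thread, while the function names port_open and file_fingerprint are hypothetical. It also assumes that the order of entries in cluster_hosts does not matter, which may not hold.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp redirection, so telnet does not need to be installed.
port_open() {
  host=$1
  port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Print a checksum of a file's non-blank lines after sorting them, so two
# copies with the same entries in a different order get the same value.
# Assumption: entry order in the file is not significant.
file_fingerprint() {
  grep -v '^[[:space:]]*$' "$1" | sort | cksum | cut -d' ' -f1
}

# Example, run on each node (left commented out so sourcing has no side effects):
# for h in nls1 nls2 nls3 nls4; do
#   if port_open "$h" 9300; then echo "$h:9300 reachable"; else echo "$h:9300 NOT reachable"; fi
# done
# file_fingerprint /usr/local/nagioslogserver/var/cluster_hosts
# file_fingerprint /usr/local/nagioslogserver/var/cluster_uuid
```

If the fingerprints printed on one node differ from those on the others, that node's file is the one to look at first.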

Delete and add may do the trick, but it wouldn't be my go-to option: whatever caused this may still be an issue and could prevent adding the server back.

Re: 4 node cluster problem

Posted: Wed Sep 04, 2019 11:13 am
by SteveBeauchemin
I looked at /usr/local/nagioslogserver/var/cluster_hosts first and, sure enough, it was not the same on the one system that had dropped out. I updated it to match the other 3, restarted elasticsearch, and it now says I have a 4 node cluster.
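
For anyone repeating this fix, here is a minimal sketch of the "make the file match the other nodes" step. It assumes a known-good copy of cluster_hosts has already been fetched from a healthy node to a local path (for example with scp), and it keeps a backup of the old file. The function name sync_file and the /tmp path in the example are hypothetical, and whether Log Server rewrites this file on its own is not covered here.

```shell
# Hypothetical helper: overwrite TARGET with REFERENCE only when they differ,
# keeping the previous version as TARGET.bak. Prints "unchanged" or "updated".
# Returns non-zero if the target is missing or a copy fails.
sync_file() {
  reference=$1
  target=$2
  if cmp -s "$reference" "$target"; then
    echo "unchanged"
  else
    cp -p "$target" "$target.bak" && cp "$reference" "$target" && echo "updated"
  fi
}

# Example (left commented out so sourcing has no side effects):
# sync_file /tmp/cluster_hosts.good /usr/local/nagioslogserver/var/cluster_hosts
# Then restart elasticsearch on this node, as described above, and re-check
# the cluster health.
```

Keeping the .bak copy makes it easy to see afterwards which entries were missing or wrong on the node that dropped out.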

This file could have been wrong for a long time. I had seen strange cluster behavior before, where I had to stop and start the services to get the cluster healthy again.

We are starting to use Log Server much more now that the IT teams see what we are providing for them, so it is good to have that file fixed.

Thank you. All is well. And if not, I'll post a new issue.

Steve B