4 node cluster problem

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Locked
SteveBeauchemin
Posts: 524
Joined: Mon Oct 14, 2013 7:19 pm

4 node cluster problem

Post by SteveBeauchemin »

While my back was turned for a few days, one of my systems decided to drop from the cluster.

I have since stopped logstash and elasticsearch, let it settle, and restarted them on the node. It now thinks it is a one node cluster. I have been waiting for the 4 systems to start chatting again but they are not. The remaining 3 nodes are happy and think they are a 3 node cluster.

So, how do I make the 4th node rejoin? It has been offline long enough for the other 3 systems to clear up and become green.

Is this a delete and add thing?

Please advise. (more info below)

Thanks

Steve B

Firewall ports are open

Code: Select all

firewall-cmd --list-all
  ports: 80/tcp 443/tcp 3515/tcp 2056/tcp 2057/tcp 
         9300-9400/tcp 514/tcp 514/udp 1514/tcp 1514/udp 
         5544/tcp 5544/udp 3306/tcp 4444/tcp 4567/tcp 
         4567/udp 4568/tcp 5142/tcp
3 node cluster health status

Code: Select all

curl 'http://localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "79e8bf76-674f-4ecd-8741-27a3587a3f39",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 245,
  "active_shards" : 490,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
1 node cluster health status

Code: Select all

curl 'http://localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "79e8bf76-674f-4ecd-8741-27a3587a3f39",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 130,
  "active_shards" : 130,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 380,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
XI 5.7.3 / Core 4.4.6 / NagVis 1.9.8 / LiveStatus 1.5.0p11 / RRDCached 1.7.0 / Redis 3.2.8 /
SNMPTT / Gearman 0.33-7 / Mod_Gearman 3.0.7 / NLS 2.0.8 / NNA 2.3.1 /
NSClient 0.5.0 / NRPE Solaris 3.2.1 Linux 3.2.1 HPUX 3.2.1
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: 4 node cluster problem

Post by cdienger »

Run this on the one node cluster:

Code: Select all

curl 'localhost:9200/_cat/shards?pretty'
A bad index could be preventing the node from being added back.

Is there anything in the /var/log/elasticsearch/<cluster_uuid>.log files?

Check that all the machines can communicate with each other over port 9300:

Code: Select all

telnet nls1 9300
telnet nls2 9300
telnet nls3 9300
telnet nls4 9300
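If telnet is not installed on a node, the same check can be sketched with bash's /dev/tcp redirection. This is only a rough sketch, not a Nagios tool: check_port is a hypothetical helper name, the 3 second timeout is an arbitrary choice, and it assumes bash (built with /dev/tcp support) and the coreutils timeout command are available.

```shell
#!/usr/bin/env bash
# Sketch: test whether hosts accept TCP connections on the Elasticsearch
# transport port. check_port is a hypothetical helper, not part of any product.

check_port() {
    local host=$1 port=$2
    # Opening /dev/tcp/HOST/PORT makes bash attempt a TCP connection;
    # timeout bounds the wait when packets are silently dropped.
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${host}:${port} open"
    else
        echo "${host}:${port} unreachable"
        return 1
    fi
}

# Check every host named on the command line against port 9300.
for host in "$@"; do
    check_port "$host" 9300
done
```

Saved as, say, check_9300.sh (a made-up file name), it could be run on each node as `bash check_9300.sh nls1 nls2 nls3 nls4`, since the connection has to work in every direction.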
Also check /usr/local/nagioslogserver/var/cluster_hosts on all machines. It should contain the IPs or hostnames of every machine expected to be in the cluster. In addition, /usr/local/nagioslogserver/var/cluster_uuid should contain the same uuid on all machines.
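One way to compare those files is to copy each node's version to a single machine and compare them there. The following is a minimal sketch under a few assumptions: same_entries is a hypothetical helper name, the copies have already been fetched (for example with scp), and the order of entries in the file is assumed not to matter, so the lines are sorted before comparing.

```shell
#!/usr/bin/env bash
# Sketch: report whether copies of a file gathered from several nodes
# list the same entries. same_entries is a hypothetical helper name.

same_entries() {
    local first="" sum file
    for file in "$@"; do
        # Drop blank lines and sort, so that ordering differences
        # are not reported as a mismatch.
        sum=$(grep -v '^[[:space:]]*$' "$file" | sort -u | cksum)
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            echo "MISMATCH: $file differs from $1"
            return 1
        fi
    done
    echo "OK: all $# files list the same entries"
}
```

For example, after copying cluster_hosts from each node to made-up local names such as nls1.hosts through nls4.hosts, `same_entries nls1.hosts nls2.hosts nls3.hosts nls4.hosts` prints either the OK line or the first file that differs. The same function can be pointed at copies of cluster_uuid.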

Delete and add may do the trick, but it wouldn't be my go-to option, since whatever caused this may still be an issue and could prevent the server from being added back.
SteveBeauchemin
Posts: 524
Joined: Mon Oct 14, 2013 7:19 pm

Re: 4 node cluster problem

Post by SteveBeauchemin »

I looked at /usr/local/nagioslogserver/var/cluster_hosts first and sure enough it was not the same on the one system. I updated it to match the other 3, restarted elasticsearch, and it now says I have a 4 node cluster.
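For anyone repeating this fix, the node count can be pulled out of the health output rather than read by eye. This is a rough sketch only: node_count is a hypothetical helper name, and it uses sed so that nothing beyond the standard tools is assumed to be installed.

```shell
#!/usr/bin/env bash
# Sketch: extract number_of_nodes from _cluster/health JSON on stdin.
# node_count is a hypothetical helper name.

node_count() {
    # Print only the integer that follows the "number_of_nodes" key.
    # The quoted key does not match "number_of_data_nodes".
    sed -n 's/.*"number_of_nodes"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Usage (run on any node):
#   curl -s 'http://localhost:9200/_cluster/health?pretty' | node_count
```

Running it on each node and seeing the same number everywhere is a quick sign that all nodes agree on the cluster membership.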

The file could have been wrong for a long time. I had seen strange cluster behavior before, where I had to stop and start services to get the cluster healthy again.

We are starting to use Log Server much more now that the IT teams see what we are providing for them, so it is good to have that file fixed.

Thank you. All is well. And if not, I'll post a new issue.

Steve B