
Comms between cluster instances appears to be broken

Posted: Wed Mar 04, 2020 8:29 am
by jpconsilio
The instances in one of our two-instance clusters no longer appear to be in sync. On each instance, cluster status is YELLOW, and the number of active and unassigned shards is equal; # of instances = 1; # of data instances = 1.
Likewise, Instance status for each instance shows data for the local instance, and both logstash and elasticsearch are GREEN, but no statistics are displayed for the other instance, and its logstash and elasticsearch health indicators show as RED.
Disk space looks fine for both instances. Log searches yield no hits for a period from 2/24 up to 3/3.
Discovered trouble about sixteen hours ago. CLI confirmed services running on both nodes. Alternately bouncing each node brought clusters back into communication with shards being assigned properly (apparently). Eight hours later I'm back to where I was before.
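For anyone wanting to check the same symptoms from the command line, here is a rough sketch of how they could be read from Elasticsearch's `_cluster/health` endpoint. It assumes Elasticsearch answers on `http://localhost:9200` on each node (the default port; adjust if yours differs). The function names and the `expected_nodes` parameter are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url="http://localhost:9200"):
    """Fetch the cluster health document from a local Elasticsearch node.

    Assumption: the node listens on localhost:9200.
    """
    with urlopen(base_url + "/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def isolation_hints(health, expected_nodes=2):
    """Return human-readable hints that a node may be running on its own.

    `health` is the parsed JSON from _cluster/health; `expected_nodes`
    is a hypothetical parameter for the size the cluster should have.
    """
    hints = []
    status = health.get("status")
    if status != "green":
        hints.append("cluster status is %s" % status)
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        hints.append("only %d of %d nodes visible" % (nodes, expected_nodes))
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        hints.append("%d unassigned shards" % unassigned)
    return hints
```

Running `isolation_hints(fetch_health())` on each node in turn should show whether each one sees only itself, which would match the "# of instances = 1" reading described above.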

Any guidance on troubleshooting this and restoring normal operations would be appreciated.

Re: Comms between cluster instances appears to be broken

Posted: Wed Mar 04, 2020 11:59 am
by jdunitz
Can you check the /usr/local/nagioslogserver/var/cluster_hosts file on both machines and make sure they look correct, i.e., both servers know about both servers? Because you're able to sync sometimes, this should be OK, but it's worth checking.
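If it helps, here is a minimal sketch of that check that can be run on each machine. It assumes the file lists one host or IP per line, which is worth confirming against your own copy. The helper names and the example addresses are placeholders, not anything taken from your systems.

```python
from pathlib import Path

# Path as given above; the one-host-per-line layout is an assumption.
CLUSTER_HOSTS = "/usr/local/nagioslogserver/var/cluster_hosts"


def parse_hosts(text):
    """Return the non-empty lines of a cluster_hosts file, stripped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def missing_hosts(text, expected):
    """Return the expected hosts that do not appear in the file text."""
    present = set(parse_hosts(text))
    return [host for host in expected if host not in present]


def check_file(expected, path=CLUSTER_HOSTS):
    """Read the file at `path` and report which expected hosts are absent."""
    return missing_hosts(Path(path).read_text(), expected)


# Example call with placeholder addresses (substitute your real node IPs):
#   check_file(["192.0.2.10", "192.0.2.11"])
# An empty list means both servers are listed on that machine.
```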

Also, would you be able to post or PM profiles from both systems?

Is it possible that you filled up your disk space at some point?

Let's start with these items, and see what we can figure out.

Thanks!

--Jeffrey

Re: Comms between cluster instances appears to be broken

Posted: Tue Mar 10, 2020 5:19 pm
by jpconsilio
Sure enough, the /usr/local/nagioslogserver/var/cluster_hosts file no longer contained both servers. Adding their IPs back in and restarting seems to have sorted out the communications issues.

We had filled up the disk several months ago. After expanding it seemed to recover without further issue, though.
Thanks for your help!!

Re: Comms between cluster instances appears to be broken

Posted: Wed Mar 11, 2020 8:15 am
by scottwilkerson
jpconsilio wrote: Sure enough, the /usr/local/nagioslogserver/var/cluster_hosts file no longer contained both servers. Adding their IPs back in and restarting seems to have sorted out the communications issues.

We had filled up the disk several months ago. After expanding it seemed to recover without further issue, though.
Thanks for your help!!
Great!

Locking thread