
Missing data after removing instance from cluster

Posted: Sun Aug 30, 2015 9:51 pm
by Envera IT
So I'm in the middle of a building move and I added my second NLS node to the primary NLS node at the second office connected via VPN. I expected a large data transfer to happen in the replication and that completed successfully. However I expected replications after that to use much less bandwidth, but that seems to have not been the case.
NLSReplication.PNG
When the second spike in bandwidth kicked off I got worried, as I didn't want to impact production, so I had the sysadmin disable the NIC on the secondary NLS instance in VMware. I also deleted the instance from the cluster, as I figured I'd just add it back later once we've moved all the infrastructure to the new building. But now I'm missing data in my graphs even though the indexes are open.
NLSMissingData.PNG
NLSIndexes.PNG
I'm posting to get answers to two questions: how do I get the data prior to 8/29 to show up in my queries/dashboards again, and exactly how do two nodes replicate data between each other?

As always, I appreciate any help I can get.

Re: Missing data after removing instance from cluster

Posted: Mon Aug 31, 2015 5:10 pm
by jolson
how do I get the data prior to 8/29 to show up in my queries/dashboards again
It's likely that at least part of this data exists on the second node that you had stood up. If you are willing to stand up the second node again to see whether or not your data reappears, that would be a worthwhile troubleshooting method.

I would also like to see the output of the following command:

Code:

curl 'localhost:9200/_cluster/health?level=indices&pretty'
spell out exactly how two nodes replicate data between each other.
Nodes replicate data at the shard level; a shard is a logical slice of an index. You can read about how shards are distributed here:
https://www.elastic.co/guide/en/elastic ... tally.html

Essentially, each index is split into shards, and those shards are distributed among all of the instances in your Nagios Log Server cluster. What likely happened is that *some shards* moved to your second server, while some remained on your primary server.
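
If you want to see where each shard copy currently lives, Elasticsearch's `_cat/shards` endpoint lists one line per copy. Below is a minimal Python sketch that groups such output by node. It assumes the default column order (index, shard, prirep, state, docs, store, ip, node); the sample text, index name, IPs, and node names are made up for illustration and are not taken from this cluster.

```python
from collections import defaultdict

# Hypothetical sample in the default `_cat/shards` column order:
# index shard prirep state docs store ip node
SAMPLE = """\
logstash-example 0 p STARTED 1200 1.1mb 10.0.0.1 node-a
logstash-example 0 r STARTED 1200 1.1mb 10.0.0.2 node-b
logstash-example 1 p STARTED 1180 1.0mb 10.0.0.2 node-b
logstash-example 1 r UNASSIGNED
"""


def shards_by_node(cat_output):
    """Group shard copies by the node that holds them."""
    placement = defaultdict(list)
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        index, shard, prirep, state = fields[:4]
        # Unassigned copies carry no docs/store/ip/node columns.
        node = " ".join(fields[7:]) if len(fields) >= 8 else "(unassigned)"
        placement[node].append((index, int(shard), prirep, state))
    return dict(placement)


if __name__ == "__main__":
    for node, copies in sorted(shards_by_node(SAMPLE).items()):
        print(node, copies)
```

Any shard whose only started copy sits on a node that is no longer in the cluster would be unavailable until that node returns.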

One last question: What was the latency like between your NLS nodes over the VPN connection? High latency is a very dangerous thing to introduce your instances to, and I'm wondering exactly what the conditions were like.

Thanks a ton!

Jesse

Re: Missing data after removing instance from cluster

Posted: Tue Sep 01, 2015 11:27 am
by Envera IT

Code:

[root@nagiosls ~]# curl 'localhost:9200/_cluster/health?level=indices&pretty'
{
  "cluster_name" : "553c1f03-f76e-4910-a868-8c1e078ef969",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 161,
  "active_shards" : 161,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 171,
  "indices" : {
    "logstash-2015.08.10" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "nagioslogserver" : {
      "status" : "yellow",
      "number_of_shards" : 1,
      "number_of_replicas" : 1,
      "active_primary_shards" : 1,
      "active_shards" : 1,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 1
    },
    "logstash-2015.08.12" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.11" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.31" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.14" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.13" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.16" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.15" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.18" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.17" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.19" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.30" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "nagioslogserver_log" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.23" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.22" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.21" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.20" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.09" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.27" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.08" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.26" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.07" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.25" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.24" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.06" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.05" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.04" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.03" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.29" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.28" : {
      "status" : "red",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 0,
      "active_shards" : 0,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 10
    },
    "logstash-2015.08.02" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.09.01" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "kibana-int" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    }
  }
}
[root@nagiosls ~]#
Latency is on average 23ms over the VPN connection. This is a temporary thing while we're in the middle of the office move. I'll try powering on the secondary node later tonight; I imagine you're right and the data will show up again. Seems like an oversight on my part, but I was hoping the data still existed on the primary node and there was just a path error or something.

Re: Missing data after removing instance from cluster

Posted: Tue Sep 01, 2015 12:50 pm
by jolson
The only index that I'm seeing with a real problem is the one from 8/28:
    "logstash-2015.08.28" : {
      "status" : "red",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 0,
      "active_shards" : 0,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 10
    },
We'll see how your cluster reacts when the second server comes online. It's possible that the index will recover - it's also possible that it's corrupted. I hope it's the former!
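
For anyone reading the health output by hand, here is a rough Python sketch of how the per-index status follows from the shard counts: red when a primary is missing, yellow when all primaries are active but some replicas are not, green otherwise. The three entries are copied from the output posted earlier in this thread (trimmed to the fields used); the function names are my own and not part of any Nagios Log Server tooling.

```python
# Three entries taken from the posted _cluster/health output, trimmed.
HEALTH = {
    "indices": {
        "logstash-2015.08.28": {
            "status": "red", "number_of_shards": 5, "number_of_replicas": 1,
            "active_primary_shards": 0, "active_shards": 0,
        },
        "logstash-2015.08.29": {
            "status": "yellow", "number_of_shards": 5, "number_of_replicas": 1,
            "active_primary_shards": 5, "active_shards": 5,
        },
        "nagioslogserver": {
            "status": "yellow", "number_of_shards": 1, "number_of_replicas": 1,
            "active_primary_shards": 1, "active_shards": 1,
        },
    }
}


def expected_status(ix):
    """Derive red/yellow/green from the shard counts of one index."""
    total_copies = ix["number_of_shards"] * (1 + ix["number_of_replicas"])
    if ix["active_primary_shards"] < ix["number_of_shards"]:
        return "red"      # at least one primary is missing: data unavailable
    if ix["active_shards"] < total_copies:
        return "yellow"   # all primaries active, some replicas unassigned
    return "green"


def indices_with_status(health, wanted):
    """Return the sorted names of indices reporting the given status."""
    return sorted(
        name for name, ix in health["indices"].items() if ix["status"] == wanted
    )


if __name__ == "__main__":
    print(indices_with_status(HEALTH, "red"))
```

Yellow on a single-instance cluster is expected, since there is no second instance to hold the replicas; only red means primary data is currently unavailable.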
Latency is on average 23ms over the VPN connection
This is an acceptable amount of latency. I would normally recommend that you get your servers in the same datacenter, but given that this setup is temporary I'm sure I don't have to tell you that. ;)

Thanks, looking forward to your results!

Re: Missing data after removing instance from cluster

Posted: Tue Sep 01, 2015 1:03 pm
by Envera IT
How does the replication work? Is it constant or is it on a schedule?

Re: Missing data after removing instance from cluster

Posted: Tue Sep 01, 2015 1:18 pm
by jolson
Replication takes place when:
1. New logs enter Nagios Log Server
2. A new instance joins a cluster
3. An instance is removed from a cluster

To expand on that:

1. When a new log enters Nagios Log Server, it is assigned to one of the five daily 'shards'. These five shards are distributed among your Nagios Log Server instances, and are moved around dynamically as Elasticsearch deems necessary.

In addition to the 5 'Primary Shards', there are also 5 'Replica Shards' - which are exact duplicates of your Primary Shards. These replicas are distributed among all of the instances in your cluster in such a way that two matching shards will never be on the same instance. All of this moving in the backend happens dynamically and on no sort of schedule.

Example image of a 2 instance cluster:
shardthing.png
2. When a new instance of Nagios Log Server joins your cluster, shards will redistribute in such a way that the data and load are balanced between all of your nodes.

3. When an instance is removed from the cluster, a few things happen:
  • Any Replica Shard without a matching Primary Shard is automatically upgraded to a Primary Shard, and then a new Replica Shard is generated and distributed appropriately.
  • Any Primary Shard without a matching Replica Shard has a new Replica Shard generated and distributed appropriately.
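
To make the rules above concrete, here is a toy Python simulation. It is only a sketch under simplifying assumptions: the round-robin placement and the function names are mine for illustration, and real Elasticsearch allocation weighs many more factors. It keeps the one constraint described above (a primary and its replica never share an instance) and applies the two removal rules.

```python
def allocate(num_shards, nodes):
    """Place one primary and one replica per shard, never on the same node.

    Round-robin placement is an illustrative simplification.
    """
    layout = {}
    for shard in range(num_shards):
        primary = nodes[shard % len(nodes)]
        # With a single node there is nowhere legal to put the replica.
        replica = nodes[(shard + 1) % len(nodes)] if len(nodes) > 1 else None
        layout[shard] = {"primary": primary, "replica": replica}
    return layout


def remove_node(layout, gone, remaining):
    """Apply the two removal rules after node `gone` leaves the cluster."""
    result = {}
    for shard, copies in layout.items():
        primary, replica = copies["primary"], copies["replica"]
        if primary == gone:
            # The orphaned replica is promoted (None if no replica existed).
            primary, replica = replica, None
        elif replica == gone:
            replica = None
        # A new replica needs a node other than the one holding the primary.
        if primary is not None and replica is None:
            candidates = [n for n in remaining if n != primary]
            replica = candidates[0] if candidates else None
        result[shard] = {"primary": primary, "replica": replica}
    return result


if __name__ == "__main__":
    two_nodes = allocate(5, ["A", "B"])
    print(remove_node(two_nodes, "B", ["A"]))
```

In the two-instance case, removing one instance leaves every primary on the survivor with its replica unassigned, which is the yellow state. A shard ends up with no primary (red) only if it had no synced copy on a surviving instance when the other instance went away.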
Hopefully that answers your question. Thanks!