Missing data after removing instance from cluster

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Envera IT
Posts: 159
Joined: Wed Jun 19, 2013 10:21 am

Post by Envera IT »

So I'm in the middle of a building move, and I added my second NLS node at the second office to the primary NLS node, connected via VPN. I expected a large data transfer during the initial replication, and that completed successfully. However, I expected replications after that to use much less bandwidth, and that doesn't seem to have been the case.
NLSReplication.PNG
When the second spike in bandwidth kicked off I got worried, as I didn't want to impact production, so I had the sysadmin disable the NIC on the secondary NLS instance in VMware. I also deleted the instance from the cluster, as I figured I'd just add it back later once we've moved all the infrastructure to the new building. But now I'm missing data in my graphs even though the indexes are open.
NLSMissingData.PNG
NLSIndexes.PNG
I'm posting to get answers to two questions: how do I get the data prior to 8/29 to show up in my queries/dashboards again, and could someone spell out exactly how two nodes replicate data between each other?

As always, I appreciate any help I can get.
I like graphs...
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Missing data after removing instance from cluster

Post by jolson »

"how do I get the data prior to 8/29 to show up in my queries/dashboards again"
It's likely that at least part of this data exists on the second node that you had stood up. If you are willing to stand up the second node again to see whether or not your data reappears, that would be a worthwhile troubleshooting method.

I would also like to see the output of the following command:

curl 'localhost:9200/_cluster/health?level=indices&pretty'

"spell out exactly how two nodes replicate data between each other."
Nodes replicate data using sharding, which is a logical concept. You can read about how shards are distributed here:
https://www.elastic.co/guide/en/elastic ... tally.html

Essentially, each index is split into shards, and those shards are distributed among all of the instances in your Nagios Log Server cluster. What likely happened is that *some shards* moved to your second server, while some remained on your primary server.
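
One way to see where each shard copy currently lives is Elasticsearch's `_cat/shards` API. The sketch below is only illustrative: it assumes the text was fetched with something like `curl 'localhost:9200/_cat/shards?h=index,shard,prirep,state,node'`, and the sample rows and node names (`node-a`, `node-b`) are made up for the example, not taken from this cluster.

```python
from collections import defaultdict


def parse_cat_shards(text):
    """Parse rows shaped like `index shard prirep state [node]` into dicts."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        index, shard, prirep, state = parts[:4]
        # Unassigned shard copies have no node column at all.
        node = " ".join(parts[4:]) or None
        rows.append({
            "index": index,
            "shard": int(shard),
            "primary": prirep == "p",
            "state": state,
            "node": node,
        })
    return rows


def shards_per_node(rows):
    """Count shard copies per node; copies without a node are UNASSIGNED."""
    counts = defaultdict(int)
    for row in rows:
        counts[row["node"] or "UNASSIGNED"] += 1
    return dict(counts)


# Hypothetical sample input, NOT real output from the cluster in this thread.
SAMPLE = """
logstash-2015.08.28 0 p STARTED node-b
logstash-2015.08.28 0 r UNASSIGNED
logstash-2015.08.29 0 p STARTED node-a
logstash-2015.08.29 0 r STARTED node-b
"""

if __name__ == "__main__":
    print(shards_per_node(parse_cat_shards(SAMPLE)))
```

If an index only ever shows up on the node that was removed, that would be consistent with its data disappearing from the dashboards.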

One last question: What was the latency like between your NLS nodes over the VPN connection? High latency is a very dangerous thing to introduce your instances to, and I'm wondering exactly what the conditions were like.

Thanks a ton!

Jesse
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
Envera IT

Re: Missing data after removing instance from cluster

Post by Envera IT »

[root@nagiosls ~]# curl 'localhost:9200/_cluster/health?level=indices&pretty'
{
  "cluster_name" : "553c1f03-f76e-4910-a868-8c1e078ef969",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 161,
  "active_shards" : 161,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 171,
  "indices" : {
    "logstash-2015.08.10" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "nagioslogserver" : {
      "status" : "yellow",
      "number_of_shards" : 1,
      "number_of_replicas" : 1,
      "active_primary_shards" : 1,
      "active_shards" : 1,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 1
    },
    "logstash-2015.08.12" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.11" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.31" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.14" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.13" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.16" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.15" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.18" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.17" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.19" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.30" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "nagioslogserver_log" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.23" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.22" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.21" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.20" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.09" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.27" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.08" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.26" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.07" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.25" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.24" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.06" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.05" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.04" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.03" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.29" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.08.28" : {
      "status" : "red",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 0,
      "active_shards" : 0,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 10
    },
    "logstash-2015.08.02" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "logstash-2015.09.01" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    },
    "kibana-int" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    }
  }
}
[root@nagiosls ~]#
Latency is on average 23ms over the VPN connection. This is a temporary thing while we're in the middle of the office move. I'll try powering on the secondary node later tonight; I imagine you're right and the data will show up again. Seems like an oversight on my part, but I was hoping the data still existed on the primary node and there was just a path error or something.
jolson

Re: Missing data after removing instance from cluster

Post by jolson »

The only index that I'm seeing with a real problem is the one from 8/28:
    "logstash-2015.08.28" : {
      "status" : "red",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 0,
      "active_shards" : 0,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 10
    },
We'll see how your cluster reacts when the second server comes online. It's possible that the index will recover - it's also possible that it's corrupted. I hope it's the former!
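
The numbers in the health output add up in a way that supports this reading. There are 161 active primary shards, and Elasticsearch will not place a replica on the same node as its primary, so with only one node left each of those primaries has one unassigned replica (161). The 8/28 index adds 10 more (5 primaries plus 5 replicas), giving the 171 unassigned shards reported. A rough sketch of the same check in Python follows; the function name `summarize_health` is just for illustration, the excerpt reuses two entries from the output posted above, and it assumes no shards are initializing or relocating.

```python
def summarize_health(health):
    """Summarize a parsed `_cluster/health?level=indices` response.

    Returns (indices_missing_primaries, missing_primaries, missing_replicas).
    Assumes nothing is initializing or relocating, as in the posted output.
    """
    broken = []
    missing_primaries = 0
    missing_replicas = 0
    for name, idx in sorted(health["indices"].items()):
        lost = idx["number_of_shards"] - idx["active_primary_shards"]
        if lost > 0:
            broken.append(name)  # at least one primary has no live copy
        missing_primaries += lost
        # Whatever is unassigned beyond the lost primaries must be replicas.
        missing_replicas += idx["unassigned_shards"] - lost
    return broken, missing_primaries, missing_replicas


# Two entries copied from the health output earlier in the thread.
HEALTH_EXCERPT = {
    "indices": {
        "logstash-2015.08.28": {
            "status": "red",
            "number_of_shards": 5,
            "active_primary_shards": 0,
            "unassigned_shards": 10,
        },
        "logstash-2015.08.29": {
            "status": "yellow",
            "number_of_shards": 5,
            "active_primary_shards": 5,
            "unassigned_shards": 5,
        },
    }
}

if __name__ == "__main__":
    print(summarize_health(HEALTH_EXCERPT))
```

To run it against a live cluster, the full JSON from the curl command can be parsed with `json.loads` and passed in the same way.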
"Latency is on average 23ms over the VPN connection"
This is an acceptable amount of latency. I would normally recommend that you get your servers in the same datacenter, but given that this setup is temporary I'm sure I don't have to tell you that. ;)

Thanks, looking forward to your results!
Envera IT

Re: Missing data after removing instance from cluster

Post by Envera IT »

How does the replication work? Is it constant or is it on a schedule?
jolson

Re: Missing data after removing instance from cluster

Post by jolson »

Replication takes place when:
1. New logs enter Nagios Log Server
2. A new instance joins a cluster
3. An instance is removed from a cluster

To expand on that:

1. When a new log enters Nagios Log Server, it is assigned to one of the five 'shards' of that day's index. These five shards are distributed among your Nagios Log Server instances, and are moved around dynamically as Elasticsearch deems necessary.

In addition to the 5 'Primary Shards', there are also 5 'Replica Shards' - which are exact duplicates of your Primary Shards. These replicas are distributed among all of the instances in your cluster in such a way that two matching shards will never be on the same instance. All of this moving in the backend happens dynamically and on no sort of schedule.

Example image of a 2 instance cluster:
shardthing.png
2. When a new instance of Nagios Log Server joins your cluster, shards will redistribute in such a way that the data and load are balanced between all of your nodes.

3. When an instance is removed from the cluster, a few things happen:
  • Any Replica Shard without a matching Primary Shard is automatically upgraded to a Primary Shard, and then a new Replica Shard is generated and distributed appropriately.
  • Any Primary Shard without a matching Replica Shard has a new Replica Shard generated and distributed appropriately.
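
The placement rule and the two removal rules above can be sketched as a toy model. This is not Elasticsearch's real allocator: the round-robin placement, the function names `allocate` and `remove_node`, and the node names are all assumptions made for illustration.

```python
def allocate(num_shards, nodes):
    """Place one primary and one replica per shard so the two copies of a
    shard never share a node (simple round-robin, for illustration only)."""
    placement = {}
    for shard in range(num_shards):
        primary = nodes[shard % len(nodes)]
        # With a single node there is nowhere legal to put the replica.
        replica = nodes[(shard + 1) % len(nodes)] if len(nodes) > 1 else None
        placement[shard] = {"primary": primary, "replica": replica}
    return placement


def remove_node(placement, gone):
    """Apply the removal rules to every shard after node `gone` leaves.

    Replacement replicas are left as None (unassigned) here; placing them
    would need another node to be available.
    """
    result = {}
    for shard, copies in placement.items():
        primary, replica = copies["primary"], copies["replica"]
        if primary == gone:
            # The surviving replica is promoted to primary.
            primary, replica = replica, None
        elif replica == gone:
            # The primary survives; its replica must be rebuilt elsewhere.
            replica = None
        result[shard] = {"primary": primary, "replica": replica}
    return result


if __name__ == "__main__":
    before = allocate(5, ["node-a", "node-b"])
    after = remove_node(before, "node-b")
    for shard in sorted(after):
        print(shard, before[shard], "->", after[shard])
```

In this model, a two-instance cluster that loses one instance keeps a primary for every shard but has nowhere to put the replicas, which would match the 'yellow' indices in the health output. A shard is lost only when no copy of it is on a surviving instance.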
Hopefully that answers your question. Thanks!