Cluster 2nd Node OFF

teirekos · Post by **teirekos** » Thu Mar 05, 2015 2:55 am

Node A

[root@NagiosLogServer elasticsearch]# curl -XGET 'http://127.0.0.1:9200/?pretty'
{
  "status" : 200,
  "name" : "1048634e-2f8f-4ec5-9432-edba342d51dd",
  "version" : {
    "number" : "1.3.2",
    "build_hash" : "dee175dbe2f254f3f26992f5d7591939aaefd12f",
    "build_timestamp" : "2014-08-13T14:29:30Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

Node B

[root@NagiosLogServer2 logstash]# curl -XGET 'http://127.0.0.1:9200/?pretty'
{
  "status" : 200,
  "name" : "845bc07c-ed91-4920-8e23-747c9cc699f5",
  "version" : {
    "number" : "1.3.2",
    "build_hash" : "dee175dbe2f254f3f26992f5d7591939aaefd12f",
    "build_timestamp" : "2014-08-13T14:29:30Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

cmerchant · Post by **cmerchant** » Thu Mar 05, 2015 4:43 pm

Are we seeing the results of these queries with the 2nd node connected?

Node A

[root@NagiosLogServer elasticsearch]# curl -XGET 'http://127.0.0.1:9200/?pretty'
{
  "status" : 200,
  "name" : "1048634e-2f8f-4ec5-9432-edba342d51dd",
  "version" : {
    "number" : "1.3.2",
    "build_hash" : "dee175dbe2f254f3f26992f5d7591939aaefd12f",
    "build_timestamp" : "2014-08-13T14:29:30Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

Node B

[root@NagiosLogServer2 logstash]# curl -XGET 'http://127.0.0.1:9200/?pretty'
{
  "status" : 200,
  "name" : "845bc07c-ed91-4920-8e23-747c9cc699f5",
  "version" : {
    "number" : "1.3.2",
    "build_hash" : "dee175dbe2f254f3f26992f5d7591939aaefd12f",
    "build_timestamp" : "2014-08-13T14:29:30Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

Also, have you modified the permissions for logstash to allow access to privileged port 514?

jolson · Post by **jolson** » Thu Mar 05, 2015 4:44 pm

Would you please collect some Elasticsearch logs for us? Run the following on both nodes:

tar czfv elasticsearchlogs.tgz /var/log/elasticsearch/

Please upload the resulting files.

Also, do you know the time period that the disconnect may have happened during?
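When the archive step above is run on more than one node, the resulting files share the same name. A minimal sketch of one way to keep them apart is to put the short hostname in the file name. The `archive_name` helper and its optional argument are hypothetical, not something provided by Nagios Log Server.

```shell
# Build an archive name that includes the node's short hostname, so files
# collected on different nodes do not collide when uploaded together.
# $1 is an optional hostname override (hypothetical parameter); when it is
# omitted, the machine's own short hostname is used.
archive_name() {
    host="${1:-$(hostname -s)}"
    echo "elasticsearchlogs_${host}.tgz"
}

# Usage on each node, with the same log directory as in the post:
# tar czfv "$(archive_name)" /var/log/elasticsearch/
```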

teirekos · Post by **teirekos** » Fri Mar 06, 2015 4:25 am

elasticsearchlogs_1.tgz from my 1st NodeA
elasticsearchlogs_2.tgz from my 2nd NodeB

The last time I rebooted both nodes, the cluster "broke" after a few hours, i.e. the Cluster Status went Yellow with an unassigned shard, and in the Instance Status the other node shows "!".

scottwilkerson · Post by **scottwilkerson** » Fri Mar 06, 2015 12:18 pm

teirekos,

Let's make the following change to your Elasticsearch configuration in /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml

On each instance, change this:

# discovery.zen.minimum_master_nodes: 1

To this:

discovery.zen.minimum_master_nodes: 2

Then let's restart Elasticsearch on each instance:

service elasticsearch restart
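For context, the usual rule of thumb for `discovery.zen.minimum_master_nodes` is a majority of the master-eligible nodes, i.e. (N / 2) + 1 with integer division. The sketch below only does that arithmetic, assuming both nodes in this cluster are master-eligible; `quorum` is a hypothetical helper name.

```shell
# Majority of master-eligible nodes: floor(N / 2) + 1.
# $1 is the number of master-eligible nodes in the cluster.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

quorum 2   # two-node cluster, as in this thread; prints 2

# After the restart, each instance should list both nodes:
# curl -XGET 'http://127.0.0.1:9200/_cat/nodes?v'
```

With a value of 2 on a two-node cluster, neither node can elect a master by itself, which is the intended trade-off for avoiding two independent masters.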

teirekos · Post by **teirekos** » Mon Mar 09, 2015 9:34 am

I did exactly what you instructed. It is OK for now (but this is always the case after a restart).
We'll have to wait and see... I'll send feedback.
Thanks a lot.

teirekos · Post by **teirekos** » Mon Mar 09, 2015 11:14 am

Same problem a few hours after the restart. I attach the latest elasticsearch logs...

jolson · Post by **jolson** » Mon Mar 09, 2015 2:54 pm

I cannot see anything in the logs that points to an obvious error. Would it be alright if you turned the logging level up and reproduced the issue once more?

vi /usr/local/nagioslogserver/elasticsearch/config/logging.yml

Change "es.logger.level: INFO" to "es.logger.level: DEBUG". Once changed, restart both nodes.
After the nodes have disconnected again, upload your log files using the same method as before.
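If editing the file by hand on both nodes is inconvenient, the same change can be sketched as a `sed` one-liner. This assumes the `es.logger.level:` key starts at the beginning of its line in logging.yml; `set_es_log_level` is a hypothetical helper, and it leaves a `.bak` copy of the original file next to it.

```shell
# Rewrite the es.logger.level line in a logging.yml file.
# $1: path to logging.yml, $2: new level (for example DEBUG or INFO).
# A backup of the original is kept as <file>.bak.
set_es_log_level() {
    file="$1"
    level="$2"
    sed -i.bak "s/^es\.logger\.level:.*/es.logger.level: ${level}/" "$file"
}

# Usage, with the path from the post:
# set_es_log_level /usr/local/nagioslogserver/elasticsearch/config/logging.yml DEBUG
```

The same helper can switch the level back to INFO once the logs have been collected.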

Also, if you could run the following command when you notice high CPU usage, it could be helpful:

curl -XGET localhost:9200/_nodes/hot_threads
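Since the high CPU usage may come and go, it can help to save each dump to a timestamped file so it can be attached later. The following is only a sketch: `capture_hot_threads` and the `HOT_THREADS_CMD` override are hypothetical names, and by default the function runs the same request as above (with `-s` added to hide curl's progress meter).

```shell
# Save one hot_threads dump to a timestamped file and print the file name.
# $1: output directory (defaults to the current directory).
# HOT_THREADS_CMD is a hypothetical override hook for the command to run;
# when unset, the curl request from the post is used.
capture_hot_threads() {
    outdir="${1:-.}"
    out="${outdir}/hot_threads_$(date -u +%Y%m%dT%H%M%SZ).txt"
    ${HOT_THREADS_CMD:-curl -s -XGET localhost:9200/_nodes/hot_threads} > "$out"
    echo "$out"
}

# Usage when CPU usage is high:
# capture_hot_threads /tmp
```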

teirekos · Post by **teirekos** » Tue Mar 10, 2015 10:04 am

I've changed the log level to DEBUG and rebooted the servers. For now the cluster seems to be OK (we'll have to wait, though).
I was expecting a large amount of logs at the debug level, but this is not the case! Also, in the logstash log I get the "not part of the cluster" WARN. (I attach the open logs from both nodes.)
Another strange thing: after 2 reboots, the logstash process on node A didn't start, so I had to start it manually.
Since the cluster was down, I had unassigned shards. After the reboot the shards were "synchronized", but 1 shard is still unassigned, so the Cluster Health status is still yellow.

cmerchant · Post by **cmerchant** » Tue Mar 10, 2015 5:03 pm

I'm noticing that the timestamps on Node A and Node B differ when the one node disconnects.

Can you confirm that you have the same clock settings between the nodes?
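One quick way to compare the clocks is to print the time as seconds since the epoch on each node at roughly the same moment and look at the difference. This is a minimal sketch; `clock_skew` is a hypothetical helper and the sample timestamps are made up for illustration.

```shell
# Absolute difference, in seconds, between two epoch timestamps.
# $1 and $2 are the values printed by `date -u +%s` on each node.
clock_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# On each node, run:
# date -u +%s
# Then compare the two values (illustrative numbers):
clock_skew 1425999780 1425999795   # prints 15
```

Running `date` by hand on two machines adds a second or two of error, so only a larger difference is meaningful here.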

Nagios Support Forum
