Cluster 2nd Node OFF

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
teirekos
Posts: 110
Joined: Wed Nov 26, 2014 6:06 am

Re: Cluster 2nd Node OFF

Post by teirekos »

Indeed there was a problem with the time on the second node. We have fixed it now and rebooted both nodes again. The time is correct now! I'll wait and watch the cluster's behavior...
cmerchant
Posts: 546
Joined: Wed Sep 24, 2014 11:19 am

Re: Cluster 2nd Node OFF

Post by cmerchant »

Hope that clears the issue, keep us updated. Thanks.
teirekos
Posts: 110
Joined: Wed Nov 26, 2014 6:06 am

Re: Cluster 2nd Node OFF

Post by teirekos »

The problem persists. I have attached the elasticsearch logs in DEBUG mode.
cmerchant
Posts: 546
Joined: Wed Sep 24, 2014 11:19 am

Re: Cluster 2nd Node OFF

Post by cmerchant »

In the most recent logs you sent, the latest timestamp in elasticsearchlogs_A is 3/10/2015 08:18 PM and in elasticsearchlogs_B is 3/11/2015 02:17 AM. Either I am looking at the same log entries as before, or the time difference is still the same +06:00?

Can you issue the following command on both Nagios Log Servers:

Code: Select all

date
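If the two nodes disagree again, a minimal sketch for diagnosing and correcting it (the ntpd/ntpdate remediation line is an assumption, based on the CentOS-style hosts Log Server typically runs on, and is commented out so nothing changes by accident):

```shell
# Compare local and UTC time on each node; a fixed offset such as +06:00
# between nodes usually means a wrong timezone rather than clock drift.
date
date -u
# Assumed remediation if the clocks truly differ:
# service ntpd stop && ntpdate pool.ntp.org && service ntpd start
```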
teirekos
Posts: 110
Joined: Wed Nov 26, 2014 6:06 am

Re: Cluster 2nd Node OFF

Post by teirekos »

Code: Select all

[root@NagiosLogServer /]# date
Tue Mar 17 10:00:33 EET 2015

Code: Select all

[root@NagiosLogServer2 /]# date
Tue Mar 17 10:00:33 EET 2015
Also, no elasticsearch logs have been generated since 03/11/15.
On my 1st node only, I've put INFO logging back and restarted, but there are still no logs.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Cluster 2nd Node OFF

Post by jolson »

On both of your nodes, please run the following commands.
See master:

Code: Select all

curl 'localhost:9200/_cat/master?v'
See nodes:

Code: Select all

curl 'localhost:9200/_cat/nodes?v'
Pending tasks:

Code: Select all

curl 'localhost:9200/_cat/pending_tasks?v'
See recovery:

Code: Select all

curl -XGET 'localhost:9200/_cat/recovery?v'
Please post the results back to us. As for your logs - I would check on your elasticsearch configuration file and ensure that everything looks proper:

Code: Select all

grep LOG_DIR /etc/sysconfig/elasticsearch

Code: Select all

cat /usr/local/nagioslogserver/elasticsearch/config/logging.yml
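If those two files look sane but logs still are not appearing, a small follow-up sketch to find where elasticsearch is actually writing (the fallback path is an assumption, not confirmed for this install):

```shell
# Pull LOG_DIR out of the sysconfig file, falling back to a guessed default,
# then list the newest files there to see whether anything is being written.
LOG_DIR=$(grep '^LOG_DIR' /etc/sysconfig/elasticsearch 2>/dev/null | cut -d= -f2)
ls -lt "${LOG_DIR:-/var/log/elasticsearch}" 2>/dev/null | head
```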
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
teirekos
Posts: 110
Joined: Wed Nov 26, 2014 6:06 am

Re: Cluster 2nd Node OFF

Post by teirekos »

I've found the log problem, and now I have proper debug logs. But I had to restart both nodes, so the cluster is OK at the moment, though it will soon fail again.
I attach the info you asked for, and I will send fresh elasticsearch debug logs as soon as the problem reoccurs.

Also, I want to report the following in case it is related somehow. On my 1st node, the logstash service does not start after a reboot; I have to start it manually.
Running "service logstash status" reports:
"the logstash daemon dead, but pid file exists."
There is a related forum entry from the past, but it is not clear how it was resolved...

Thanx.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Cluster 2nd Node OFF

Post by jolson »

There are no split brain symptoms in your logs (both nodes point to one master, which looks proper). The results from "curl -XGET 'localhost:9200/_cat/recovery?v'" did look a little strange though - any chance you could run that command one more time on each node while we're waiting on those logs?

Best,


Jesse
teirekos
Posts: 110
Joined: Wed Nov 26, 2014 6:06 am

Re: Cluster 2nd Node OFF

Post by teirekos »

I attach the recovery info as requested, as well as the latest elasticsearch logs in DEBUG mode, since my cluster is off again...
Thanx
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Cluster 2nd Node OFF

Post by jolson »

teirekos,

Thank you for all of the help you've given us so far. I am looking through the logs you have provided. In the meantime, if I could get you to run the following command on each node, I would appreciate it:

Code: Select all

curl -XGET 'http://localhost:9200/_cluster/health/*?level=shards'
Please report the output here.
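As a side note, one way to skim that shard-level output for trouble spots (a sketch; python -mjson.tool is assumed to be available on the appliance, as it ships with the stock python on CentOS-style systems):

```shell
# Pretty-print the shard-level health and pull out any red shards
# along with a few lines of surrounding context.
curl -s 'http://localhost:9200/_cluster/health/*?level=shards' \
    | python -mjson.tool | grep -B 3 '"status": "red"'
```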

There are many logs that point to the "logstash-2015.03.09" index as being corrupt. For instance:

Code: Select all

Line 5267: [2015-03-18 08:15:15,042][WARN ][index.engine.internal    ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed engine [corrupted preexisting index]
	Line 5268: [2015-03-18 08:15:15,048][WARN ][indices.cluster          ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed to start shard
	Line 5281: [2015-03-18 08:15:15,050][WARN ][cluster.action.shard     ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] sending failed shard for [logstash-2015.03.09][0], node[UZrxQW1RRFy46Aj58Klatg], [R], s[INITIALIZING], indexUUID [AjrFVDrpTBuMwm8crIvq-g], reason [Failed to start shard, message [CorruptIndexException[[logstash-2015.03.09][0] Corrupted index [corrupted_-_Vq1X79SB6Z5YXnFRr-vw] caused by: CorruptIndexException[codec footer mismatch: actual footer=-522723112 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/usr/local/nagioslogserver/elasticsearch/data/2b249934-e049-4f18-96ed-db395faae965/nodes/0/indices/logstash-2015.03.09/0/index/_caa_es090_0.pos"))]]]]
	Line 5282: [2015-03-18 08:15:15,052][WARN ][cluster.action.shard     ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] sending failed shard for [logstash-2015.03.09][0], node[UZrxQW1RRFy46Aj58Klatg], [R], s[INITIALIZING], indexUUID [AjrFVDrpTBuMwm8crIvq-g], reason [engine failure, message [corrupted preexisting index][CorruptIndexException[[logstash-2015.03.09][0] Corrupted index [corrupted_-_Vq1X79SB6Z5YXnFRr-vw] caused by: CorruptIndexException[codec footer mismatch: actual footer=-522723112 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/usr/local/nagioslogserver/elasticsearch/data/2b249934-e049-4f18-96ed-db395faae965/nodes/0/indices/logstash-2015.03.09/0/index/_caa_es090_0.pos"))]]]]
	Line 5306: [2015-03-18 08:15:17,984][WARN ][index.engine.internal    ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed engine [corrupted preexisting index]
	Line 5307: [2015-03-18 08:15:17,984][WARN ][indices.cluster          ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed to start shard
	Line 5320: [2015-03-18 08:15:17,985][WARN ][cluster.action.shard     ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] sending failed shard for [logstash-2015.03.09][0], node[UZrxQW1RRFy46Aj58Klatg], [R], s[INITIALIZING], indexUUID [AjrFVDrpTBuMwm8crIvq-g], reason [Failed to start shard, message [CorruptIndexException[[logstash-2015.03.09][0] Corrupted index [corrupted_-_Vq1X79SB6Z5YXnFRr-vw] caused by: CorruptIndexException[codec footer mismatch: actual footer=-522723112 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/usr/local/nagioslogserver/elasticsearch/data/2b249934-e049-4f18-96ed-db395faae965/nodes/0/indices/logstash-2015.03.09/0/index/_caa_es090_0.pos"))]]]]
I suggest either closing or deleting that index, at least until we resolve this problem. If you don't care about the data in that index, you could run the following on each node to delete it:

Code: Select all

curl -XDELETE 'http://localhost:9200/logstash-2015.03.09/'
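If the data in logstash-2015.03.09 does matter, closing the index is the gentler option mentioned above: a closed index keeps its data on disk but its shards are no longer allocated or searched. A sketch using the standard Elasticsearch open/close index API:

```shell
# Close the corrupt index so the cluster stops trying to allocate its shards;
# the data remains on disk and the index can be reopened later.
curl -XPOST 'http://localhost:9200/logstash-2015.03.09/_close'
# To bring it back once the cluster is healthy again:
# curl -XPOST 'http://localhost:9200/logstash-2015.03.09/_open'
```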