Cluster 2nd Node OFF
Re: Cluster 2nd Node OFF
Indeed, there was a problem with the time on the second node. We have fixed it and rebooted both nodes. The time is correct now! I'll wait and see how the cluster behaves...
Re: Cluster 2nd Node OFF
Hope that clears the issue, keep us updated. Thanks.
Re: Cluster 2nd Node OFF
The problem persists. I am attaching the Elasticsearch logs in DEBUG mode.
Re: Cluster 2nd Node OFF
In the most recent logs you sent, the latest timestamp in elasticsearchlogs_A is 3/10/2015 08:18 PM and the latest in elasticsearchlogs_B is 3/11/2015 02:17 AM. Either I am looking at the same log entries as before, or the time difference between the nodes is still the same +06:00.
Can you issue the following commands from both Nagios Log Servers:
Code: Select all
date
Re: Cluster 2nd Node OFF
Code: Select all
[root@NagiosLogServer /]# date
Tue Mar 17 10:00:33 EET 2015
Code: Select all
[root@NagiosLogServer2 /]# date
Tue Mar 17 10:00:33 EET 2015
On my 1st node only, I have switched back to INFO logging and restarted, but there are still no logs.
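For future checks, the clock comparison can be scripted rather than eyeballed. This is only a minimal sketch, assuming `date +%s` (epoch seconds) is available on both nodes; the `clock_skew` helper name is hypothetical and not part of Nagios Log Server.

```shell
# Print the absolute difference, in seconds, between two epoch timestamps.
# clock_skew is a hypothetical helper name used for illustration only.
clock_skew() {
  a="$1"
  b="$2"
  diff=$(( a - b ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  echo "$diff"
}

# Usage (assumption: you run `date +%s` on each node at about the same
# moment and pass both values in):
#   clock_skew "$ts_node1" "$ts_node2"
```

Epoch seconds do not depend on the configured time zone, so a result of a few seconds or less would suggest the clocks agree, while a +06:00 offset would show up as 21600.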
Re: Cluster 2nd Node OFF
On both of your nodes, please run the following commands.
See master:
Code: Select all
curl 'localhost:9200/_cat/master?v'
See nodes:
Code: Select all
curl 'localhost:9200/_cat/nodes?v'
Pending tasks:
Code: Select all
curl 'localhost:9200/_cat/pending_tasks?v'
See recovery:
Code: Select all
curl -XGET 'localhost:9200/_cat/recovery?v'
Please post the results back to us. As for your logs - I would check your elasticsearch configuration files and ensure that everything looks proper:
Code: Select all
grep LOG_DIR /etc/sysconfig/elasticsearch
Code: Select all
cat /usr/local/nagioslogserver/elasticsearch/config/logging.yml
Re: Cluster 2nd Node OFF
I've found the logging problem and now have proper debug logs. However, I had to restart both nodes, so the cluster is OK at the moment but will soon fail again.
I am attaching the info you asked for, and I will send fresh elasticsearch debug logs as soon as the problem reoccurs.
I also want to report the following, just in case it is related. On my 1st node, the logstash service does not start after a reboot; I have to start it manually.
The output of "service logstash status" is:
"the logstash daemon dead, but pid file exists."
There is a related forum thread from the past, but it is not clear how it was resolved...
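For context, that status message generally means the pid file is still on disk but the process it names is no longer running. A minimal sketch of that check follows; the `pid_file_is_stale` helper name and the pid file path in the usage comment are assumptions, not taken from the Nagios Log Server init script.

```shell
# Return 0 (true) when the pid file exists but no running process matches it.
# pid_file_is_stale is a hypothetical helper name.
pid_file_is_stale() {
  pid_file="$1"
  [ -f "$pid_file" ] || return 1      # no pid file at all: nothing is stale
  pid=$(cat "$pid_file")
  [ -n "$pid" ] || return 0           # an empty pid file counts as stale
  if kill -0 "$pid" 2>/dev/null; then
    return 1                          # the process is still alive
  fi
  return 0
}

# Usage (run as root so kill -0 can probe processes owned by other users;
# the path below is an assumption and may differ on your system):
#   pid_file_is_stale /var/run/logstash.pid && echo "stale pid file"
```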
Thanx.
Re: Cluster 2nd Node OFF
There are no split brain symptoms in your logs (both nodes point to one master, which looks proper). The results from "curl -XGET 'localhost:9200/_cat/recovery?v'" did look a little strange though - any chance you could run that command one more time on each node while we're waiting on those logs?
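If it helps for later checks, the "one master" comparison can be scripted. This is a minimal sketch, assuming the `_cat/master?v` output is a header row followed by one row whose first column is the master's node id; `master_id` and `same_master` are hypothetical helper names.

```shell
# Read `_cat/master?v` output on stdin and print the master's node id
# (first column of the row that follows the header added by ?v).
master_id() {
  awk 'NR == 2 { print $1 }'
}

# Succeed only when both captured outputs name the same, non-empty master id.
same_master() {
  id_a=$(printf '%s\n' "$1" | master_id)
  id_b=$(printf '%s\n' "$2" | master_id)
  [ -n "$id_a" ] && [ "$id_a" = "$id_b" ]
}

# Usage (assumption: out_a and out_b hold the output captured on each node):
#   out_a=$(curl -s 'localhost:9200/_cat/master?v')   # run on node 1
#   out_b=$(curl -s 'localhost:9200/_cat/master?v')   # run on node 2
#   same_master "$out_a" "$out_b" && echo "both nodes agree on the master"
```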
Best,
Jesse
Re: Cluster 2nd Node OFF
I am attaching the recovery info as requested, as well as the latest elasticsearch logs in DEBUG mode, since my cluster is off again...
Thanx
Re: Cluster 2nd Node OFF
teirekos,
Thank you for all of the help you've given us so far. I am looking through the logs you have provided. In the meantime, if I could get you to run the following command on each node, I would appreciate it:
Code: Select all
curl -XGET 'http://localhost:9200/_cluster/health/*?level=shards'
Please report the output here.
There are many log entries that point to the "logstash-2015.03.09" index as being corrupt. For instance:
Code: Select all
Line 5267: [2015-03-18 08:15:15,042][WARN ][index.engine.internal ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed engine [corrupted preexisting index]
Line 5268: [2015-03-18 08:15:15,048][WARN ][indices.cluster ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed to start shard
Line 5281: [2015-03-18 08:15:15,050][WARN ][cluster.action.shard ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] sending failed shard for [logstash-2015.03.09][0], node[UZrxQW1RRFy46Aj58Klatg], [R], s[INITIALIZING], indexUUID [AjrFVDrpTBuMwm8crIvq-g], reason [Failed to start shard, message [CorruptIndexException[[logstash-2015.03.09][0] Corrupted index [corrupted_-_Vq1X79SB6Z5YXnFRr-vw] caused by: CorruptIndexException[codec footer mismatch: actual footer=-522723112 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/usr/local/nagioslogserver/elasticsearch/data/2b249934-e049-4f18-96ed-db395faae965/nodes/0/indices/logstash-2015.03.09/0/index/_caa_es090_0.pos"))]]]]
Line 5282: [2015-03-18 08:15:15,052][WARN ][cluster.action.shard ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] sending failed shard for [logstash-2015.03.09][0], node[UZrxQW1RRFy46Aj58Klatg], [R], s[INITIALIZING], indexUUID [AjrFVDrpTBuMwm8crIvq-g], reason [engine failure, message [corrupted preexisting index][CorruptIndexException[[logstash-2015.03.09][0] Corrupted index [corrupted_-_Vq1X79SB6Z5YXnFRr-vw] caused by: CorruptIndexException[codec footer mismatch: actual footer=-522723112 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/usr/local/nagioslogserver/elasticsearch/data/2b249934-e049-4f18-96ed-db395faae965/nodes/0/indices/logstash-2015.03.09/0/index/_caa_es090_0.pos"))]]]]
Line 5306: [2015-03-18 08:15:17,984][WARN ][index.engine.internal ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed engine [corrupted preexisting index]
Line 5307: [2015-03-18 08:15:17,984][WARN ][indices.cluster ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed to start shard
Line 5320: [2015-03-18 08:15:17,985][WARN ][cluster.action.shard ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] sending failed shard for [logstash-2015.03.09][0], node[UZrxQW1RRFy46Aj58Klatg], [R], s[INITIALIZING], indexUUID [AjrFVDrpTBuMwm8crIvq-g], reason [Failed to start shard, message [CorruptIndexException[[logstash-2015.03.09][0] Corrupted index [corrupted_-_Vq1X79SB6Z5YXnFRr-vw] caused by: CorruptIndexException[codec footer mismatch: actual footer=-522723112 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/usr/local/nagioslogserver/elasticsearch/data/2b249934-e049-4f18-96ed-db395faae965/nodes/0/indices/logstash-2015.03.09/0/index/_caa_es090_0.pos"))]]]]
I suggest either closing or deleting that index, at least until we resolve this problem. If you don't care about the data in that index, you could run the following on each node to delete it:
Code: Select all
curl -XDELETE 'http://localhost:9200/logstash-2015.03.09/'
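If you would rather close the index than delete it, here is a minimal sketch, assuming the standard Elasticsearch close-index endpoint (`POST /<index>/_close`). The `close_index` helper and the `ES_URL` and `DRY_RUN` variables are hypothetical names for illustration only.

```shell
# Base URL of the local Elasticsearch node (assumption: default port 9200).
ES_URL="${ES_URL:-http://localhost:9200}"

# Close an index: it is blocked for reads and writes, but its data stays on disk.
# With DRY_RUN=1 the command is printed instead of being executed.
close_index() {
  index="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "curl -XPOST '$ES_URL/$index/_close'"
  else
    curl -XPOST "$ES_URL/$index/_close"
  fi
}

# Usage:
#   DRY_RUN=1 close_index logstash-2015.03.09   # show what would be sent
#   close_index logstash-2015.03.09             # actually close the index
```

Closing keeps the option of reopening the index later with the matching `_open` endpoint, whereas the delete above is permanent.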