Cluster 2nd Node OFF

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
teirekos
Posts: 110
Joined: Wed Nov 26, 2014 6:06 am

Re: Cluster 2nd Node OFF

Post by teirekos »

Indeed there was a problem with the time on the second node. We have fixed it now and rebooted both nodes again. The time is correct now! I'll wait and watch the cluster's behavior...
cmerchant
Posts: 546
Joined: Wed Sep 24, 2014 11:19 am

Re: Cluster 2nd Node OFF

Post by cmerchant »

Hope that clears the issue, keep us updated. Thanks.
teirekos
Posts: 110
Joined: Wed Nov 26, 2014 6:06 am

Re: Cluster 2nd Node OFF

Post by teirekos »

The problem persists. I have attached the elasticsearch logs in DEBUG mode.
cmerchant
Posts: 546
Joined: Wed Sep 24, 2014 11:19 am

Re: Cluster 2nd Node OFF

Post by cmerchant »

In the most recent logs you sent, the latest timestamp in elasticsearchlogs_A is 3/10/2015 08:18 PM and in elasticsearchlogs_B is 3/11/2015 02:17 AM. Either I am looking at the same log entries as before, or the time difference is still the same +06:00?

Can you issue the following command on both Nagios Log Servers:

Code: Select all

date
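If the two nodes disagree again, a minimal sketch for diagnosing and correcting it (the ntpd/ntpdate remediation line is an assumption, based on the CentOS-style hosts Log Server typically runs on, and is commented out so nothing changes by accident):

```shell
# Compare local and UTC time on each node; a fixed offset such as +06:00
# between nodes usually means a wrong timezone rather than clock drift.
date
date -u
# Assumed remediation if the clocks truly differ:
# service ntpd stop && ntpdate pool.ntp.org && service ntpd start
```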
teirekos
Posts: 110
Joined: Wed Nov 26, 2014 6:06 am

Re: Cluster 2nd Node OFF

Post by teirekos »

Code: Select all

[root@NagiosLogServer /]# date
Tue Mar 17 10:00:33 EET 2015

Code: Select all

[root@NagiosLogServer2 /]# date
Tue Mar 17 10:00:33 EET 2015
Also, no elasticsearch logs have been generated since 03/11/15.
On my 1st node only, I've put INFO logging back and restarted, but there are still no logs.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Cluster 2nd Node OFF

Post by jolson »

On both of your nodes, please run the following commands.
See master:

Code: Select all

curl 'localhost:9200/_cat/master?v'
See nodes:

Code: Select all

curl 'localhost:9200/_cat/nodes?v'
Pending tasks:

Code: Select all

curl 'localhost:9200/_cat/pending_tasks?v'
See recovery:

Code: Select all

curl -XGET 'localhost:9200/_cat/recovery?v'
Please post the results back to us. As for your logs - I would check on your elasticsearch configuration file and ensure that everything looks proper:

Code: Select all

grep LOG_DIR /etc/sysconfig/elasticsearch

Code: Select all

cat /usr/local/nagioslogserver/elasticsearch/config/logging.yml
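If those two files look sane but logs still are not appearing, a small follow-up sketch to find where elasticsearch is actually writing (the fallback path is an assumption, not confirmed for this install):

```shell
# Pull LOG_DIR out of the sysconfig file, falling back to a guessed default,
# then list the newest files there to see whether anything is being written.
LOG_DIR=$(grep '^LOG_DIR' /etc/sysconfig/elasticsearch 2>/dev/null | cut -d= -f2)
ls -lt "${LOG_DIR:-/var/log/elasticsearch}" 2>/dev/null | head
```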
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
teirekos
Posts: 110
Joined: Wed Nov 26, 2014 6:06 am

Re: Cluster 2nd Node OFF

Post by teirekos »

I've found the log problem, and now I have proper debug logs. But I had to restart both nodes, so the cluster is OK at the moment, though it will soon fail again.
I attach the info you asked for, and I will send fresh elasticsearch debug logs as soon as the problem reoccurs.

Also, I want to report the following in case it is related somehow. On my 1st node, the logstash service does not start after a reboot; I have to start it manually.
Running "service logstash status" reports:
"the logstash daemon dead, but pid file exists."
There is a related forum entry from the past, but it is not clear how it was resolved...

Thanx.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Cluster 2nd Node OFF

Post by jolson »

There are no split brain symptoms in your logs (both nodes point to one master, which looks proper). The results from "curl -XGET 'localhost:9200/_cat/recovery?v'" did look a little strange though - any chance you could run that command one more time on each node while we're waiting on those logs?

Best,


Jesse
teirekos
Posts: 110
Joined: Wed Nov 26, 2014 6:06 am

Re: Cluster 2nd Node OFF

Post by teirekos »

I attach the recovery info as requested, as well as the latest elasticsearch logs in DEBUG mode, since my cluster is off again...
Thanx
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Cluster 2nd Node OFF

Post by jolson »

teirekos,

Thank you for all of the help you've given us so far. I am looking through the logs you have provided. In the meantime, if I could get you to run the following command on each node, I would appreciate it:

Code: Select all

curl -XGET 'http://localhost:9200/_cluster/health/*?level=shards'
Please report the output here.
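As a side note, one way to skim that shard-level output for trouble spots (a sketch; python -mjson.tool is assumed to be available on the appliance, as it ships with the stock python on CentOS-style systems):

```shell
# Pretty-print the shard-level health and pull out any red shards
# along with a few lines of surrounding context.
curl -s 'http://localhost:9200/_cluster/health/*?level=shards' \
    | python -mjson.tool | grep -B 3 '"status": "red"'
```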

There are many logs that point to the "logstash-2015.03.09" index as being corrupt. For instance:

Code: Select all

Line 5267: [2015-03-18 08:15:15,042][WARN ][index.engine.internal    ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed engine [corrupted preexisting index]
	Line 5268: [2015-03-18 08:15:15,048][WARN ][indices.cluster          ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed to start shard
	Line 5281: [2015-03-18 08:15:15,050][WARN ][cluster.action.shard     ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] sending failed shard for [logstash-2015.03.09][0], node[UZrxQW1RRFy46Aj58Klatg], [R], s[INITIALIZING], indexUUID [AjrFVDrpTBuMwm8crIvq-g], reason [Failed to start shard, message [CorruptIndexException[[logstash-2015.03.09][0] Corrupted index [corrupted_-_Vq1X79SB6Z5YXnFRr-vw] caused by: CorruptIndexException[codec footer mismatch: actual footer=-522723112 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/usr/local/nagioslogserver/elasticsearch/data/2b249934-e049-4f18-96ed-db395faae965/nodes/0/indices/logstash-2015.03.09/0/index/_caa_es090_0.pos"))]]]]
	Line 5282: [2015-03-18 08:15:15,052][WARN ][cluster.action.shard     ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] sending failed shard for [logstash-2015.03.09][0], node[UZrxQW1RRFy46Aj58Klatg], [R], s[INITIALIZING], indexUUID [AjrFVDrpTBuMwm8crIvq-g], reason [engine failure, message [corrupted preexisting index][CorruptIndexException[[logstash-2015.03.09][0] Corrupted index [corrupted_-_Vq1X79SB6Z5YXnFRr-vw] caused by: CorruptIndexException[codec footer mismatch: actual footer=-522723112 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/usr/local/nagioslogserver/elasticsearch/data/2b249934-e049-4f18-96ed-db395faae965/nodes/0/indices/logstash-2015.03.09/0/index/_caa_es090_0.pos"))]]]]
	Line 5306: [2015-03-18 08:15:17,984][WARN ][index.engine.internal    ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed engine [corrupted preexisting index]
	Line 5307: [2015-03-18 08:15:17,984][WARN ][indices.cluster          ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] failed to start shard
	Line 5320: [2015-03-18 08:15:17,985][WARN ][cluster.action.shard     ] [845bc07c-ed91-4920-8e23-747c9cc699f5] [logstash-2015.03.09][0] sending failed shard for [logstash-2015.03.09][0], node[UZrxQW1RRFy46Aj58Klatg], [R], s[INITIALIZING], indexUUID [AjrFVDrpTBuMwm8crIvq-g], reason [Failed to start shard, message [CorruptIndexException[[logstash-2015.03.09][0] Corrupted index [corrupted_-_Vq1X79SB6Z5YXnFRr-vw] caused by: CorruptIndexException[codec footer mismatch: actual footer=-522723112 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/usr/local/nagioslogserver/elasticsearch/data/2b249934-e049-4f18-96ed-db395faae965/nodes/0/indices/logstash-2015.03.09/0/index/_caa_es090_0.pos"))]]]]
I suggest either closing or deleting that index, at least until we resolve this problem. If you don't care about the data in that index, you could run the following on each node to delete it:

Code: Select all

curl -XDELETE 'http://localhost:9200/logstash-2015.03.09/'
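If the data in logstash-2015.03.09 does matter, closing the index is the gentler option mentioned above: a closed index keeps its data on disk but its shards are no longer allocated or searched. A sketch using the standard Elasticsearch open/close index API:

```shell
# Close the corrupt index so the cluster stops trying to allocate its shards;
# the data remains on disk and the index can be reopened later.
curl -XPOST 'http://localhost:9200/logstash-2015.03.09/_close'
# To bring it back once the cluster is healthy again:
# curl -XPOST 'http://localhost:9200/logstash-2015.03.09/_open'
```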