Re: Shards stay unassigned after yum update
Posted: Thu Sep 10, 2015 2:13 am
Jesse,
Hmm, I'm not sure things are working as expected, but since you seemed to say that adding the second data path would make the cluster start replicating the data between the data paths, I added the second data path /mnt/data to the /etc/sysconfig/elasticsearch config file.
However, my cluster health has now changed from yellow to red:
Code: Select all
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
"cluster_name" : "ee9e60a0-f4cb-41ec-a97f-8f17434b748e",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 169,
"active_shards" : 338,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 4,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0
}
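For reference, the change boils down to giving Elasticsearch a comma-separated list of data paths, roughly like this (a sketch only; the variable name and the original path are assumptions, so check what the sysconfig file on an NLS install actually uses):
Code: Select all
# sketch: DATA_DIR and the first path are assumptions, adjust to the existing setting
DATA_DIR=/var/lib/elasticsearch,/mnt/data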
There does seem to be data appearing in the new data path /mnt/data, so maybe it just needs some time.
I'll update this thread if the situation changes.
EDIT 1: Apparently the Logstash service on node 1, where I added the data path, has stopped. Logs are no longer coming in on either node...
EDIT 2: When I try to restart the Logstash service, it stops again after some time.
EDIT 3: These are the errors I get in the Elasticsearch logs on node 1:
Code: Select all
[2015-09-10 09:26:09,283][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:09,283][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:09,323][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:09,324][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:09,342][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:09,343][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
On node 2:
Code: Select all
[2015-09-10 09:25:33,712][INFO ][cluster.routing.allocation.decider] [95f9ab14-da22-4144-bb0b-6bbc5662115c] low disk watermark [85%] exceeded on [SbqNgsIXQpG_c9o3mKwYiQ][95f9ab14-da22-4144-bb0b-6bbc5662115c] free: 18.9gb[12.8%], replicas will not be assigned to this node
[2015-09-10 09:26:03,714][INFO ][cluster.routing.allocation.decider] [95f9ab14-da22-4144-bb0b-6bbc5662115c] low disk watermark [85%] exceeded on [SbqNgsIXQpG_c9o3mKwYiQ][95f9ab14-da22-4144-bb0b-6bbc5662115c] free: 18.9gb[12.8%], replicas will not be assigned to this node
[2015-09-10 09:26:06,553][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:06,554][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:07,704][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:07,704][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:07,752][INFO ][cluster.metadata ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] [logstash-2015.09.10] update_mapping [syslog-brocade] (dynamic)
[2015-09-10 09:26:08,687][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:08,687][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:14,594][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:14,594][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:33,717][INFO ][cluster.routing.allocation.decider] [95f9ab14-da22-4144-bb0b-6bbc5662115c] low disk watermark [85%] exceeded on [SbqNgsIXQpG_c9o3mKwYiQ][95f9ab14-da22-4144-bb0b-6bbc5662115c] free: 18.9gb[12.8%], replicas will not be assigned to this node
[2015-09-10 09:27:03,712][INFO ][cluster.routing.allocation.decider] [95f9ab14-da22-4144-bb0b-6bbc5662115c] low disk watermark [85%] exceeded on [SbqNgsIXQpG_c9o3mKwYiQ][95f9ab14-da22-4144-bb0b-6bbc5662115c] free: 18.9gb[12.8%], replicas will not be assigned to this node
[2015-09-10 09:27:06,579][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:06,579][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:07,732][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:07,732][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:08,736][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:08,744][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:14,612][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:14,612][DEBUG][action.bulk ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
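Those watermark lines say node 2 is short on disk, so Elasticsearch refuses to place replicas there. If freeing up space is not enough, temporarily raising the low watermark through the cluster settings API should let replicas allocate again; a sketch (not something I have run yet):
Code: Select all
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.disk.watermark.low" : "90%"
  }
}'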
I'm also seeing this error in the console:
Code: Select all
Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace
EDIT 4:
I keep getting OutOfMemory exceptions, it seems.
Code: Select all
ConcurrencyError: interrupted waiting for mutex: null
lock at org/jruby/ext/thread/Mutex.java:94
execute_task_once at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/delay.rb:83
wait at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/delay.rb:60
value at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/obligation.rb:47
global_timer_set at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/configuration.rb:58
finalize_global_executors at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/configuration.rb:137
Concurrent at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/configuration.rb:165
Sep 10, 2015 10:32:35 AM org.elasticsearch.transport.netty.NettyInternalESLogger warn
WARNING: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded
Sep 10, 2015 10:32:54 AM org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler doSample
INFO: [c4d16075-9bc2-4095-9f00-e7de7f96930c] failed to get node info for [#transport#-1][srvnaglog01][inet[localhost/127.0.0.1:9300]], disconnecting...
java.lang.OutOfMemoryError: GC overhead limit exceeded
Sep 10, 2015 10:32:35 AM org.elasticsearch.transport.netty.NettyInternalESLogger warn
WARNING: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded
Sep 10, 2015 10:32:55 AM org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler doSample
INFO: [c4d16075-9bc2-4095-9f00-e7de7f96930c] failed to get node info for [#transport#-1][srvnaglog01][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [16] timed out after [21001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Sep 10, 2015 10:32:55 AM org.elasticsearch.transport.netty.NettyTransport exceptionCaught
WARNING: [c4d16075-9bc2-4095-9f00-e7de7f96930c] exception caught on transport layer [[id: 0x89a930b9, /127.0.0.1:33448 => localhost/127.0.0.1:9300]], closing connection
java.lang.OutOfMemoryError: GC overhead limit exceeded
Sep 10, 2015 10:32:55 AM org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler doSample
INFO: [c4d16075-9bc2-4095-9f00-e7de7f96930c] failed to get node info for [#transport#-1][srvnaglog01][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [17] timed out after [13301ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Sep 10, 2015 10:32:55 AM org.elasticsearch.transport.netty.NettyTransport exceptionCaught
WARNING: [c4d16075-9bc2-4095-9f00-e7de7f96930c] exception caught on transport layer [[id: 0x89a930b9, /127.0.0.1:33448 :> localhost/127.0.0.1:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format, got (5a,56,1,1)
at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:47)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace
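Given the 500M safety cap in that message, raising the Logstash heap looks like the obvious knob to try; something along these lines (a sketch, since I have not confirmed which variable the NLS-bundled Logstash reads, although stock Logstash 1.x honours LS_HEAP_SIZE and the error itself points at the -J-Xmx flag):
Code: Select all
# assumption: the bundled Logstash reads LS_HEAP_SIZE like a stock 1.x install
export LS_HEAP_SIZE=1024m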
EDIT 5:
Ok, now even more trouble. The snapshot I took before extending the disk by 20 GB (from 150 to 170 GB) exploded within an hour to 869 GB. This left the datastore with 0 bytes free and crashed the VM, and the other VMs on the datastore also experienced problems. It might be a good idea to mention this when advising people to take snapshots of NLS nodes.
Anyway, trying to recover from the whole situation at the moment.
EDIT 6:
So I managed to recover from most issues. I resolved the cluster red status and the unassigned shards with these commands:
Code: Select all
curl 'http://localhost:9200/_cat/shards' | grep UNASS
The above listed my unassigned shards with their index and shard numbers. Then I ran:
Code: Select all
curl -XPOST -d '{ "commands" : [ { "allocate" : { "index" : "logstash-2015.09.10", "shard" : 3, "node" : "srvnaglog01", "allow_primary" : true } } ] }' 'http://localhost:9200/_cluster/reroute?pretty'
for each unassigned shard I found.
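Doing that by hand for every shard gets tedious, so a loop like this should cover them in one go (a rough sketch, untested; it just feeds the index and shard columns from the grep above back into the same reroute call):
Code: Select all
curl -s 'http://localhost:9200/_cat/shards' | grep UNASS | while read index shard rest; do
  curl -XPOST 'http://localhost:9200/_cluster/reroute?pretty' -d '{ "commands" : [ { "allocate" : { "index" : "'"$index"'", "shard" : '"$shard"', "node" : "srvnaglog01", "allow_primary" : true } } ] }'
done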
It seems I have only one issue left at the moment, and that is Logstash on one of the nodes, which keeps crashing from time to time with an OutOfMemory exception.
EDIT 7:
As things are pretty much 'normal' at the moment, I decided to update from NLS 2.1 to the latest version, 2.2. But when I follow the procedure to update NLS and execute the upgrade script, I get:
Code: Select all
./upgrade
patching file /usr/lib/python2.6/site-packages/curator/curator.py
Reversed (or previously applied) patch detected! Assume -R? [n]
What should I do with this?
Grtz
Willem