
Re: Shards stay unassigned after yum update

Posted: Thu Sep 10, 2015 2:13 am
by WillemDH
Jesse,

Hmm, I'm not sure things are working as expected. Since you seemed to say that adding a second data path would make the cluster start replicating data between the data paths, I added the second data path /mnt/data to the /etc/sysconfig/elasticsearch config file.
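For reference, the change was just appending the new path to the data directory line, something like this (a sketch only; I'm assuming the DATA_DIR variable from the stock ES 1.x sysconfig file, and the original path shown is illustrative):

Code: Select all

# /etc/sysconfig/elasticsearch
# multiple data paths are comma-separated; ES spreads new shards across them
DATA_DIR=/usr/local/nagioslogserver/elasticsearch/data,/mnt/data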
However, my cluster health has now changed from yellow to red:

Code: Select all

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "ee9e60a0-f4cb-41ec-a97f-8f17434b748e",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 169,
  "active_shards" : 338,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
There does seem to be data appearing in the new data path /mnt/data, so maybe it will just take some time.
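In the meantime, the _cat allocation API is a handy way to watch how shard counts and disk usage are spread across the nodes:

Code: Select all

# per-node shard count plus used/available disk, with column headers
curl 'http://localhost:9200/_cat/allocation?v'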

I'll update this thread if the situation changes.

EDIT 1: Apparently the Logstash service on node 1, where I added the data path, has stopped. Logs are no longer coming in on either node... :(
EDIT 2: When I try to restart the Logstash service, it stops again after some time.
EDIT 3: These are the errors I get in the Elasticsearch logs on node 1:

Code: Select all

[2015-09-10 09:26:09,283][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:09,283][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:09,323][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:09,324][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:09,342][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:09,343][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
On node 2:

Code: Select all

[2015-09-10 09:25:33,712][INFO ][cluster.routing.allocation.decider] [95f9ab14-da22-4144-bb0b-6bbc5662115c] low disk watermark [85%] exceeded on [SbqNgsIXQpG_c9o3mKwYiQ][95f9ab14-da22-4144-bb0b-6bbc5662115c] free: 18.9gb[12.8%], replicas will not be assigned to this node
[2015-09-10 09:26:03,714][INFO ][cluster.routing.allocation.decider] [95f9ab14-da22-4144-bb0b-6bbc5662115c] low disk watermark [85%] exceeded on [SbqNgsIXQpG_c9o3mKwYiQ][95f9ab14-da22-4144-bb0b-6bbc5662115c] free: 18.9gb[12.8%], replicas will not be assigned to this node
[2015-09-10 09:26:06,553][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:06,554][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:07,704][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:07,704][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:07,752][INFO ][cluster.metadata         ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] [logstash-2015.09.10] update_mapping [syslog-brocade] (dynamic)
[2015-09-10 09:26:08,687][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:08,687][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:14,594][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:14,594][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:26:33,717][INFO ][cluster.routing.allocation.decider] [95f9ab14-da22-4144-bb0b-6bbc5662115c] low disk watermark [85%] exceeded on [SbqNgsIXQpG_c9o3mKwYiQ][95f9ab14-da22-4144-bb0b-6bbc5662115c] free: 18.9gb[12.8%], replicas will not be assigned to this node
[2015-09-10 09:27:03,712][INFO ][cluster.routing.allocation.decider] [95f9ab14-da22-4144-bb0b-6bbc5662115c] low disk watermark [85%] exceeded on [SbqNgsIXQpG_c9o3mKwYiQ][95f9ab14-da22-4144-bb0b-6bbc5662115c] free: 18.9gb[12.8%], replicas will not be assigned to this node
[2015-09-10 09:27:06,579][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:06,579][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:07,732][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:07,732][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:08,736][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:08,744][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:14,612][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-09-10 09:27:14,612][DEBUG][action.bulk              ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
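Those low disk watermark messages explain the unassigned replicas: once a node's disk usage passes 85%, Elasticsearch refuses to place replica shards on it. If needed, the threshold can be raised at runtime through the cluster settings API (a sketch; transient, so it resets on a full cluster restart, and it's only a stopgap while freeing up space):

Code: Select all

# temporarily raise the low watermark from the default 85% to 90%
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.disk.watermark.low" : "90%"
  }
}'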
I'm also seeing this error in the console:

Code: Select all

Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace
EDIT 4:

It seems I keep getting OutOfMemory exceptions.

Code: Select all

ConcurrencyError: interrupted waiting for mutex: null
                       lock at org/jruby/ext/thread/Mutex.java:94
          execute_task_once at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/delay.rb:83
                       wait at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/delay.rb:60
                      value at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/obligation.rb:47
           global_timer_set at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/configuration.rb:58
  finalize_global_executors at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/configuration.rb:137
                 Concurrent at /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/concurrent-ruby-0.8.0-java/lib/concurrent/configuration.rb:165
Sep 10, 2015 10:32:35 AM org.elasticsearch.transport.netty.NettyInternalESLogger warn
WARNING: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded

Sep 10, 2015 10:32:54 AM org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler doSample
INFO: [c4d16075-9bc2-4095-9f00-e7de7f96930c] failed to get node info for [#transport#-1][srvnaglog01][inet[localhost/127.0.0.1:9300]], disconnecting...
java.lang.OutOfMemoryError: GC overhead limit exceeded

Sep 10, 2015 10:32:35 AM org.elasticsearch.transport.netty.NettyInternalESLogger warn
WARNING: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded

Sep 10, 2015 10:32:55 AM org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler doSample
INFO: [c4d16075-9bc2-4095-9f00-e7de7f96930c] failed to get node info for [#transport#-1][srvnaglog01][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [16] timed out after [21001ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Sep 10, 2015 10:32:55 AM org.elasticsearch.transport.netty.NettyTransport exceptionCaught
WARNING: [c4d16075-9bc2-4095-9f00-e7de7f96930c] exception caught on transport layer [[id: 0x89a930b9, /127.0.0.1:33448 => localhost/127.0.0.1:9300]], closing connection
java.lang.OutOfMemoryError: GC overhead limit exceeded

Sep 10, 2015 10:32:55 AM org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler doSample
INFO: [c4d16075-9bc2-4095-9f00-e7de7f96930c] failed to get node info for [#transport#-1][srvnaglog01][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [17] timed out after [13301ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Sep 10, 2015 10:32:55 AM org.elasticsearch.transport.netty.NettyTransport exceptionCaught
WARNING: [c4d16075-9bc2-4095-9f00-e7de7f96930c] exception caught on transport layer [[id: 0x89a930b9, /127.0.0.1:33448 :> localhost/127.0.0.1:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format, got (5a,56,1,1)
        at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:47)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace
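That 500M cap is the default JVM heap Logstash runs with here. Stock Logstash init scripts read the LS_HEAP_SIZE environment variable to size the heap, so one way to raise it would be something like this (a sketch; the sysconfig location is an assumption for this NLS setup, and Logstash needs a restart afterwards):

Code: Select all

# /etc/sysconfig/logstash (hypothetical location for this setup)
# raise the Logstash JVM heap from the 500M default to 1 GB
LS_HEAP_SIZE=1g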
EDIT 5:

Ok, now even more trouble. The snapshot I took before extending the disk by 20 GB (from 150 to 170 GB) exploded within one hour to 869 GB. This left the datastore with 0 bytes free and crashed the VM. The other VMs on the datastore also experienced problems. It might be a good idea to mention this when advising people to take snapshots of NLS nodes. :)

Anyway, I'm trying to recover from the whole situation atm.

EDIT 6:

So I managed to recover from most issues. I resolved the cluster's red status and the unassigned shards with these commands:

Code: Select all

curl 'http://localhost:9200/_cat/shards' | grep UNASS
The above listed my unassigned shards with their index and shard numbers. Then I ran:

Code: Select all

curl -XPOST -d '{ "commands" : [ { "allocate" : { "index" : "logstash-2015.09.10", "shard" : 3, "node" : "srvnaglog01", "allow_primary" : true } } ] }' http://localhost:9200/_cluster/reroute?pretty
I repeated this for each unassigned shard that was found.
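To save typing, the same thing can be scripted from the _cat output (a rough sketch assuming the default _cat/shards column order; note that allow_primary=true will bring a shard back empty if no intact copy exists, so it accepts possible data loss on that shard):

Code: Select all

#!/bin/bash
# hypothetical helper: allocate every UNASSIGNED shard to one node
# default _cat/shards columns are: index shard prirep state docs store ip node
NODE="srvnaglog01"
curl -s 'http://localhost:9200/_cat/shards' | awk '$4 == "UNASSIGNED" { print $1, $2 }' |
while read INDEX SHARD; do
  curl -XPOST 'http://localhost:9200/_cluster/reroute?pretty' -d "{
    \"commands\" : [ { \"allocate\" : {
        \"index\" : \"$INDEX\", \"shard\" : $SHARD,
        \"node\" : \"$NODE\", \"allow_primary\" : true } } ] }"
done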

It seems I have only one issue left atm: Logstash on one of the nodes keeps crashing from time to time with an OutOfMemory exception.

EDIT 7:

As things are pretty much 'normal' atm, I decided to update from NLS 2.1 to the latest version, 2.2. But when I follow the procedure to update NLS and execute the upgrade script, I get:

Code: Select all

./upgrade
patching file /usr/lib/python2.6/site-packages/curator/curator.py
Reversed (or previously applied) patch detected!  Assume -R? [n]
What should I do with this?

Grtz

Willem

Re: Shards stay unassigned after yum update

Posted: Thu Sep 10, 2015 9:41 am
by jolson
Very glad to hear that you got this mostly resolved.
As things are pretty much 'normal' atm, I decided to update from NLS 2.1 to the latest version, 2.2. But when I follow the procedure to update NLS and execute the upgrade script, I get:
./upgrade
patching file /usr/lib/python2.6/site-packages/curator/curator.py
Reversed (or previously applied) patch detected! Assume -R? [n]
What should I do with this?
You can go ahead and press 'n' when that question appears. It's asking if you'd like to overwrite the old version of curator.py, even though yours is already updated.

Re: Shards stay unassigned after yum update

Posted: Thu Sep 10, 2015 11:01 am
by WillemDH
Very glad to hear that you got this mostly resolved.
Yeah, well, we had never seen a snapshot grow bigger than two times the size of the VM, lol. Had some new experiences today and learned a lot. You can close this thread. Managed to update both nodes to 2.2.

Grtz.

Willem

Re: Shards stay unassigned after yum update

Posted: Thu Sep 10, 2015 12:23 pm
by jolson
Yeah, well, we had never seen a snapshot grow bigger than two times the size of the VM, lol. Had some new experiences today and learned a lot. You can close this thread. Managed to update both nodes to 2.2.
Will do. I imagine the snapshot growth was due to the rebalancing of data that happened when you defined your second data directory, or something similar; it's hard to say. From the VMware documentation:
Be especially diligent with snapshot use on high-transaction virtual machines such as email and database servers. These snapshots can very quickly grow in size, filling datastore space. Commit snapshots on these virtual machines as soon as you have verified the proper working state of the process you are testing.
I'll close this thread as resolved. Thanks Willem!