ElasticSearch : observer: timeout notification from cluster


ElasticSearch : observer: timeout notification from cluster

Post by ssoliveira »

I have an NLS cluster with 4 nodes:

- Each server has 72GB of memory
- 24 CPUs
- SSD disks (only 273.6GB used)

Network communication is fully open between the 4 servers.

After the cluster has been running for some time, it begins to log "timeout notification from cluster" errors.

Stopping all nodes in the cluster and then starting them one by one fixes the problem for a while, but after some time it recurs.

I have a similar NLS infrastructure that does not have this problem (the only difference between the two environments is that this cluster has more memory, 72GB).

Code: Select all

[2017-05-17 14:59:35,308][DEBUG][action.bulk              ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 14:59:36,048][DEBUG][action.bulk              ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 14:59:40,555][DEBUG][action.bulk              ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 14:59:41,737][INFO ][cluster.service          ] [765cc658-3e5f-4923-804e-5eb57735f761] master {new [8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}}, removed {[8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c][bvE6hf73SQGc3DP05nyWHA][datalog-utb-log2.servicos][inet[/10.154.9.94:9300]]{max_local_storage_nodes=1},}, added {[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}])
[2017-05-17 14:59:42,272][INFO ][cluster.service          ] [765cc658-3e5f-4923-804e-5eb57735f761] added {[8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c][bvE6hf73SQGc3DP05nyWHA][datalog-utb-log2.servicos][inet[/10.154.9.94:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}])
[2017-05-17 14:59:43,456][INFO ][indices.recovery         ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 8026645 numDocs: 8026645 vs. true
[2017-05-17 14:59:44,065][INFO ][indices.recovery         ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 5337923 numDocs: 5337923 vs. true
[2017-05-17 14:59:44,201][INFO ][indices.recovery         ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 3267651 numDocs: 3267651 vs. true
[2017-05-17 14:59:44,638][INFO ][indices.recovery         ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 3490976 numDocs: 3490976 vs. true
[2017-05-17 14:59:44,817][INFO ][indices.recovery         ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 30589 numDocs: 30589 vs. true
[2017-05-17 14:59:44,878][INFO ][indices.recovery         ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 7071508 numDocs: 7071508 vs. true
[2017-05-17 14:59:44,968][INFO ][indices.recovery         ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 7309883 numDocs: 7309883 vs. true
[2017-05-17 14:59:45,042][INFO ][indices.recovery         ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 14 numDocs: 14 vs. true
[2017-05-17 14:59:45,066][INFO ][indices.recovery         ] [765cc658-3e5f-4923-804e-5eb57735f761] Recovery with sync ID 7267696 numDocs: 7267696 vs. true
[2017-05-17 15:47:58,117][DEBUG][action.bulk              ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:58,118][DEBUG][action.bulk              ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:58,699][DEBUG][action.bulk              ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:58,699][DEBUG][action.bulk              ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:58,705][DEBUG][action.bulk              ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:58,705][DEBUG][action.bulk              ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:59,115][DEBUG][action.bulk              ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-17 15:47:59,116][DEBUG][action.bulk              ] [765cc658-3e5f-4923-804e-5eb57735f761] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
Some posts on discuss.elastic.co suggest that servers with more than 64GB of memory should run dedicated master nodes; does the same apply to NLS?
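
(For context, a "dedicated master" in stock Elasticsearch 1.x is simply a node that is master-eligible but holds no data, set in elasticsearch.yml; this sketch only illustrates the concept, since NLS generates its own configuration and it should not be changed without guidance.)

Code: Select all

# elasticsearch.yml on a dedicated master node (illustration only, not an NLS-supported change)
node.master: true   # eligible to be elected master
node.data: false    # holds no index data, so cluster-state work is isolated from indexing load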

How do I solve this problem?

The outages are causing me problems because, when I need to restart the environment, I lose precious logs.

I'm using the nagios-plugin-elasticsearch plugin to monitor the infrastructure, and it reports the cluster as normal.

https://github.com/anchor/nagios-plugin-elasticsearch

Code: Select all

[root@datalog-utb-log2 libexec]# /usr/bin/check_elasticsearch
Monitoring cluster 'a5726a09-769e-4f2b-be91-d786c8165c6f' | cluster_nodes=4;;;;; cluster_master_eligible_nodes=4;;;;; cluster_data_nodes=4;;;;; cluster_active_shards=112;;;;; cluster_relocating_shards=0;;;;; cluster_initialising_shards=0;;;;; cluster_unassigned_shards=0;;;;; cluster_total_shards=112;;;;; cluster_total_indices=12;;;;; cluster_closed_indices=0;;;;; storesize=145073960589B;;;;; documents=126724576c;;;;; index_ops=31256865c;;;;; index_time=12698519ms;;;;; flush_ops=742c;;;;; flush_time=75367ms;;;;; throttle_time=382464ms;;;;; index_ops=31256865c;;;;; index_time=12698519ms;;;;; delete_ops=0c;;;;; delete_time=0ms;;;;; get_ops=8c;;;;; get_time=11ms;;;;; exists_ops=8c;;;;; exists_time=11ms;;;;; missing_ops=0c;;;;; missing_time=0ms;;;;; query_ops=1886c;;;;; query_time=1744345ms;;;;; fetch_ops=139c;;;;; fetch_time=9113ms;;;;; merge_ops=16862c;;;;; refresh_ops=80026c;;;;; refresh_time=2888406ms;;;;; gc_old_count=2c;;;;; gc_young_count=8243c;;;;; heap_used=5%;;;;;
curl -XGET 'localhost:9200/_nodes/jvm?pretty'

Code: Select all

[root@datalog-utb-log2 libexec]# curl -XGET 'localhost:9200/_nodes/jvm?pretty'
{
  "cluster_name" : "a5726a09-769e-4f2b-be91-d786c8165c6f",
  "nodes" : {
    "dfP5HEcGRE6YKtN6A2t8bg" : {
      "name" : "765cc658-3e5f-4923-804e-5eb57735f761",
      "transport_address" : "inet[/10.154.9.93:9300]",
      "host" : "datalog-utb-log1.servicos",
      "ip" : "10.154.9.93",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1651,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494876174495,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "2CakXOyBTzC-c93gPAN-mg" : {
      "name" : "5c998cfb-0460-4e56-8697-83b65c086a13",
      "transport_address" : "inet[/10.154.3.100:9300]",
      "host" : "datalog-ugt-log2.gtservicos",
      "ip" : "10.154.3.100",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 5136,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494876196111,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "bvE6hf73SQGc3DP05nyWHA" : {
      "name" : "8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c",
      "transport_address" : "inet[/10.154.9.94:9300]",
      "host" : "datalog-utb-log2.servicos",
      "ip" : "10.154.9.94",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 13892,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494876148786,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "4Khu6bhxR0Cx3S0VID-9KQ" : {
      "name" : "8471b9e1-1a82-4c3d-98bc-03f2ce871369",
      "transport_address" : "inet[/10.154.3.99:9300]",
      "host" : "datalog-ugt-log1.gtservicos",
      "ip" : "10.154.3.99",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 25718,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494876176820,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    }
  }
}
Occasionally some shards go into an initializing state, and some time later the cluster normalizes and turns green.

Code: Select all

[root@datalog-utb-log2 libexec]# curl -XGET 'localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "a5726a09-769e-4f2b-be91-d786c8165c6f",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 56,
  "active_shards" : 104,
  "relocating_shards" : 0,
  "initializing_shards" : 7,
  "unassigned_shards" : 1,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
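
(Rather than polling by hand, the health API can block until the cluster recovers; a small sketch:)

Code: Select all

# returns when the cluster reaches green, or after 120s with "timed_out" : true
curl -XGET 'localhost:9200/_cluster/health?wait_for_status=green&timeout=120s&pretty'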

tail /var/log/elasticsearch/a5726a09-769e-4f2b-be91-d786c8165c6f.log

Code: Select all

[root@datalog-utb-log2 libexec]# tail  /var/log/elasticsearch/a5726a09-769e-4f2b-be91-d786c8165c6f.log

[2017-05-17 15:07:17,170][WARN ][transport.netty          ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] exception caught on transport layer [[id: 0x5e7a6df0, /10.154.9.72:41630 => /10.154.9.94:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format, got (16,3,1,0)
        at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:63)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2017-05-17 15:07:17,171][WARN ][transport.netty          ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] exception caught on transport layer [[id: 0x5e7a6df0, /10.154.9.72:41630 :> /10.154.9.94:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format, got (16,3,1,0)
        at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:63)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:482)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.channelDisconnected(FrameDecoder.java:365)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:102)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireChannelDisconnected(Channels.java:396)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:360)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:81)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:36)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:574)
        at org.elasticsearch.common.netty.channel.Channels.close(Channels.java:812)
        at org.elasticsearch.common.netty.channel.AbstractChannel.close(AbstractChannel.java:206)
        at org.elasticsearch.transport.netty.NettyTransport.exceptionCaught(NettyTransport.java:638)
        at org.elasticsearch.transport.netty.MessageChannelHandler.exceptionCaught(MessageChannelHandler.java:239)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:377)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:112)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireExceptionCaught(Channels.java:525)
        at org.elasticsearch.common.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:48)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.notifyHandlerException(DefaultChannelPipeline.java:658)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:566)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
[2017-05-17 17:11:00,428][INFO ][discovery.zen            ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] master_left [[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}], reason [transport disconnected]
[2017-05-17 17:11:00,429][WARN ][discovery.zen            ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] master left (reason = transport disconnected), current nodes: {[765cc658-3e5f-4923-804e-5eb57735f761][dfP5HEcGRE6YKtN6A2t8bg][datalog-utb-log1.servicos][inet[/10.154.9.93:9300]]{max_local_storage_nodes=1},[5c998cfb-0460-4e56-8697-83b65c086a13][2CakXOyBTzC-c93gPAN-mg][datalog-ugt-log2.gtservicos][inet[/10.154.3.100:9300]]{max_local_storage_nodes=1},[8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c][bvE6hf73SQGc3DP05nyWHA][datalog-utb-log2.servicos][inet[/10.154.9.94:9300]]{max_local_storage_nodes=1},}
[2017-05-17 17:11:00,429][INFO ][cluster.service          ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] removed {[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-master_failed ([8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1})
[2017-05-17 17:11:01,172][DEBUG][action.admin.cluster.state] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] no known master node, scheduling a retry
[2017-05-17 17:11:15,857][DEBUG][action.admin.cluster.health] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] no known master node, scheduling a retry
[2017-05-17 17:11:24,463][INFO ][cluster.service          ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] detected_master [8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}, added {[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[8471b9e1-1a82-4c3d-98bc-03f2ce871369][4Khu6bhxR0Cx3S0VID-9KQ][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}])
[root@datalog-utb-log2 libexec]#
[root@datalog-utb-log2 libexec]# /usr/bin/check_elasticsearch -vv
One or more indexes are missing replica shards. Use -vv to list them. Index 'logstash-2017.05.14' replica down on shard 0 Index 'logstash-2017.05.10' replica down on shard 0 Index 'logstash-2017.05.10' replica down on shard 1 Index 'logstash-2017.05.10' replica down on shard 2 | cluster_nodes=4;;;;; cluster_master_eligible_nodes=4;;;;; cluster_data_nodes=4;;;;; cluster_active_shards=108;;;;; cluster_relocating_shards=0;;;;; cluster_initialising_shards=4;;;;; cluster_unassigned_shards=0;;;;; cluster_total_shards=112;;;;; cluster_total_indices=12;;;;; cluster_closed_indices=0;;;;; storesize=112737631630B;;;;; documents=99309932c;;;;; index_ops=31340845c;;;;; index_time=12734028ms;;;;; flush_ops=742c;;;;; flush_time=75367ms;;;;; throttle_time=16241ms;;;;; index_ops=31340845c;;;;; index_time=12734028ms;;;;; delete_ops=0c;;;;; delete_time=0ms;;;;; get_ops=8c;;;;; get_time=11ms;;;;; exists_ops=8c;;;;; exists_time=11ms;;;;; missing_ops=0c;;;;; missing_time=0ms;;;;; query_ops=1886c;;;;; query_time=1744345ms;;;;; fetch_ops=139c;;;;; fetch_time=9113ms;;;;; merge_ops=16887c;;;;; refresh_ops=80150c;;;;; refresh_time=2892490ms;;;;; gc_old_count=2c;;;;; gc_young_count=8264c;;;;; heap_used=4%;;;;;

Some time later the status returns to "green".

However, after about 2 or 3 days with the cluster running, it crashes, and I need to restart all services.

References:
https://discuss.elastic.co/t/elasticsea ... vice/32904

Re: ElasticSearch : observer: timeout notification from clus

Post by cdienger »

Performance drops off if Elasticsearch is allocated more than 32GB of heap, and by default NLS allocates half of the total system memory to Elasticsearch. I'd recommend lowering the system memory if possible.
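
The ~32GB ceiling exists because the JVM disables compressed ordinary object pointers (oops) above it, making every object reference twice as large. If you want to verify that a candidate heap size still gets compressed oops, you can ask the JVM directly; a quick check, not NLS-specific:

Code: Select all

# prints "UseCompressedOops := true" if the heap still benefits from compressed pointers
java -Xmx31g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops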

***Note that this would be for troubleshooting only; we do not typically recommend making changes like this, and we definitely do not recommend making any other changes to the file unless advised by us:

Another option would be to edit /etc/sysconfig/elasticsearch. Near the top of the file are the lines:

Code: Select all

# Heap Size (defaults to 256m min, 1g max)
# Nagios Log Server Default to 0.5 physical Memory
ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m

You could try commenting out the ES_HEAP_SIZE line and setting it again below, so it looks like:

Code: Select all

# Heap Size (defaults to 256m min, 1g max)
# Nagios Log Server Default to 0.5 physical Memory
#ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
ES_HEAP_SIZE=31g

I took the 31g recommendation from https://www.elastic.co/guide/en/elastic ... izing.html to help avoid crossing the 32g boundary.

Save the changes and restart elasticsearch with "service elasticsearch restart" for it to take effect.
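
To confirm the new setting took effect after the restart, you can re-run the node query from earlier in the thread; heap_max_in_bytes should drop from roughly 37.7 billion bytes to roughly 33.1 billion (31g):

Code: Select all

curl -XGET 'localhost:9200/_nodes/jvm?pretty' | grep heap_max_in_bytes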

Re: ElasticSearch : observer: timeout notification from clus

Post by ssoliveira »

I decreased the heap from 36GB to 31GB, following your guidelines.

Before

Code: Select all

"jvm" : {
  "pid" : 25718,
  "version" : "1.7.0_131",
  "vm_name" : "OpenJDK 64-Bit Server VM",
  "vm_version" : "24.131-b00",
  "vm_vendor" : "Oracle Corporation",
  "start_time_in_millis" : 1494876176820,
  "mem" : {
    "heap_init_in_bytes" : 37913362432,
    "heap_max_in_bytes" : 37757386752,
    "non_heap_init_in_bytes" : 24313856,
    "non_heap_max_in_bytes" : 224395264,
    "direct_max_in_bytes" : 37757386752
  },

After

Code: Select all

"jvm" : {
  "pid" : 1964,
  "version" : "1.7.0_131",
  "vm_name" : "OpenJDK 64-Bit Server VM",
  "vm_version" : "24.131-b00",
  "vm_vendor" : "Oracle Corporation",
  "start_time_in_millis" : 1495057631677,
  "mem" : {
    "heap_init_in_bytes" : 33285996544,
    "heap_max_in_bytes" : 33128972288,
    "non_heap_init_in_bytes" : 24313856,
    "non_heap_max_in_bytes" : 224395264,
    "direct_max_in_bytes" : 33128972288
  },
  "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
  "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
}


Question:

When I run the top command, two java processes with high memory consumption appear.

The process running as root is logstash, which we run as root so it can listen on port 514.
The other process is elasticsearch, which is configured to use 31GB.

Code: Select all

 1789 root      39  19 17.141g 691760  16108 S  2.7  0.9   0:53.48 java
 1964 nagios    20   0 44.802g 0.031t 129068 S  2.0 45.6   0:25.04 java
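
(As an aside, one way to avoid running logstash as root just for port 514 is to have it listen on an unprivileged port and redirect the syslog port to it with iptables; a hedged sketch, with 5514 as a hypothetical listener port:)

Code: Select all

# redirect inbound syslog from the privileged port to a hypothetical unprivileged listener on 5514
iptables -t nat -A PREROUTING -p udp --dport 514 -j REDIRECT --to-port 5514
iptables -t nat -A PREROUTING -p tcp --dport 514 -j REDIRECT --to-port 5514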

Code: Select all

[root@datalog-utb-log2 ~]# ps aux | grep 1964
nagios 1964 14.1 45.6 58100172 33780492 ? SLl 18:47 1:07 java -Xms31g -Xmx31g -Djava.awt.headless=true
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8
-Des.cluster.name=a5726a09-769e-4f2b-be91-d786c8165c6f
-Des.node.name=8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c
-Des.discovery.zen.ping.unicast.hosts=localhost,,10.154.3.99,10.154.9.94,10.154.9.93,10.154.3.100
-Des.path.repo=/ -Delasticsearch
-Des.pidfile=/var/run/elasticsearch/elasticsearch.pid
-Des.path.home=/usr/local/nagioslogserver/elasticsearch -cp :/usr/local/nagioslogserver/elasticsearch/lib/elasticsearch-1.6.0.jar:/usr/local/nagioslogserver/elasticsearch/lib/*:/usr/local/nagioslogserver/elasticsearch/lib/sigar/*
-Des.default.path.home=/usr/local/nagioslogserver/elasticsearch
-Des.default.path.logs=/var/log/elasticsearch
-Des.default.path.data=/usr/local/datalog
-Des.default.path.work=/usr/local/nagioslogserver/tmp/elasticsearch
-Des.default.path.conf=/usr/local/nagioslogserver/elasticsearch/config org.elasticsearch.bootstrap.Elasticsearch
root 5116 0.0 0.0 112648 964 pts/0 S+ 18:55 0:00 grep --color=auto 1964

Is this normal?

I will monitor the environment over the next few hours and will return here to report on the status.

Re: ElasticSearch : observer: timeout notification from clus

Post by cdienger »

Looking forward to hearing your results :)

This appears to be normal; my lab machine shows similar output and behavior.
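
The large VIRT figure is expected: it typically covers the 31g heap plus the index files Elasticsearch memory-maps, not physical RAM actually in use. The resident set (RES/RSS) is the number to watch; a quick way to compare the two for the Elasticsearch process (PID 1964 in your output):

Code: Select all

# RSS = physical memory in use (KB), VSZ = virtual address space including mmapped index files (KB)
ps -o pid,rss,vsz,comm -p 1964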

Re: ElasticSearch : observer: timeout notification from clus

Post by ssoliveira »

After decreasing the memory from 36GB to 31GB, the problems continue.

In the logs, "observer: timeout notification from cluster service" continues to appear.

The logs are also showing "[logstash-2017.05.18][0] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5":

Code: Select all

[2017-05-18 04:16:54,953][DEBUG][action.bulk ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-18 04:16:55,620][DEBUG][action.bulk ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-18 04:16:55,790][DEBUG][action.bulk ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-18 04:16:55,791][DEBUG][action.bulk ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-05-18 05:07:00,971][INFO ][index.engine ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] [logstash-2017.05.18][0] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2017-05-18 05:07:01,168][INFO ][index.engine ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] [logstash-2017.05.18][3] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2017-05-18 05:07:01,280][INFO ][index.engine ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] [logstash-2017.05.18][3] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2017-05-18 05:07:01,281][INFO ][index.engine ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] [logstash-2017.05.18][3] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2017-05-18 05:07:01,326][INFO ][index.engine ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] [logstash-2017.05.18][0] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2017-05-18 05:07:01,595][INFO ][index.engine ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] [logstash-2017.05.18][3] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2017-05-18 05:07:02,323][INFO ][index.engine ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] [logstash-2017.05.18][0] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2017-05-18 05:07:02,445][INFO ][index.engine ] [8471b9e1-1a82-4c3d-98bc-03f2ce871369] [logstash-2017.05.18][3] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
The percentage of heap memory usage is at 20%, and rising.

In addition to memory tuning, what else can we do to make the NLS solution stable?
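
(For reference, the merge backlog behind these messages is often IO-bound: ES 1.x throttles merge IO to 20mb/s by default, which can be low for SSDs and lets merges pile up until indexing is paused. The limit is a dynamic cluster setting; a hedged sketch of raising it, which should be confirmed with support before applying to NLS:)

Code: Select all

# raise the merge IO throttle from the 20mb default, as often suggested for SSD-backed ES 1.x nodes
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : { "indices.store.throttle.max_bytes_per_sec" : "100mb" }
}'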

Re: ElasticSearch : observer: timeout notification from clus

Post by cdienger »

I assume the heap size was set to 31GB on all machines?

Where are these nodes located in relation to each other? Are they plugged into the same switch?

When was the last crash/reboot? I'd like to get elasticsearch's logs from all the appliances as well as a System Profile (Administration > System > System Status) from each of them. PM them to me when you have a chance.

Re: ElasticSearch : observer: timeout notification from clus

Post by ssoliveira »

Hello,

All servers are configured to use 31GB of memory via the file "/etc/sysconfig/elasticsearch".

A few hours ago, while analyzing the log, we saw the message "java.io.StreamCorruptedException: invalid internal transport message format, got (3,41,4d,52)". Some time later the cluster presented problems.

We opted to upgrade Java from version 1.7.0_131 to 1.7.0_141 and restart all servers.

Two servers are in one rack and the other two are in another rack, connected to different switches.
We have already done extensive network testing with iperf, plus other tests performed by Cisco technical support, which confirmed that there were no communication problems.

Two servers are on one L3 segment and the other two on another, to meet high-availability requirements. There are no firewall rules between the 4 servers, apart from the local iptables rules themselves, which are in agreement with the Nagios setup.
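
(For completeness, transport-port reachability between all four nodes can be spot-checked from each server with bash's built-in /dev/tcp; the IPs are the node addresses listed earlier in the thread:)

Code: Select all

# from each node, verify the Elasticsearch transport port on every peer
for ip in 10.154.3.99 10.154.3.100 10.154.9.93 10.154.9.94; do
  timeout 3 bash -c "echo > /dev/tcp/$ip/9300" && echo "$ip:9300 open" || echo "$ip:9300 unreachable"
done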

I have attached "system-profile.tar.gz".

I have verified that the files "firewalld.txt" and "iptables.txt" inside "system-profile.tar.gz" are blank (I believe the routine that generated the archive could not collect that information).

Attached is the file "/etc/sysconfig/iptables".

The system has been stable for 46 minutes (no errors yet).

We continue to monitor the infrastructure.
If you need more information, please let me know.

I will return with new information in a few hours to comment on the stability of the environment.

Re: ElasticSearch : observer: timeout notification from clus

Post by cdienger »

Thanks for all the details. I was hoping to get 4 system profiles (one from each instance) as well as the logs in /var/log/elasticsearch/ from each machine. If they are too large to PM or post, please upload them somewhere and send me the info to download them.

Re: ElasticSearch : observer: timeout notification from clus

Post by ssoliveira »

Hi, the logs are attached.

Re: ElasticSearch : observer: timeout notification from clus

Post by cdienger »

Thanks for the logs! I think this is triggered by merging. In addition to the numMergesInFlight messages around the time of the timeout messages, there was this from a couple of days ago:

Code: Select all

java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:714)
	at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:391)
	at org.elasticsearch.index.merge.EnableMergeScheduler.merge(EnableMergeScheduler.java:50)
	at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1985)
	at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1979)
	at org.elasticsearch.index.engine.InternalEngine.maybeMerge(InternalEngine.java:793)
	at org.elasticsearch.index.shard.IndexShard$EngineMerger$1.run(IndexShard.java:1237)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
[2017-05-15 12:34:52,408][WARN ][index.shard              ] [8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c] [logstash-2017.05.15][0] Failed to perform scheduled engine optimize/merge
I would recommend disabling optimization under Administration > System > Backup & Maintenance by setting "Optimize Indexes older than" to 0.
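
After disabling optimization, you can verify that merges stop stacking up by checking the merge stats in the node stats API; "current" should stay at or below maxNumMerges. A sketch:

Code: Select all

# shows the in-flight merge count per node; "current" piling up above ~5 indicates merge pressure
curl -XGET 'localhost:9200/_nodes/stats/indices?pretty' | grep -A 4 '"merges"'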