Problem: Elasticsearch does not start correctly on boot

ssoliveira · Post by **ssoliveira** » Fri May 05, 2017 3:17 pm

The "elasticsearch" service does not start correctly when restart the server.

Every time I restart the server, I need to manually restart the "elasticsearch" service.

Considerations:

This problem does not occur in CentOS 6
This problem only occurs when I install the NLS infrastructure in CentOS 7.

# cat /etc/centos-release
CentOS Linux release 7.3.1611 (Core)

Nagios Log Server: 1.4.4
Elasticsearch: 1.6.0
Logstash: 1.5.1
Kibana: 3.1.1-nagios3

##### After a reboot the elasticsearch service does not start automatically.
##### Displaying the following error.

[2017-05-05 18:20:30,234][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] version[1.6.0], pid[2671], build[cdd3ac4/2015-06-09T13:36:34Z]
[2017-05-05 18:20:30,234][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initializing ...
[2017-05-05 18:20:30,254][INFO ][plugins ] [5c998cfb-0460-4e56-8697-83b65c086a13] loaded [knapsack-1.5.2.0-f340ad1], sites []
[2017-05-05 18:20:30,341][INFO ][env ] [5c998cfb-0460-4e56-8697-83b65c086a13] using [1] data paths, mounts [[/usr/local/datalog (/dev/mapper/vg_datalog-lv_datalog)]], net usable_space [15tb], net total_space [15.8tb], types [ext4]
[2017-05-05 18:20:34,574][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initialized
[2017-05-05 18:20:34,574][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] starting ...
[2017-05-05 18:20:34,736][INFO ][transport ] [5c998cfb-0460-4e56-8697-83b65c086a13] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/10.154.3.100:9300]}
[2017-05-05 18:20:34,750][INFO ][discovery ] [5c998cfb-0460-4e56-8697-83b65c086a13] a5726a09-769e-4f2b-be91-d786c8165c6f/OPCqK_rHRAmtCmGtOMb7mA
[2017-05-05 18:20:37,782][WARN ][transport.netty ] [5c998cfb-0460-4e56-8697-83b65c086a13] exception caught on transport layer [[id: 0x447eaaab]], closing connection
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

##### After the problem occurs.
##### If I perform a restarts of elasticsearch with
##### the "systemctl restart elasticsearch" command, the service initializes correctly.
##### Apparently, there is some bug, or problem with the dependencies of the service.

[2017-05-05 18:30:22,255][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] stopping ...
[2017-05-05 18:30:22,606][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] stopped
[2017-05-05 18:30:22,606][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] closing ...
[2017-05-05 18:30:22,622][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] closed
[2017-05-05 18:30:29,938][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] version[1.6.0], pid[6112], build[cdd3ac4/2015-06-09T13:36:34Z]
[2017-05-05 18:30:29,938][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initializing ...
[2017-05-05 18:30:29,950][INFO ][plugins ] [5c998cfb-0460-4e56-8697-83b65c086a13] loaded [knapsack-1.5.2.0-f340ad1], sites []
[2017-05-05 18:30:29,992][INFO ][env ] [5c998cfb-0460-4e56-8697-83b65c086a13] using [1] data paths, mounts [[/usr/local/datalog (/dev/mapper/vg_datalog-lv_datalog)]], net usable_space [15tb], net total_space [15.8tb], types [ext4]
[2017-05-05 18:30:32,566][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initialized
[2017-05-05 18:30:32,566][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] starting ...
[2017-05-05 18:30:32,726][INFO ][transport ] [5c998cfb-0460-4e56-8697-83b65c086a13] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/10.154.3.100:9300]}
[2017-05-05 18:30:32,737][INFO ][discovery ] [5c998cfb-0460-4e56-8697-83b65c086a13] a5726a09-769e-4f2b-be91-d786c8165c6f/bdD-q2t8Qpac9a1BLyxEGQ
[2017-05-05 18:30:34,430][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:34,430][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:34,478][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:34,479][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:35,812][INFO ][cluster.service ] [5c998cfb-0460-4e56-8697-83b65c086a13] detected_master [8471b9e1-1a82-4c3d-98bc-03f2ce871369][2KczjkYuStS0z83RbYH4mw][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}, added {[8471b9e1-1a82-4c3d-98bc-03f2ce871369][2KczjkYuStS0z83RbYH4mw][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[8471b9e1-1a82-4c3d-98bc-03f2ce871369][2KczjkYuStS0z83RbYH4mw][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}])
[2017-05-05 18:30:35,934][INFO ][http ] [5c998cfb-0460-4e56-8697-83b65c086a13] bound_address {inet[/127.0.0.1:9200]}, publish_address {inet[localhost/127.0.0.1:9200]}
[2017-05-05 18:30:35,934][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] started

##### That may be trying to start while the network is not yet available (it's my suspicion).
##### I believe the version used is not natively compatible with systemd, which still uses chkconfig internally.

# Hypotheses

## Bug
## Problems with dependencies
## Bug related to installation in CentOS 7 (current), with all packages updated

Post by **mcapra** » Mon May 08, 2017 9:37 am

Is this a multi-instance environment?

This message implies that, when Elasticsearch first starts, it's unable to find a path to it's cluster members:

Code: Select all

[2017-05-05 18:20:37,782][WARN ][transport.netty ] [5c998cfb-0460-4e56-8697-83b65c086a13] exception caught on transport layer [[id: 0x447eaaab]], closing connection
java.net.NoRouteToHostException: No route to host

What I think may be causing this is a delayed network connection. Basically, Elasticsearch is trying to join the cluster ahead of the machine actually being on the network.

ssoliveira · Post by **ssoliveira** » Wed May 10, 2017 3:43 pm

Yes, it is a group of 4 servers, in the same L2.

I rebuild the environment; in the laboratory; with 4 servers in CentOS 7, and I can replicate the problem.
After running reboot, I need to wait for the server to restart and then restart the elasticsearch.

Rebuild the environment; in the laboratory; with CentOS 6, and the problem does not occur.

All servers are 100% upgraded via yum.

The environment does not have customizations, following strictly as documentation of the nagios.

Post by **tacolover101** » Wed May 10, 2017 5:05 pm

is ipv6 enabled? a thought that comes to mind, is it may be trying to connect to peers through this method.

the other problem could be port 9300 blocked somewhere between the nodes.

Post by **mcapra** » Thu May 11, 2017 1:39 pm

I've walked through this in my lab environment with 4 nodes and am still unable to replicate. I do a hard-reset on each CentOS 7 VM, and all members rejoin the cluster without further issues:

Code: Select all

[root@nls1 ~]# curl -XGET localhost:9200/_nodes/jvm?pretty
{
  "cluster_name" : "a4e1a590-f811-4545-b7a9-c5b160afebc2",
  "nodes" : {
    "u1wZXAG7TpeZFKjg2-C_7w" : {
      "name" : "ac1679a4-b4f0-46ff-9203-b1277072c5c6",
      "transport_address" : "inet[/192.168.67.5:9300]",
      "host" : "nls2",
      "ip" : "192.168.67.5",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1751,
        "version" : "1.7.0_121",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.121-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527541932,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "MiASaXZWQIiHibXz0NyhrQ" : {
      "name" : "4c832c80-f512-4cae-9c65-5159efe1a336",
      "transport_address" : "inet[/192.168.67.6:9300]",
      "host" : "nls3",
      "ip" : "192.168.67.6",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1999,
        "version" : "1.7.0_121",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.121-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527544857,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "19aL6Jf-SkiEqapDsRVx1A" : {
      "name" : "d7bfa7b4-a76a-4871-a120-e3399e7f2492",
      "transport_address" : "inet[/192.168.67.4:9300]",
      "host" : "nls1",
      "ip" : "192.168.67.4",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[/192.168.67.4:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1469,
        "version" : "1.7.0_121",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.121-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527542150,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "BFC4Rd_GTOm_tUmiCMgYpQ" : {
      "name" : "34127c70-8136-4ef6-896a-2010ea231c2a",
      "transport_address" : "inet[/192.168.67.7:9300]",
      "host" : "nls4",
      "ip" : "192.168.67.7",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1577,
        "version" : "1.7.0_141",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.141-b02",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527542557,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    }
  }
}

Can we get all of the Elasticsearch logs from each instance? They're typically found here:

Code: Select all

/var/log/elasticsearch/

ssoliveira · Post by **ssoliveira** » Mon May 15, 2017 11:31 am

I'm going to restart the cluster again; And send the logs.
Follow the JVM settings

curl -XGET localhost:9200/_nodes/jvm?pretty

Code: Select all

[root@datalog-ugt-log1 ~]# curl -XGET localhost:9200/_nodes/jvm?pretty
{
  "cluster_name" : "a5726a09-769e-4f2b-be91-d786c8165c6f",
  "nodes" : {
    "jAGKtfkcTCG1qGDBedVgsA" : {
      "name" : "8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c",
      "transport_address" : "inet[/10.154.9.94:9300]",
      "host" : "datalog-utb-log2.servicos",
      "ip" : "10.154.9.94",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 18890,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865667666,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "ToOWGPogRZqbohZCnI3f8w" : {
      "name" : "765cc658-3e5f-4923-804e-5eb57735f761",
      "transport_address" : "inet[/10.154.9.93:9300]",
      "host" : "datalog-utb-log1.servicos",
      "ip" : "10.154.9.93",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 4504,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865549411,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "h8ilRIIsTsyUb7BQaZm9uw" : {
      "name" : "8471b9e1-1a82-4c3d-98bc-03f2ce871369",
      "transport_address" : "inet[/10.154.3.99:9300]",
      "host" : "datalog-ugt-log1.gtservicos",
      "ip" : "10.154.3.99",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 24531,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865394077,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "Z9Ei_i-8SFSGSzYY6F68Kw" : {
      "name" : "5c998cfb-0460-4e56-8697-83b65c086a13",
      "transport_address" : "inet[/10.154.3.100:9300]",
      "host" : "datalog-ugt-log2.gtservicos",
      "ip" : "10.154.3.100",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 27785,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865490822,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    }
  }
}

Post by **cdienger** » Mon May 15, 2017 2:32 pm

Thanks for the update and please keep us posted with the results/logs of the reboot.

The output provided would suggest the service was manually started on the localhost before the command was run, is this a correct assessment?

Nagios Support Forum

Problem: Elasticsearch does not start correctly on boot

Problem: Elasticsearch does not start correctly on boot

Re: Problem: Elasticsearch does not start correctly on boot

Re: Problem: Elasticsearch does not start correctly on boot

Re: Problem: Elasticsearch does not start correctly on boot

Re: Problem: Elasticsearch does not start correctly on boot

Re: Problem: Elasticsearch does not start correctly on boot

Re: Problem: Elasticsearch does not start correctly on boot