Page 1 of 1

Problem: Elasticsearch does not start correctly on boot

Posted: Fri May 05, 2017 3:17 pm
by ssoliveira
The "elasticsearch" service does not start correctly when restart the server.

Every time I restart the server, I need to manually restart the "elasticsearch" service.

Considerations:

This problem does not occur in CentOS 6
This problem only occurs when I install the NLS infrastructure in CentOS 7.

# cat /etc/centos-release
CentOS Linux release 7.3.1611 (Core)

Nagios Log Server: 1.4.4
Elasticsearch: 1.6.0
Logstash: 1.5.1
Kibana: 3.1.1-nagios3


##### After a reboot the elasticsearch service does not start automatically.
##### Displaying the following error.

[2017-05-05 18:20:30,234][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] version[1.6.0], pid[2671], build[cdd3ac4/2015-06-09T13:36:34Z]
[2017-05-05 18:20:30,234][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initializing ...
[2017-05-05 18:20:30,254][INFO ][plugins ] [5c998cfb-0460-4e56-8697-83b65c086a13] loaded [knapsack-1.5.2.0-f340ad1], sites []
[2017-05-05 18:20:30,341][INFO ][env ] [5c998cfb-0460-4e56-8697-83b65c086a13] using [1] data paths, mounts [[/usr/local/datalog (/dev/mapper/vg_datalog-lv_datalog)]], net usable_space [15tb], net total_space [15.8tb], types [ext4]
[2017-05-05 18:20:34,574][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initialized
[2017-05-05 18:20:34,574][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] starting ...
[2017-05-05 18:20:34,736][INFO ][transport ] [5c998cfb-0460-4e56-8697-83b65c086a13] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/10.154.3.100:9300]}
[2017-05-05 18:20:34,750][INFO ][discovery ] [5c998cfb-0460-4e56-8697-83b65c086a13] a5726a09-769e-4f2b-be91-d786c8165c6f/OPCqK_rHRAmtCmGtOMb7mA
[2017-05-05 18:20:37,782][WARN ][transport.netty ] [5c998cfb-0460-4e56-8697-83b65c086a13] exception caught on transport layer [[id: 0x447eaaab]], closing connection
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


##### After the problem occurs.
##### If I perform a restarts of elasticsearch with
##### the "systemctl restart elasticsearch" command, the service initializes correctly.
##### Apparently, there is some bug, or problem with the dependencies of the service.

[2017-05-05 18:30:22,255][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] stopping ...
[2017-05-05 18:30:22,606][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] stopped
[2017-05-05 18:30:22,606][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] closing ...
[2017-05-05 18:30:22,622][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] closed
[2017-05-05 18:30:29,938][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] version[1.6.0], pid[6112], build[cdd3ac4/2015-06-09T13:36:34Z]
[2017-05-05 18:30:29,938][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initializing ...
[2017-05-05 18:30:29,950][INFO ][plugins ] [5c998cfb-0460-4e56-8697-83b65c086a13] loaded [knapsack-1.5.2.0-f340ad1], sites []
[2017-05-05 18:30:29,992][INFO ][env ] [5c998cfb-0460-4e56-8697-83b65c086a13] using [1] data paths, mounts [[/usr/local/datalog (/dev/mapper/vg_datalog-lv_datalog)]], net usable_space [15tb], net total_space [15.8tb], types [ext4]
[2017-05-05 18:30:32,566][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initialized
[2017-05-05 18:30:32,566][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] starting ...
[2017-05-05 18:30:32,726][INFO ][transport ] [5c998cfb-0460-4e56-8697-83b65c086a13] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/10.154.3.100:9300]}
[2017-05-05 18:30:32,737][INFO ][discovery ] [5c998cfb-0460-4e56-8697-83b65c086a13] a5726a09-769e-4f2b-be91-d786c8165c6f/bdD-q2t8Qpac9a1BLyxEGQ
[2017-05-05 18:30:34,430][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:34,430][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:34,478][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:34,479][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:35,812][INFO ][cluster.service ] [5c998cfb-0460-4e56-8697-83b65c086a13] detected_master [8471b9e1-1a82-4c3d-98bc-03f2ce871369][2KczjkYuStS0z83RbYH4mw][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}, added {[8471b9e1-1a82-4c3d-98bc-03f2ce871369][2KczjkYuStS0z83RbYH4mw][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[8471b9e1-1a82-4c3d-98bc-03f2ce871369][2KczjkYuStS0z83RbYH4mw][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}])
[2017-05-05 18:30:35,934][INFO ][http ] [5c998cfb-0460-4e56-8697-83b65c086a13] bound_address {inet[/127.0.0.1:9200]}, publish_address {inet[localhost/127.0.0.1:9200]}
[2017-05-05 18:30:35,934][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] started

##### That may be trying to start while the network is not yet available (it's my suspicion).
##### I believe the version used is not natively compatible with systemd, which still uses chkconfig internally.

# Hypotheses

## Bug
## Problems with dependencies
## Bug related to installation in CentOS 7 (current), with all packages updated

Re: Problem: Elasticsearch does not start correctly on boot

Posted: Mon May 08, 2017 9:37 am
by mcapra
Is this a multi-instance environment?

This message implies that, when Elasticsearch first starts, it's unable to find a path to it's cluster members:

Code: Select all

[2017-05-05 18:20:37,782][WARN ][transport.netty ] [5c998cfb-0460-4e56-8697-83b65c086a13] exception caught on transport layer [[id: 0x447eaaab]], closing connection
java.net.NoRouteToHostException: No route to host
What I think may be causing this is a delayed network connection. Basically, Elasticsearch is trying to join the cluster ahead of the machine actually being on the network.

Re: Problem: Elasticsearch does not start correctly on boot

Posted: Wed May 10, 2017 3:43 pm
by ssoliveira
Yes, it is a group of 4 servers, in the same L2.

I rebuild the environment; in the laboratory; with 4 servers in CentOS 7, and I can replicate the problem.
After running reboot, I need to wait for the server to restart and then restart the elasticsearch.

Rebuild the environment; in the laboratory; with CentOS 6, and the problem does not occur.

All servers are 100% upgraded via yum.

The environment does not have customizations, following strictly as documentation of the nagios.

Re: Problem: Elasticsearch does not start correctly on boot

Posted: Wed May 10, 2017 5:05 pm
by tacolover101
is ipv6 enabled? a thought that comes to mind, is it may be trying to connect to peers through this method.

the other problem could be port 9300 blocked somewhere between the nodes.

Re: Problem: Elasticsearch does not start correctly on boot

Posted: Thu May 11, 2017 1:39 pm
by mcapra
I've walked through this in my lab environment with 4 nodes and am still unable to replicate. I do a hard-reset on each CentOS 7 VM, and all members rejoin the cluster without further issues:

Code: Select all

[root@nls1 ~]# curl -XGET localhost:9200/_nodes/jvm?pretty
{
  "cluster_name" : "a4e1a590-f811-4545-b7a9-c5b160afebc2",
  "nodes" : {
    "u1wZXAG7TpeZFKjg2-C_7w" : {
      "name" : "ac1679a4-b4f0-46ff-9203-b1277072c5c6",
      "transport_address" : "inet[/192.168.67.5:9300]",
      "host" : "nls2",
      "ip" : "192.168.67.5",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1751,
        "version" : "1.7.0_121",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.121-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527541932,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "MiASaXZWQIiHibXz0NyhrQ" : {
      "name" : "4c832c80-f512-4cae-9c65-5159efe1a336",
      "transport_address" : "inet[/192.168.67.6:9300]",
      "host" : "nls3",
      "ip" : "192.168.67.6",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1999,
        "version" : "1.7.0_121",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.121-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527544857,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "19aL6Jf-SkiEqapDsRVx1A" : {
      "name" : "d7bfa7b4-a76a-4871-a120-e3399e7f2492",
      "transport_address" : "inet[/192.168.67.4:9300]",
      "host" : "nls1",
      "ip" : "192.168.67.4",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[/192.168.67.4:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1469,
        "version" : "1.7.0_121",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.121-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527542150,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "BFC4Rd_GTOm_tUmiCMgYpQ" : {
      "name" : "34127c70-8136-4ef6-896a-2010ea231c2a",
      "transport_address" : "inet[/192.168.67.7:9300]",
      "host" : "nls4",
      "ip" : "192.168.67.7",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1577,
        "version" : "1.7.0_141",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.141-b02",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527542557,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    }
  }
}
Can we get all of the Elasticsearch logs from each instance? They're typically found here:

Code: Select all

/var/log/elasticsearch/

Re: Problem: Elasticsearch does not start correctly on boot

Posted: Mon May 15, 2017 11:31 am
by ssoliveira
I'm going to restart the cluster again; And send the logs.
Follow the JVM settings

curl -XGET localhost:9200/_nodes/jvm?pretty

Code: Select all

[root@datalog-ugt-log1 ~]# curl -XGET localhost:9200/_nodes/jvm?pretty
{
  "cluster_name" : "a5726a09-769e-4f2b-be91-d786c8165c6f",
  "nodes" : {
    "jAGKtfkcTCG1qGDBedVgsA" : {
      "name" : "8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c",
      "transport_address" : "inet[/10.154.9.94:9300]",
      "host" : "datalog-utb-log2.servicos",
      "ip" : "10.154.9.94",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 18890,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865667666,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "ToOWGPogRZqbohZCnI3f8w" : {
      "name" : "765cc658-3e5f-4923-804e-5eb57735f761",
      "transport_address" : "inet[/10.154.9.93:9300]",
      "host" : "datalog-utb-log1.servicos",
      "ip" : "10.154.9.93",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 4504,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865549411,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "h8ilRIIsTsyUb7BQaZm9uw" : {
      "name" : "8471b9e1-1a82-4c3d-98bc-03f2ce871369",
      "transport_address" : "inet[/10.154.3.99:9300]",
      "host" : "datalog-ugt-log1.gtservicos",
      "ip" : "10.154.3.99",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 24531,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865394077,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "Z9Ei_i-8SFSGSzYY6F68Kw" : {
      "name" : "5c998cfb-0460-4e56-8697-83b65c086a13",
      "transport_address" : "inet[/10.154.3.100:9300]",
      "host" : "datalog-ugt-log2.gtservicos",
      "ip" : "10.154.3.100",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 27785,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865490822,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    }
  }
}

Re: Problem: Elasticsearch does not start correctly on boot

Posted: Mon May 15, 2017 2:32 pm
by cdienger
Thanks for the update and please keep us posted with the results/logs of the reboot.

The output provided would suggest the service was manually started on the localhost before the command was run, is this a correct assessment?