Problem: Elasticsearch does not start correctly on boot

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Locked
ssoliveira
Posts: 91
Joined: Wed Dec 07, 2016 6:02 pm

Problem: Elasticsearch does not start correctly on boot

Post by ssoliveira »

The "elasticsearch" service does not start correctly when restart the server.

Every time I restart the server, I need to manually restart the "elasticsearch" service.

Considerations:

This problem does not occur in CentOS 6
This problem only occurs when I install the NLS infrastructure in CentOS 7.

# cat /etc/centos-release
CentOS Linux release 7.3.1611 (Core)

Nagios Log Server: 1.4.4
Elasticsearch: 1.6.0
Logstash: 1.5.1
Kibana: 3.1.1-nagios3


##### After a reboot the elasticsearch service does not start automatically.
##### Displaying the following error.

[2017-05-05 18:20:30,234][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] version[1.6.0], pid[2671], build[cdd3ac4/2015-06-09T13:36:34Z]
[2017-05-05 18:20:30,234][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initializing ...
[2017-05-05 18:20:30,254][INFO ][plugins ] [5c998cfb-0460-4e56-8697-83b65c086a13] loaded [knapsack-1.5.2.0-f340ad1], sites []
[2017-05-05 18:20:30,341][INFO ][env ] [5c998cfb-0460-4e56-8697-83b65c086a13] using [1] data paths, mounts [[/usr/local/datalog (/dev/mapper/vg_datalog-lv_datalog)]], net usable_space [15tb], net total_space [15.8tb], types [ext4]
[2017-05-05 18:20:34,574][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initialized
[2017-05-05 18:20:34,574][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] starting ...
[2017-05-05 18:20:34,736][INFO ][transport ] [5c998cfb-0460-4e56-8697-83b65c086a13] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/10.154.3.100:9300]}
[2017-05-05 18:20:34,750][INFO ][discovery ] [5c998cfb-0460-4e56-8697-83b65c086a13] a5726a09-769e-4f2b-be91-d786c8165c6f/OPCqK_rHRAmtCmGtOMb7mA
[2017-05-05 18:20:37,782][WARN ][transport.netty ] [5c998cfb-0460-4e56-8697-83b65c086a13] exception caught on transport layer [[id: 0x447eaaab]], closing connection
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


##### After the problem occurs.
##### If I perform a restarts of elasticsearch with
##### the "systemctl restart elasticsearch" command, the service initializes correctly.
##### Apparently, there is some bug, or problem with the dependencies of the service.

[2017-05-05 18:30:22,255][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] stopping ...
[2017-05-05 18:30:22,606][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] stopped
[2017-05-05 18:30:22,606][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] closing ...
[2017-05-05 18:30:22,622][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] closed
[2017-05-05 18:30:29,938][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] version[1.6.0], pid[6112], build[cdd3ac4/2015-06-09T13:36:34Z]
[2017-05-05 18:30:29,938][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initializing ...
[2017-05-05 18:30:29,950][INFO ][plugins ] [5c998cfb-0460-4e56-8697-83b65c086a13] loaded [knapsack-1.5.2.0-f340ad1], sites []
[2017-05-05 18:30:29,992][INFO ][env ] [5c998cfb-0460-4e56-8697-83b65c086a13] using [1] data paths, mounts [[/usr/local/datalog (/dev/mapper/vg_datalog-lv_datalog)]], net usable_space [15tb], net total_space [15.8tb], types [ext4]
[2017-05-05 18:30:32,566][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] initialized
[2017-05-05 18:30:32,566][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] starting ...
[2017-05-05 18:30:32,726][INFO ][transport ] [5c998cfb-0460-4e56-8697-83b65c086a13] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/10.154.3.100:9300]}
[2017-05-05 18:30:32,737][INFO ][discovery ] [5c998cfb-0460-4e56-8697-83b65c086a13] a5726a09-769e-4f2b-be91-d786c8165c6f/bdD-q2t8Qpac9a1BLyxEGQ
[2017-05-05 18:30:34,430][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:34,430][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:34,478][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:34,479][DEBUG][action.admin.indices.create] [5c998cfb-0460-4e56-8697-83b65c086a13] no known master node, scheduling a retry
[2017-05-05 18:30:35,812][INFO ][cluster.service ] [5c998cfb-0460-4e56-8697-83b65c086a13] detected_master [8471b9e1-1a82-4c3d-98bc-03f2ce871369][2KczjkYuStS0z83RbYH4mw][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}, added {[8471b9e1-1a82-4c3d-98bc-03f2ce871369][2KczjkYuStS0z83RbYH4mw][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[8471b9e1-1a82-4c3d-98bc-03f2ce871369][2KczjkYuStS0z83RbYH4mw][datalog-ugt-log1.gtservicos][inet[/10.154.3.99:9300]]{max_local_storage_nodes=1}])
[2017-05-05 18:30:35,934][INFO ][http ] [5c998cfb-0460-4e56-8697-83b65c086a13] bound_address {inet[/127.0.0.1:9200]}, publish_address {inet[localhost/127.0.0.1:9200]}
[2017-05-05 18:30:35,934][INFO ][node ] [5c998cfb-0460-4e56-8697-83b65c086a13] started

##### That may be trying to start while the network is not yet available (it's my suspicion).
##### I believe the version used is not natively compatible with systemd, which still uses chkconfig internally.

# Hypotheses

## Bug
## Problems with dependencies
## Bug related to installation in CentOS 7 (current), with all packages updated
User avatar
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Problem: Elasticsearch does not start correctly on boot

Post by mcapra »

Is this a multi-instance environment?

This message implies that, when Elasticsearch first starts, it's unable to find a path to it's cluster members:

Code: Select all

[2017-05-05 18:20:37,782][WARN ][transport.netty ] [5c998cfb-0460-4e56-8697-83b65c086a13] exception caught on transport layer [[id: 0x447eaaab]], closing connection
java.net.NoRouteToHostException: No route to host
What I think may be causing this is a delayed network connection. Basically, Elasticsearch is trying to join the cluster ahead of the machine actually being on the network.
Former Nagios employee
https://www.mcapra.com/
ssoliveira
Posts: 91
Joined: Wed Dec 07, 2016 6:02 pm

Re: Problem: Elasticsearch does not start correctly on boot

Post by ssoliveira »

Yes, it is a group of 4 servers, in the same L2.

I rebuild the environment; in the laboratory; with 4 servers in CentOS 7, and I can replicate the problem.
After running reboot, I need to wait for the server to restart and then restart the elasticsearch.

Rebuild the environment; in the laboratory; with CentOS 6, and the problem does not occur.

All servers are 100% upgraded via yum.

The environment does not have customizations, following strictly as documentation of the nagios.
User avatar
tacolover101
Posts: 432
Joined: Mon Apr 10, 2017 11:55 am

Re: Problem: Elasticsearch does not start correctly on boot

Post by tacolover101 »

is ipv6 enabled? a thought that comes to mind, is it may be trying to connect to peers through this method.

the other problem could be port 9300 blocked somewhere between the nodes.
User avatar
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Problem: Elasticsearch does not start correctly on boot

Post by mcapra »

I've walked through this in my lab environment with 4 nodes and am still unable to replicate. I do a hard-reset on each CentOS 7 VM, and all members rejoin the cluster without further issues:

Code: Select all

[root@nls1 ~]# curl -XGET localhost:9200/_nodes/jvm?pretty
{
  "cluster_name" : "a4e1a590-f811-4545-b7a9-c5b160afebc2",
  "nodes" : {
    "u1wZXAG7TpeZFKjg2-C_7w" : {
      "name" : "ac1679a4-b4f0-46ff-9203-b1277072c5c6",
      "transport_address" : "inet[/192.168.67.5:9300]",
      "host" : "nls2",
      "ip" : "192.168.67.5",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1751,
        "version" : "1.7.0_121",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.121-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527541932,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "MiASaXZWQIiHibXz0NyhrQ" : {
      "name" : "4c832c80-f512-4cae-9c65-5159efe1a336",
      "transport_address" : "inet[/192.168.67.6:9300]",
      "host" : "nls3",
      "ip" : "192.168.67.6",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1999,
        "version" : "1.7.0_121",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.121-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527544857,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "19aL6Jf-SkiEqapDsRVx1A" : {
      "name" : "d7bfa7b4-a76a-4871-a120-e3399e7f2492",
      "transport_address" : "inet[/192.168.67.4:9300]",
      "host" : "nls1",
      "ip" : "192.168.67.4",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[/192.168.67.4:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1469,
        "version" : "1.7.0_121",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.121-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527542150,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "BFC4Rd_GTOm_tUmiCMgYpQ" : {
      "name" : "34127c70-8136-4ef6-896a-2010ea231c2a",
      "transport_address" : "inet[/192.168.67.7:9300]",
      "host" : "nls4",
      "ip" : "192.168.67.7",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1577,
        "version" : "1.7.0_141",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.141-b02",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494527542557,
        "mem" : {
          "heap_init_in_bytes" : 1987051520,
          "heap_max_in_bytes" : 1953234944,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 1953234944
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    }
  }
}
Can we get all of the Elasticsearch logs from each instance? They're typically found here:

Code: Select all

/var/log/elasticsearch/
Former Nagios employee
https://www.mcapra.com/
ssoliveira
Posts: 91
Joined: Wed Dec 07, 2016 6:02 pm

Re: Problem: Elasticsearch does not start correctly on boot

Post by ssoliveira »

I'm going to restart the cluster again; And send the logs.
Follow the JVM settings

curl -XGET localhost:9200/_nodes/jvm?pretty

Code: Select all

[root@datalog-ugt-log1 ~]# curl -XGET localhost:9200/_nodes/jvm?pretty
{
  "cluster_name" : "a5726a09-769e-4f2b-be91-d786c8165c6f",
  "nodes" : {
    "jAGKtfkcTCG1qGDBedVgsA" : {
      "name" : "8d4f2dfb-f10c-4655-a4b7-8b5eaa9f6a3c",
      "transport_address" : "inet[/10.154.9.94:9300]",
      "host" : "datalog-utb-log2.servicos",
      "ip" : "10.154.9.94",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 18890,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865667666,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "ToOWGPogRZqbohZCnI3f8w" : {
      "name" : "765cc658-3e5f-4923-804e-5eb57735f761",
      "transport_address" : "inet[/10.154.9.93:9300]",
      "host" : "datalog-utb-log1.servicos",
      "ip" : "10.154.9.93",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 4504,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865549411,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "h8ilRIIsTsyUb7BQaZm9uw" : {
      "name" : "8471b9e1-1a82-4c3d-98bc-03f2ce871369",
      "transport_address" : "inet[/10.154.3.99:9300]",
      "host" : "datalog-ugt-log1.gtservicos",
      "ip" : "10.154.3.99",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 24531,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865394077,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "Z9Ei_i-8SFSGSzYY6F68Kw" : {
      "name" : "5c998cfb-0460-4e56-8697-83b65c086a13",
      "transport_address" : "inet[/10.154.3.100:9300]",
      "host" : "datalog-ugt-log2.gtservicos",
      "ip" : "10.154.3.100",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 27785,
        "version" : "1.7.0_131",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.131-b00",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1494865490822,
        "mem" : {
          "heap_init_in_bytes" : 37913362432,
          "heap_max_in_bytes" : 37757386752,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 37757386752
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    }
  }
}
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Problem: Elasticsearch does not start correctly on boot

Post by cdienger »

Thanks for the update and please keep us posted with the results/logs of the reboot.

The output provided would suggest the service was manually started on the localhost before the command was run, is this a correct assessment?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Locked