
!! Please - need immediate help - 2 of 4 systems down

Posted: Wed Aug 07, 2019 12:03 pm
by SteveBeauchemin
My system is not running on 2 of my 4 hosts. I am unable to get it running.

I could use some real help. I will even open a ticket to get some hands-on help.

Please advise.

Steve B

The following is from right after a reboot:

Code:

[2019-08-07 11:46:00,913][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] version[1.7.6], pid[2794], build[c730b59/2016-11-18T15:21:16Z]
[2019-08-07 11:46:00,917][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] initializing ...
[2019-08-07 11:46:01,035][INFO ][plugins                  ] [4c87ddc4-146d-45de-9730-5b229ba4b096] loaded [knapsack-1.7.3.0-d0ea246], sites []
[2019-08-07 11:46:01,073][INFO ][env                      ] [4c87ddc4-146d-45de-9730-5b229ba4b096] using [1] data paths, mounts [[/usr/local (/dev/mapper/VolGroup10-LogVol00)]], net usable_space [690.8gb], net total_space [799.6gb], types [xfs]
[2019-08-07 11:46:04,602][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] initialized
[2019-08-07 11:46:04,602][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] starting ...
[2019-08-07 11:46:04,931][INFO ][transport                ] [4c87ddc4-146d-45de-9730-5b229ba4b096] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/131.198.86.106:9300]}
[2019-08-07 11:46:04,962][INFO ][discovery                ] [4c87ddc4-146d-45de-9730-5b229ba4b096] 79e8bf76-674f-4ecd-8741-27a3587a3f39/24Xf7lY3Q_GUPWciyNmLVQ
[2019-08-07 11:46:05,122][WARN ][transport.netty          ] [4c87ddc4-146d-45de-9730-5b229ba4b096] exception caught on transport layer [[id: 0x7f2389dd]], closing connection
java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:101)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
And it just repeats the following over and over:

Code:

[2019-08-07 11:46:05,176][WARN ][transport.netty          ] [4c87ddc4-146d-45de-9730-5b229ba4b096] exception caught on transport layer [[id: 0xc3b78af9]], closing connection
java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:101)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
        at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:574)
        at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:634)
        at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:216)
        at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:229)
        at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:182)
        at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:787)
        at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:754)
        at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:726)
        at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:220)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:373)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
and finally...

Code:

[2019-08-07 11:46:08,107][INFO ][cluster.service          ] [4c87ddc4-146d-45de-9730-5b229ba4b096] detected_master [e0985c3d-398c-44bb-8ad7-0177911c5a6a][F27LyfzzS5CfdD_fZkpyEg][crulnls02.rockwellcollins.com][inet[/131.198.86.107:9300]]{max_local_storage_nodes=1}, added {[e0985c3d-398c-44bb-8ad7-0177911c5a6a][F27LyfzzS5CfdD_fZkpyEg][crulnls02.rockwellcollins.com][inet[/131.198.86.107:9300]]{max_local_storage_nodes=1},[bc61f6af-6e43-449d-9fca-16658531110b][bwjIbP_bQoGCKpSnzVSf7Q][dtulnls01.rockwellcollins.com][inet[/10.55.131.46:9300]]{max_local_storage_nodes=1},[d7d08025-52f9-44ca-af64-0beca7c2f116][HCkcKcaeTUuFhHgvjg93XQ][ciulnls01.rockwellcollins.com][inet[/10.54.131.32:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[e0985c3d-398c-44bb-8ad7-0177911c5a6a][F27LyfzzS5CfdD_fZkpyEg][crulnls02.rockwellcollins.com][inet[/131.198.86.107:9300]]{max_local_storage_nodes=1}])
[2019-08-07 11:46:08,343][ERROR][bootstrap                ] [4c87ddc4-146d-45de-9730-5b229ba4b096] Exception
org.elasticsearch.http.BindHttpException: Failed to bind to [9200]
        at org.elasticsearch.http.netty.NettyHttpServerTransport.doStart(NettyHttpServerTransport.java:269)
        at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
        at org.elasticsearch.http.HttpServer.doStart(HttpServer.java:89)
        at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
        at org.elasticsearch.node.internal.InternalNode.start(InternalNode.java:274)
        at org.elasticsearch.bootstrap.Bootstrap.start(Bootstrap.java:160)
        at org.elasticsearch.bootstrap.Bootstrap.main(Bootstrap.java:248)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:32)
Caused by: org.elasticsearch.common.netty.channel.ChannelException: Failed to bind to: localhost/127.0.0.1:9200
        at org.elasticsearch.common.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
        at org.elasticsearch.http.netty.NettyHttpServerTransport$1.onPortNumber(NettyHttpServerTransport.java:260)
        at org.elasticsearch.common.transport.PortsRange.iterate(PortsRange.java:58)
        at org.elasticsearch.http.netty.NettyHttpServerTransport.doStart(NettyHttpServerTransport.java:256)
        ... 7 more
Caused by: java.net.BindException: Address already in use
        at sun.nio.ch.Net.bind0(Native Method)
        at sun.nio.ch.Net.bind(Net.java:433)
        at sun.nio.ch.Net.bind(Net.java:425)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[2019-08-07 11:46:08,354][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] stopping ...
[2019-08-07 11:46:08,425][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] stopped
[2019-08-07 11:46:08,425][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] closing ...
[2019-08-07 11:46:08,450][WARN ][netty.channel.DefaultChannelPipeline] An exception was thrown by an exception handler.
java.util.concurrent.RejectedExecutionException: Worker has already been shutdown
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.registerTask(AbstractNioSelector.java:120)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:72)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:56)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.execute(DefaultChannelPipeline.java:636)
        at org.elasticsearch.common.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:496)
        at org.elasticsearch.common.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:46)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.notifyHandlerException(DefaultChannelPipeline.java:658)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:577)
        at org.elasticsearch.common.netty.channel.Channels.write(Channels.java:704)
        at org.elasticsearch.common.netty.channel.Channels.write(Channels.java:671)
        at org.elasticsearch.common.netty.channel.AbstractChannel.write(AbstractChannel.java:348)
        at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:105)
        at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:76)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:292)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:283)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[2019-08-07 11:46:08,453][WARN ][netty.channel.DefaultChannelPipeline] An exception was thrown by an exception handler.
java.util.concurrent.RejectedExecutionException: Worker has already been shutdown
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.registerTask(AbstractNioSelector.java:120)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:72)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:56)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.execute(DefaultChannelPipeline.java:636)
        at org.elasticsearch.common.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:496)

Re: !! Please - need immediate help - 2 of 4 systems down

Posted: Wed Aug 07, 2019 12:28 pm
by cdienger
Something appears to be bound to port 9200. On both machines, run the following and make sure it is down before starting again:

Code:

service elasticsearch stop
netstat -nap | grep 9200
If anything is still using port 9200 after stopping the service, the netstat output will show what it is, and it can be stopped with a "kill <PID>" from the command line.
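For reference, the PID sits in the last column of the netstat output. A minimal sketch of pulling it out with plain shell parameter expansion (the sample line below is illustrative, not taken from the poster's systems):

```shell
# Illustrative netstat -nap line; the PID/Program column is the last field.
line='tcp        0      0 127.0.0.1:9200          0.0.0.0:*               LISTEN      2794/java'
pidprog="${line##* }"   # last whitespace-separated field, e.g. "2794/java"
pid="${pidprog%%/*}"    # strip the program name, leaving just "2794"
echo "$pid"
```

With the PID in hand, `kill "$pid"` (escalating to `kill -9` only as a last resort) frees the port.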

If the problem persists, please provide copies of /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml and system profiles from the systems. Profiles can be gathered under Admin > System > System Status > Download System Profile or from the command line with:

/usr/local/nagioslogserver/scripts/profile.sh

This will create /tmp/system-profile.tar.gz.

Note that this file can be very large and may be too big to upload through the ticketing system. This is usually due to the logs in the Logstash and/or Elasticsearch directories found in it. If it is too large, please open the profile, extract these directories/files, and send them separately.
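A sketch of trimming an oversized tarball before upload. The internal directory names ("logstash", "elasticsearch", "config") are assumptions for illustration; this builds a stand-in profile so the repack step is self-contained:

```shell
# Build a stand-in profile tarball (placeholder paths, for illustration only).
work=$(mktemp -d)
mkdir -p "$work/profile/logstash" "$work/profile/elasticsearch" "$work/profile/config"
echo 'sample log' > "$work/profile/logstash/big.log"
echo 'sample yml' > "$work/profile/config/elasticsearch.yml"
tar -czf "$work/system-profile.tar.gz" -C "$work" profile

# Repack without the heavy log directories; send those separately.
tar -czf "$work/system-profile-small.tar.gz" \
    --exclude='profile/logstash' --exclude='profile/elasticsearch' \
    -C "$work" profile
tar -tzf "$work/system-profile-small.tar.gz"
```

The `--exclude` patterns drop the named directories and everything under them while keeping the rest of the profile (including elasticsearch.yml) intact.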

Re: !! Please - need immediate help - 2 of 4 systems down

Posted: Wed Aug 07, 2019 1:06 pm
by SteveBeauchemin
It is okay now - but not at all what I expected. There was a MySQL process that was looping like crazy and spamming the local /var/log/messages file. When I finally spotted that and removed the xinetd.d file that made it run, the system started to behave. I have no idea why port 9200 was tied up, but it was. Even with elasticsearch stopped, port 9200 did not let go.

Just to fill in some blanks...

This is what I had been seeing:
nls-config-fail.PNG
Which led me to look at the logs that I posted.
Then I had my UNIX Admin reboot both servers that were acting up. And that didn't fix it.

Since I just spent the last 4 weeks head-down on this stuff and got it working perfectly, I gotta say that this really had me worried.

But I'm good now and can breathe again. All just so my folks can see cool stuff like this - which is using my local in-house IP addresses, and fails over to external addresses if needed.
NLS-Example-Local-IP.PNG
Thanks for the assist. I did use your suggestion at the end and saw that port 9200 was open again.

I'll be transplanting that MySQL instance to other servers. I need to be able to sleep at night. Whew...

Steve B

Re: !! Please - need immediate help - 2 of 4 systems down

Posted: Wed Aug 07, 2019 1:26 pm
by cdienger
Glad to hear!