!! Please - need immediate help - 2 of 4 systems down

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Locked
SteveBeauchemin
Posts: 524
Joined: Mon Oct 14, 2013 7:19 pm

!! Please - need immediate help - 2 of 4 systems down

Post by SteveBeauchemin »

My system is not running on 2 of my 4 hosts. I am unable to get it running.

I could use some real help. I will even open a ticket to get some hands-on help.

Please advise.

Steve B

The following log output is from right after a reboot:

Code:

[2019-08-07 11:46:00,913][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] version[1.7.6], pid[2794], build[c730b59/2016-11-18T15:21:16Z]
[2019-08-07 11:46:00,917][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] initializing ...
[2019-08-07 11:46:01,035][INFO ][plugins                  ] [4c87ddc4-146d-45de-9730-5b229ba4b096] loaded [knapsack-1.7.3.0-d0ea246], sites []
[2019-08-07 11:46:01,073][INFO ][env                      ] [4c87ddc4-146d-45de-9730-5b229ba4b096] using [1] data paths, mounts [[/usr/local (/dev/mapper/VolGroup10-LogVol00)]], net usable_space [690.8gb], net total_space [799.6gb], types [xfs]
[2019-08-07 11:46:04,602][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] initialized
[2019-08-07 11:46:04,602][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] starting ...
[2019-08-07 11:46:04,931][INFO ][transport                ] [4c87ddc4-146d-45de-9730-5b229ba4b096] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/131.198.86.106:9300]}
[2019-08-07 11:46:04,962][INFO ][discovery                ] [4c87ddc4-146d-45de-9730-5b229ba4b096] 79e8bf76-674f-4ecd-8741-27a3587a3f39/24Xf7lY3Q_GUPWciyNmLVQ
[2019-08-07 11:46:05,122][WARN ][transport.netty          ] [4c87ddc4-146d-45de-9730-5b229ba4b096] exception caught on transport layer [[id: 0x7f2389dd]], closing connection
java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:101)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
It then just repeats the following over and over:

Code:

[2019-08-07 11:46:05,176][WARN ][transport.netty          ] [4c87ddc4-146d-45de-9730-5b229ba4b096] exception caught on transport layer [[id: 0xc3b78af9]], closing connection
java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:101)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
        at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:574)
        at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:634)
        at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:216)
        at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:229)
        at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:182)
        at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:787)
        at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:754)
        at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:726)
        at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:220)
        at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:373)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
And finally...

Code:

[2019-08-07 11:46:08,107][INFO ][cluster.service          ] [4c87ddc4-146d-45de-9730-5b229ba4b096] detected_master [e0985c3d-398c-44bb-8ad7-0177911c5a6a][F27LyfzzS5CfdD_fZkpyEg][crulnls02.rockwellcollins.com][inet[/131.198.86.107:9300]]{max_local_storage_nodes=1}, added {[e0985c3d-398c-44bb-8ad7-0177911c5a6a][F27LyfzzS5CfdD_fZkpyEg][crulnls02.rockwellcollins.com][inet[/131.198.86.107:9300]]{max_local_storage_nodes=1},[bc61f6af-6e43-449d-9fca-16658531110b][bwjIbP_bQoGCKpSnzVSf7Q][dtulnls01.rockwellcollins.com][inet[/10.55.131.46:9300]]{max_local_storage_nodes=1},[d7d08025-52f9-44ca-af64-0beca7c2f116][HCkcKcaeTUuFhHgvjg93XQ][ciulnls01.rockwellcollins.com][inet[/10.54.131.32:9300]]{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [[e0985c3d-398c-44bb-8ad7-0177911c5a6a][F27LyfzzS5CfdD_fZkpyEg][crulnls02.rockwellcollins.com][inet[/131.198.86.107:9300]]{max_local_storage_nodes=1}])
[2019-08-07 11:46:08,343][ERROR][bootstrap                ] [4c87ddc4-146d-45de-9730-5b229ba4b096] Exception
org.elasticsearch.http.BindHttpException: Failed to bind to [9200]
        at org.elasticsearch.http.netty.NettyHttpServerTransport.doStart(NettyHttpServerTransport.java:269)
        at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
        at org.elasticsearch.http.HttpServer.doStart(HttpServer.java:89)
        at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
        at org.elasticsearch.node.internal.InternalNode.start(InternalNode.java:274)
        at org.elasticsearch.bootstrap.Bootstrap.start(Bootstrap.java:160)
        at org.elasticsearch.bootstrap.Bootstrap.main(Bootstrap.java:248)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:32)
Caused by: org.elasticsearch.common.netty.channel.ChannelException: Failed to bind to: localhost/127.0.0.1:9200
        at org.elasticsearch.common.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
        at org.elasticsearch.http.netty.NettyHttpServerTransport$1.onPortNumber(NettyHttpServerTransport.java:260)
        at org.elasticsearch.common.transport.PortsRange.iterate(PortsRange.java:58)
        at org.elasticsearch.http.netty.NettyHttpServerTransport.doStart(NettyHttpServerTransport.java:256)
        ... 7 more
Caused by: java.net.BindException: Address already in use
        at sun.nio.ch.Net.bind0(Native Method)
        at sun.nio.ch.Net.bind(Net.java:433)
        at sun.nio.ch.Net.bind(Net.java:425)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[2019-08-07 11:46:08,354][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] stopping ...
[2019-08-07 11:46:08,425][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] stopped
[2019-08-07 11:46:08,425][INFO ][node                     ] [4c87ddc4-146d-45de-9730-5b229ba4b096] closing ...
[2019-08-07 11:46:08,450][WARN ][netty.channel.DefaultChannelPipeline] An exception was thrown by an exception handler.
java.util.concurrent.RejectedExecutionException: Worker has already been shutdown
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.registerTask(AbstractNioSelector.java:120)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:72)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:56)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.execute(DefaultChannelPipeline.java:636)
        at org.elasticsearch.common.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:496)
        at org.elasticsearch.common.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:46)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.notifyHandlerException(DefaultChannelPipeline.java:658)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:577)
        at org.elasticsearch.common.netty.channel.Channels.write(Channels.java:704)
        at org.elasticsearch.common.netty.channel.Channels.write(Channels.java:671)
        at org.elasticsearch.common.netty.channel.AbstractChannel.write(AbstractChannel.java:348)
        at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:105)
        at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:76)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:292)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:283)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[2019-08-07 11:46:08,453][WARN ][netty.channel.DefaultChannelPipeline] An exception was thrown by an exception handler.
java.util.concurrent.RejectedExecutionException: Worker has already been shutdown
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.registerTask(AbstractNioSelector.java:120)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:72)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.executeInIoThread(AbstractNioWorker.java:56)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.executeInIoThread(NioWorker.java:36)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioChannelSink.execute(AbstractNioChannelSink.java:34)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.execute(DefaultChannelPipeline.java:636)
        at org.elasticsearch.common.netty.channel.Channels.fireExceptionCaughtLater(Channels.java:496)
Last edited by SteveBeauchemin on Wed Aug 07, 2019 1:11 pm, edited 1 time in total.
XI 5.7.3 / Core 4.4.6 / NagVis 1.9.8 / LiveStatus 1.5.0p11 / RRDCached 1.7.0 / Redis 3.2.8 /
SNMPTT / Gearman 0.33-7 / Mod_Gearman 3.0.7 / NLS 2.0.8 / NNA 2.3.1 /
NSClient 0.5.0 / NRPE Solaris 3.2.1 Linux 3.2.1 HPUX 3.2.1
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: !! Please - need immediate help - 2 of 4 systems down

Post by cdienger »

Something appears to be bound to port 9200. On both machines, run the following and make sure Elasticsearch is down before starting it again:

Code:

service elasticsearch stop
netstat -nap | grep 9200
If anything is still using port 9200 after stopping the service, the netstat command will show what it is, and it can be stopped with "kill <PID>" from the command line.
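As a side note, the PID in that netstat output can also be pulled out programmatically rather than read by eye. A minimal sketch (the sample netstat line below is made up for illustration, not taken from these systems):

```shell
# Extract the PID from the last "PID/Program name" column of a
# "netstat -nap" line whose local address is bound to port 9200.
extract_pid() {
    awk '$4 ~ /:9200$/ {split($NF, a, "/"); print a[1]; exit}'
}

# Illustrative sample line; on a real system you would pipe in
# the output of: netstat -nap
sample="tcp        0      0 127.0.0.1:9200          0.0.0.0:*               LISTEN      2794/java"
pid=$(printf '%s\n' "$sample" | extract_pid)
echo "$pid"
# The PID printed here is what you would then pass to: kill "$pid"
```

The same idea works for any port by changing the pattern in the awk expression.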

If the problem persists, please provide copies of /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml along with system profiles from the affected systems. Profiles can be gathered under Admin > System > System Status > Download System Profile, or from the command line with:

/usr/local/nagioslogserver/scripts/profile.sh

This will create /tmp/system-profile.tar.gz.

Note that this file can be very large and may not upload through the ticketing system. That is usually due to the logs in the Logstash and/or Elasticsearch directories inside it. If it is too large, please open the profile, extract those directories/files, and send them separately.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
SteveBeauchemin
Posts: 524
Joined: Mon Oct 14, 2013 7:19 pm

Re: !! Please - need immediate help - 2 of 4 systems down

Post by SteveBeauchemin »

It is okay now - but not at all what I expected. There was a MySQL process that was looping like crazy and spamming the local /var/log/messages file. When I finally spotted that and removed the xinetd.d file that made it run, the system started to behave. I have no idea why port 9200 was tied up, but it was. With Elasticsearch stopped, port 9200 did not let go.

Just to fill in some blanks...

This is what I had been seeing:
nls-config-fail.PNG
which led me to look at the logs that I posted.
Then I had my UNIX admin reboot both servers that were acting up, and that didn't fix it.

Since I just spent the last 4 weeks head-down on this stuff and got it working perfectly, I gotta say that this really had me worried.

But I'm good now and can breathe again. All just so my folks can see cool stuff like this, which uses my local in-house IP addresses and fails over to external addresses if needed.
NLS-Example-Local-IP.PNG
Thanks for the assist. I did use your suggestion at the end and saw that port 9200 had become free.

I'll be transplanting that MySQL instance to other servers. I need to be able to sleep at night. Whew...

Steve B
XI 5.7.3 / Core 4.4.6 / NagVis 1.9.8 / LiveStatus 1.5.0p11 / RRDCached 1.7.0 / Redis 3.2.8 /
SNMPTT / Gearman 0.33-7 / Mod_Gearman 3.0.7 / NLS 2.0.8 / NNA 2.3.1 /
NSClient 0.5.0 / NRPE Solaris 3.2.1 Linux 3.2.1 HPUX 3.2.1
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: !! Please - need immediate help - 2 of 4 systems down

Post by cdienger »

Glad to hear!
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.