TCP/UDP ports stop responding

Posted: Mon Aug 08, 2016 7:26 am
by jspink
10 host cluster - behind an A10 LB with health monitoring on each port.

I left Friday afternoon with 600+ hosts reporting in to my cluster and everything was happy.
I can see a clear decline in logs starting around 5pm, until eventually only some very basic ports were accepting data. As of Monday morning, only about 40 hosts were reporting.
Health monitoring on my LB shows most of my ports not responding, and Nmap tests run directly against each host show those same ports not responding.
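For anyone wanting to repeat the per-host check without the LB in the path, here is a minimal sketch of a TCP connect test. The host names and port numbers in the usage comment are placeholders, not values from this cluster, and a plain connect test says nothing about UDP inputs.

```python
import socket

def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

def scan(hosts, ports, timeout=2.0):
    """Return {host: [ports that did not accept a TCP connection]}."""
    return {
        host: [p for p in ports if not tcp_port_open(host, p, timeout)]
        for host in hosts
    }

# Example usage (hypothetical node names and input ports, substitute your own):
# dead = scan(["nls-node01", "nls-node02"], [2056, 3515, 5544])
# for host, ports in dead.items():
#     print(host, "not responding on", ports or "none")
```

Running this from a machine on the same network as the nodes, and again after Apply Config, gives a quick before/after list of which inputs stopped listening.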

Heading to Admin Dashboard/Apply Config to all hosts resolves the issue, and auto-magically about 2 minutes later I get my servers starting to report in again, and of course my log counts skyrocket. Over the next hour or so, my reporting hosts return to normal levels.
Nagios_Logs.png
This has happened several times over the past couple of weeks, and only seems to have started once we added our 10th host to the cluster - we were sitting at 8 for a good while.

IPTables is turned completely off on each host for the time being while we troubleshoot.

See post https://support.nagios.com/forum/viewto ... 38&t=39556 for another issue with our 10 host cluster - wondering if there's just something with adding that 10th host that causes junk all around?

Any sort of logs/configs/screenshots I can provide to assist in troubleshooting?

Re: TCP/UDP ports stop responding

Posted: Mon Aug 08, 2016 9:42 am
by rkennedy
Can you post your logstash.log file for us to look at? /var/log/logstash/logstash.log

There should be some information as to why it's happening in there. It sounds like logstash is dying, though. Applying the configuration brings it back up, and then it dies once again.

Re: TCP/UDP ports stop responding

Posted: Mon Aug 08, 2016 9:53 am
by jspink
rkennedy wrote:Can you post your logstash.log file for us to look at? /var/log/logstash/logstash.log

There should be some information as to why it's happening in there. It sounds like logstash is dying, though. Applying the configuration brings it back up, and then it dies once again.
It's 700MB+. Do you have a file repository I can post to?

Re: TCP/UDP ports stop responding

Posted: Mon Aug 08, 2016 10:01 am
by jspink
Found one from earlier in the day that's a bit smaller (3MB).
Attached here:
logstash.log-20160808.txt

Re: TCP/UDP ports stop responding

Posted: Mon Aug 08, 2016 10:39 am
by rkennedy
A tail would work as well. It looks like it comes down to the sheer number of connections/logs coming in.

{:timestamp=>"2016-08-07T20:26:29.129000-0400", :message=>"An error occurred. Closing connection", :client=>"10.2.110.7:43330", :exception=>#<IOError: Too many open files>, :backtrace=>["org/jruby/RubyIO.java:2996:in `sysread'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:164:in `read'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:112:in `handle_socket'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:147:in `client_thread'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:145:in `client_thread'"], :level=>:error}
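To see when the errors begin and which senders are hit, the log can be tallied with something like the sketch below. It assumes every error line has the same shape as the excerpt above (a `:timestamp=>"..."` field and a `:client=>"ip:port"` field); lines in other formats are simply skipped.

```python
import re
from collections import Counter

TS_RE = re.compile(r':timestamp=>"([^"]+)"')
CLIENT_RE = re.compile(r':client=>"([^":]+):\d+"')
NEEDLE = "Too many open files"

def tally_fd_errors(lines):
    """Count 'Too many open files' errors per hour and per client IP."""
    per_hour, per_client = Counter(), Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        ts = TS_RE.search(line)
        client = CLIENT_RE.search(line)
        if ts:
            # Keep "YYYY-MM-DDTHH" so errors are bucketed by hour.
            per_hour[ts.group(1)[:13]] += 1
        if client:
            per_client[client.group(1)] += 1
    return per_hour, per_client

# Example usage against the log path mentioned earlier in the thread:
# with open("/var/log/logstash/logstash.log", errors="replace") as fh:
#     hours, clients = tally_fd_errors(fh)
# for hour, n in sorted(hours.items()):
#     print(hour, n)
```

Reading the file line by line keeps memory use flat, which matters for a 700MB+ log.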
Please modify /etc/init.d/logstash and change LS_OPEN_FILES=16384 to LS_OPEN_FILES=65536. Then, run an Apply Configuration once again.

Let it run for a minute or so, and then I'd like to see everything that is open. Could you post the full output of lsof? (You might need to attach it.)
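As a lighter-weight complement to a full lsof dump, the sketch below compares how many descriptors a process holds with the limit it is actually running under, which also confirms whether the LS_OPEN_FILES change took effect. It is Linux-only, relies on /proc, usually needs root to inspect another user's process, and the PID has to be supplied by you (for example from `pgrep -f logstash`).

```python
import os

def count_open_fds(pid):
    """Number of file descriptors a process currently holds (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    A value of None means the limit is reported as 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: "Max open files  <soft>  <hard>  files"
            soft, hard = line.split()[3:5]
            to_int = lambda v: None if v == "unlimited" else int(v)
            return to_int(soft), to_int(hard)
    raise ValueError("no 'Max open files' row found")

def fd_usage(pid):
    """Return (open_fds, soft_limit, hard_limit) for a running process."""
    with open(f"/proc/{pid}/limits") as fh:
        soft, hard = parse_max_open_files(fh.read())
    return count_open_fds(pid), soft, hard

# Example usage (the PID is a placeholder):
# used, soft, hard = fd_usage(12345)
# print(f"{used} descriptors open, soft limit {soft}, hard limit {hard}")
```

Sampling this every few minutes would show whether the count climbs steadily toward the limit or only spikes under load.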

Re: TCP/UDP ports stop responding

Posted: Mon Aug 08, 2016 10:41 am
by jspink
rkennedy wrote: Please modify /etc/init.d/logstash and change LS_OPEN_FILES=16384 to LS_OPEN_FILES=65536. Then, run an Apply Configuration once again.

Let it run for a minute or so, and then I'd like to see everything that is open. Could you post the full output of lsof? (You might need to attach it.)
I'm assuming that change will need to be made on all 10 nodes?

Re: TCP/UDP ports stop responding

Posted: Mon Aug 08, 2016 10:51 am
by rkennedy
jspink wrote:
rkennedy wrote: Please modify /etc/init.d/logstash and change LS_OPEN_FILES=16384 to LS_OPEN_FILES=65536. Then, run an Apply Configuration once again.

Let it run for a minute or so, and then I'd like to see everything that is open. Could you post the full output of lsof? (You might need to attach it.)
I'm assuming that change will need to be made on all 10 nodes?
Yeah, eventually. I'd like to see what is keeping things open though. Increasing this number might just be a bandaid for the real culprit.

Re: TCP/UDP ports stop responding

Posted: Mon Aug 08, 2016 11:13 am
by jspink
Requested changes made and Apply Config to all hosts done.

LSOF output attached:
output.txt

Re: TCP/UDP ports stop responding

Posted: Mon Aug 08, 2016 1:16 pm
by rkennedy
Weird, this isn't nearly the number I was expecting. Since adjusting the LS_OPEN_FILES variable, have things started working properly? Could you run the same command as before and attach the output again?

I'd like to compare the two outputs, to see whether the number of open files has increased drastically.
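One rough way to make that comparison is to count rows per command in each lsof capture and look at the difference. The sketch assumes default lsof output, where the first whitespace-separated field of each row is COMMAND; some lsof versions also list one row per thread, so treat the numbers as relative rather than exact. Logstash runs on the JVM, so it will most likely appear as `java`.

```python
from collections import Counter

def count_by_command(lsof_lines):
    """Count lsof rows per COMMAND (the first column), skipping the header."""
    counts = Counter()
    for line in lsof_lines:
        fields = line.split()
        if not fields or fields[0] == "COMMAND":
            continue
        counts[fields[0]] += 1
    return counts

def growth(before, after):
    """Return (command, change) pairs for commands whose row count changed,
    largest increase first."""
    delta = {cmd: after[cmd] - before[cmd] for cmd in set(before) | set(after)}
    changed = ((c, d) for c, d in delta.items() if d != 0)
    return sorted(changed, key=lambda cd: (-cd[1], cd[0]))

# Example usage with two captures saved to disk (file names are placeholders):
# with open("output.txt") as a, open("output2.txt") as b:
#     for cmd, change in growth(count_by_command(a), count_by_command(b)):
#         print(f"{cmd:15} {change:+d}")
```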

Re: TCP/UDP ports stop responding

Posted: Mon Aug 08, 2016 1:23 pm
by jspink
rkennedy wrote:Weird, this isn't nearly the number I was expecting. Since adjusting the LS_OPEN_FILES variable, have things started working properly? Could you run the same command as before and attach the output again?

I'd like to compare the two outputs, to see whether the number of open files has increased drastically.

It typically takes anywhere from a few hours to a couple days to see things start dropping off - so far, no issues other than some previously saved global dashboards not loading properly, but I don't believe that to be related.

Fresh output from lsof here:
output2.txt