Nagios Support Forum

Posted: **Mon Aug 08, 2016 7:26 am**

10 host cluster - behind an A10 LB with health monitoring on each port.

I left Friday afternoon with 600+ hosts reporting in to my cluster and everything was happy.
I can see a clear decline in logs starting around 5pm, until eventually only some very basic ports were accepting data. As of Monday morning, only about 40 hosts were reporting.
Health monitoring on my LB shows most of my ports not responding, and Nmap tests run directly against each host show the same ports not responding.
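
A per-host check like the Nmap test described here can also be scripted. The following is a minimal sketch, not anything taken from the Nagios documentation: the function names are made up for illustration, the address and ports in the usage comment are placeholders, and it only covers TCP (a UDP input can't be verified with a plain connect).

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

def closed_ports(host, ports, timeout=3.0):
    """Return the subset of ports on host that refuse or time out."""
    return [port for port in ports if not tcp_port_open(host, port, timeout)]

# Example (placeholder address and ports, adjust to your own inputs):
# print(closed_ports("192.0.2.10", [5544, 2056]))
```

Running this against each cluster member directly, bypassing the load balancer, should show whether the node itself or the LB is dropping the port.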

Heading to Admin Dashboard/Apply Config to all hosts resolves the issue, and auto-magically about 2 minutes later I get my servers starting to report in again, and of course my log counts skyrocket. Over the next hour or so, my reporting hosts return to normal levels.

Nagios_Logs.png

This has happened several times over the past couple of weeks, and only seems to have started once we added our 10th host to the cluster - we were sitting at 8 for a good while.

IPTables is turned completely off on each host for the time being while we troubleshoot.

See post https://support.nagios.com/forum/viewto ... 38&t=39556 for another issue with our 10 host cluster - wondering if there's just something with adding that 10th host that causes junk all around?

Any sort of logs/configs/screenshots I can provide to assist in troubleshooting?

Posted: **Mon Aug 08, 2016 9:42 am**

Can you post your logstash.log file for us to look at? /var/log/logstash/logstash.log

There should be some information in there as to why it's happening. It sounds like logstash is dying though. When you apply configuration, it brings it back up, and then it dies once again.

Posted: **Mon Aug 08, 2016 9:53 am**

rkennedy wrote: Can you post your logstash.log file for us to look at? /var/log/logstash/logstash.log

There should be some information in there as to why it's happening. It sounds like logstash is dying though. When you apply configuration, it brings it back up, and then it dies once again.

It's 700MB+ - do you have a file repository I can post to?

Posted: **Mon Aug 08, 2016 10:01 am**

Found one from earlier in the day that's a bit smaller (3MB).
Attached here:

logstash.log-20160808.txt

Posted: **Mon Aug 08, 2016 10:39 am**

A tail would work as well. It looks like it's coming down to the sheer number of connections / logs coming in.

```
{:timestamp=>"2016-08-07T20:26:29.129000-0400", :message=>"An error occurred. Closing connection", :client=>"10.2.110.7:43330", :exception=>#<IOError: Too many open files>, :backtrace=>["org/jruby/RubyIO.java:2996:in `sysread'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:164:in `read'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:112:in `handle_socket'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:147:in `client_thread'", "/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:145:in `client_thread'"], :level=>:error}
```
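
To see when these errors started and how quickly they pile up, the log can be tallied by hour. This is a minimal sketch, assuming every entry carries a `:timestamp=>"..."` field in the same shape as the line above; the function name is hypothetical and the path in the usage comment is the one mentioned earlier in this thread.

```python
import re
from collections import Counter

# Captures date plus hour from a timestamp such as
# :timestamp=>"2016-08-07T20:26:29.129000-0400"
TIMESTAMP_RE = re.compile(r':timestamp=>"(\d{4}-\d{2}-\d{2}T\d{2})')

def count_fd_errors_by_hour(lines):
    """Count 'Too many open files' log entries, keyed by 'YYYY-MM-DDTHH'."""
    counts = Counter()
    for line in lines:
        if "Too many open files" not in line:
            continue
        match = TIMESTAMP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Example usage (reads the file line by line, so a 700MB log is fine):
# with open("/var/log/logstash/logstash.log", errors="replace") as handle:
#     for hour, count in sorted(count_fd_errors_by_hour(handle).items()):
#         print(hour, count)
```

If the first errors line up with the time the ports stopped answering, that supports the file descriptor limit being the trigger rather than a side effect.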

Please modify /etc/init.d/logstash and change LS_OPEN_FILES=16384 to LS_OPEN_FILES=65536. Then, run an Apply Configuration once again.

Let it run for a minute or so, and then I'd like to see what all is open. Could you post the full output of lsof? (You might need to attach it.)
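
To confirm the new LS_OPEN_FILES value actually reached the running process, the kernel's view can be read from /proc. This is a rough sketch, assuming a Linux host and that you already know the logstash PID (for example from `pgrep -f logstash`, which assumes the process command line contains "logstash"). The function names are made up for illustration.

```python
import os

def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    Each value is an int, or None when the kernel reports "unlimited".
    Raises ValueError if the "Max open files" row is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")

def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for a running process on Linux."""
    with open(f"/proc/{pid}/limits") as handle:
        soft, _hard = parse_max_open_files(handle.read())
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return open_fds, soft
```

Reading another user's /proc/<pid>/fd needs root or the same user as the process. Sampling `fd_usage` every few minutes would show whether the count creeps toward the limit over time.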

Posted: **Mon Aug 08, 2016 10:41 am**

rkennedy wrote: Please modify /etc/init.d/logstash and change LS_OPEN_FILES=16384 to LS_OPEN_FILES=65536. Then, run an Apply Configuration once again.

Let it run for a minute or so, and then I'd like to see what all is open. Could you post the full output of lsof? (You might need to attach it.)

I'm assuming that change will need to be made on all 10 nodes?

Posted: **Mon Aug 08, 2016 10:51 am**

jspink wrote:
rkennedy wrote: Please modify /etc/init.d/logstash and change LS_OPEN_FILES=16384 to LS_OPEN_FILES=65536. Then, run an Apply Configuration once again.

Let it run for a minute or so, and then I'd like to see what all is open. Could you post the full output of lsof? (You might need to attach it.)
I'm assuming that change will need to be made on all 10 nodes?

Yeah, eventually. I'd like to see what is keeping things open though. Increasing this number might just be a bandaid for the real culprit.

Posted: **Mon Aug 08, 2016 11:13 am**

Requested changes made and apply all done.

LSOF output attached:

output.txt

Posted: **Mon Aug 08, 2016 1:16 pm**

Weird, this isn't nearly the number I was expecting. After adjusting the LS_OPEN_FILES variable, have things started working properly? Could you run the same command as before and attach the output again?

I'd like to compare the two outputs, to see whether the number of open files has increased drastically.
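
For comparing two lsof captures, a quick per-command tally can help. This is a minimal sketch, assuming default lsof output where the first column is COMMAND; the function names are hypothetical. It counts output lines rather than unique descriptors, so if lsof lists per-thread entries the totals will be inflated and should be treated as a trend indicator only.

```python
from collections import Counter

def count_lsof_lines_by_command(lines):
    """Count lsof output lines per COMMAND (first column), skipping the header."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] == "COMMAND":
            continue
        counts[fields[0]] += 1
    return counts

def diff_counts(before, after):
    """Return {command: after - before} for commands whose count changed."""
    changes = {}
    for command in set(before) | set(after):
        delta = after[command] - before[command]  # Counter yields 0 if missing
        if delta != 0:
            changes[command] = delta
    return changes

# Example usage with the two attachments from this thread:
# with open("output.txt") as f1, open("output2.txt") as f2:
#     print(diff_counts(count_lsof_lines_by_command(f1),
#                       count_lsof_lines_by_command(f2)))
```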

Posted: **Mon Aug 08, 2016 1:23 pm**

rkennedy wrote: Weird, this isn't nearly the number I was expecting. After adjusting the LS_OPEN_FILES variable, have things started working properly? Could you run the same command as before and attach the output again?

I'd like to compare the two outputs, to see whether the number of open files has increased drastically.

It typically takes anywhere from a few hours to a couple days to see things start dropping off - so far, no issues other than some previously saved global dashboards not loading properly, but I don't believe that to be related.

Fresh output from lsof here:

output2.txt

TCP/UDP ports stop responding