TCP/UDP ports stop responding

Post by **mcapra** » Thu Aug 11, 2016 4:39 pm

Just a quick update on this, definitely looks like there's some weirdness happening with 10 nodes. My lsof counts are also incredibly high in a 10 node environment even if the number of logs being sent is relatively low. We'll continue to monitor this and provide updates.
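For anyone wanting to track this between updates, here is a minimal sketch (not anything shipped with the product) that counts a process's open file descriptors by listing `/proc/<pid>/fd` on Linux. The function name and the `proc_root` parameter are made up for illustration. The result is only a rough proxy for an lsof count, since lsof also lists things like memory-mapped files and the working directory.

```python
import os

def count_open_fds(pid, proc_root="/proc"):
    """Return the number of open file descriptors for a process (Linux only).

    proc_root is a hypothetical parameter so the function can be pointed at a
    test directory; on a real system leave it at "/proc".
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    # Each entry in /proc/<pid>/fd is one open descriptor (file, socket, pipe).
    return len(os.listdir(fd_dir))
```

Calling `count_open_fds(<logstash pid>)` periodically and logging the result should show whether the count keeps climbing while log volume stays low. Reading another user's `/proc/<pid>/fd` normally requires root or the same user as the process.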

Post by **jspink** » Sun Aug 14, 2016 10:34 am

mcapra wrote:Just a quick update on this, definitely looks like there's some weirdness happening with 10 nodes. My lsof counts are also incredibly high in a 10 node environment even if the number of logs being sent is relatively low. We'll continue to monitor this and provide updates.

Dropped to 9 nodes - sometime between 4pm Friday and 11am Sunday, 6 of the remaining nodes stopped responding altogether, so I had 3 nodes attempting to take in logs for everything.
Had to reboot all nodes, so lsof isn't going to be helpful, and since I have a ton of servers trying to catch up on 2 days of logs, I doubt a tail of the logstash log will help much either, but wanted to let you know it happened.
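To catch this sooner than a weekend later, a simple TCP probe against each node's input port can flag nodes that have stopped accepting connections. This is a sketch only: the function names are hypothetical, and the host list and port have to come from your own inputs. It cannot test UDP inputs, and a successful connect only shows the listener accepted the connection, not that logstash is actually processing what it receives.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def unresponsive_nodes(nodes, port, timeout=3.0):
    """Return the nodes whose TCP port does not accept connections."""
    return [host for host in nodes if not tcp_port_open(host, port, timeout)]
```

Running `unresponsive_nodes([...your node addresses...], <input port>)` from a monitoring host would return the nodes that need attention.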

Post by **mcapra** » Mon Aug 15, 2016 12:33 pm

This looks like an issue with Logstash that traces back a few months:
https://github.com/elastic/logstash/issues/4815
https://github.com/elastic/logstash/issues/4225

Others have remedied the problem by scheduling the logstash service to restart on a regular interval (via cron). You could give that a shot, though I realize this is less than ideal. From what I gather, logstash is at times not properly closing connections which creates a sort of blockage on the back-end.

I have filed an internal bug report for this issue (ID 9305).
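If a blind restart on a timer feels too blunt, a cron-driven check could restart logstash only when connections appear to be piling up. The sketch below is an illustration under several assumptions, not a tested fix: it assumes Linux, assumes sockets stuck in CLOSE_WAIT are a reasonable signal of connections not being closed, assumes the service is named `logstash` and is managed with the `service` command, and uses a made-up threshold of 500. All function names are hypothetical.

```python
import subprocess
from collections import Counter

# Hex codes from the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}

def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        code = fields[3].upper()
        # Unknown codes are counted under their raw hex value.
        counts[TCP_STATES.get(code, code)] += 1
    return counts

def needs_restart(counts, close_wait_limit=500):
    """Return True when CLOSE_WAIT sockets exceed the (assumed) limit."""
    return counts["CLOSE_WAIT"] > close_wait_limit

def restart_logstash(run=subprocess.run):
    """Restart logstash; assumes a service named "logstash" exists."""
    run(["service", "logstash", "restart"], check=True)
```

A cron job could read `/proc/net/tcp` (and `/proc/net/tcp6`, which has the same state column), pass the text to `count_tcp_states`, and call `restart_logstash()` when `needs_restart` returns True. Note that these files cover every socket on the host, not just logstash's, so the threshold would need tuning per environment.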

Post by **jspink** » Mon Aug 15, 2016 2:19 pm

mcapra wrote:This looks like an issue with Logstash that traces back a few months:
https://github.com/elastic/logstash/issues/4815
https://github.com/elastic/logstash/issues/4225

Others have remedied the problem by scheduling the logstash service to restart on a regular interval (via cron). You could give that a shot, though I realize this is less than ideal. From what I gather, logstash is at times not properly closing connections which creates a sort of blockage on the back-end.

I have filed an internal bug report for this issue (ID 9305).

Cron jobs are set - we had discussed doing this internally, but wanted to wait for your findings.

So with the bug report being entered, does this mean your devs will continue to look for a solution, or if the cron jobs resolve the issue, will it be left to stand?

Post by **mcapra** » Mon Aug 15, 2016 2:33 pm

Since a bug report was submitted, it's going to be addressed one way or another by the devs.

Let us know how the cron jobs handle this. If restarting logstash regularly solves the problem, that is useful information for applying a proper fix at the logstash level.

Post by **jspink** » Fri Sep 16, 2016 12:02 pm

Just looking for a possible status update on this.

Scheduled reboots do seem to be helping, but we would like to get back to our 10 instance cluster.

Post by **mcapra** » Fri Sep 16, 2016 12:15 pm

It doesn't look like either GitHub issue has received any updates. You could try bringing the 10th back up with the same scheduled restarts as the other nodes, but unfortunately we don't have a very good environment to test that sort of thing, so I can't promise adding the 10th will not affect stability. My opinion is that the issue isn't specific to 10+ nodes, but I haven't spent too much time diving into the logstash back-end.

Post by **jspink** » Fri Sep 16, 2016 12:17 pm

mcapra wrote:It doesn't look like either github issue has received any updates. You could try bringing the 10th back up with the same scheduled restarts as the other nodes, but we don't have a very good environment to test against that sort of thing unfortunately so I can't promise adding the 10th will not affect stability. My opinion is that the issue isn't specific to 10+ nodes, but I haven't spent too much time diving into the logstash back-end.

OK - thanks for the quick response.
I'll wait for some work on this topic (https://support.nagios.com/forum/viewto ... 38&t=40282) before re-adding the 10th.

Post by **mcapra** » Fri Sep 16, 2016 12:24 pm

Alrighty, will continue correspondence there.

Nagios Support Forum
