Re: TCP/UDP ports stop responding
Posted: Thu Aug 11, 2016 4:39 pm
by mcapra
Just a quick update on this: there definitely seems to be some weirdness happening with 10 nodes. My lsof counts are also incredibly high in a 10-node environment, even when the number of logs being sent is relatively low. We'll continue to monitor this and provide updates.
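For anyone following along, one rough way to track that number over time without running lsof is to count the entries under /proc/<pid>/fd. This is a minimal sketch that assumes a Linux host and enough privileges to read the Logstash process's /proc entry. The function name is hypothetical, and its counts will come out lower than lsof's, because lsof also lists memory-mapped files, the working directory and similar entries.

```python
import os
from typing import Optional


def count_open_fds(pid: Optional[int] = None) -> int:
    """Count the open file descriptors of a process (Linux only).

    Hypothetical helper: it lists /proc/<pid>/fd, which requires root or the
    same user as the target process. Unlike lsof, it does not include
    memory-mapped files, cwd, etc., so totals are lower than lsof output.
    """
    # Default to the current process so the helper can be tried safely.
    target = os.getpid() if pid is None else pid
    return len(os.listdir(f"/proc/{target}/fd"))
```

Sampling `count_open_fds(<logstash pid>)` every few minutes would show whether the count keeps climbing even while log volume stays low.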
Re: TCP/UDP ports stop responding
Posted: Sun Aug 14, 2016 10:34 am
by jspink
Dropped to 9 nodes. Sometime between 4 pm Friday and 11 am Sunday, 6 of the remaining nodes stopped responding altogether, so I had 3 nodes attempting to take in logs for everything.
I had to reboot all nodes, so lsof isn't going to be helpful. And since I have a ton of servers trying to catch up on 2 days of logs, I doubt a tail of the logstash log will help much either, but I wanted to let you know it happened.
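Because the symptom is listener ports that stop accepting connections, an external probe could flag a stuck node before the others have to absorb its load for days. This is a minimal sketch that assumes it runs from a separate machine against each node's TCP input. The helper names, host names and port are hypothetical placeholders. It only covers TCP: UDP is connectionless, so a silent UDP input can't be detected this way. Also, a successful handshake only shows that the kernel accepted the connection, not that Logstash is actually reading from it. A hung listener typically shows up only once its accept backlog fills and new connections time out.

```python
import socket
from typing import Dict, Iterable


def tcp_port_responds(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within `timeout` seconds."""
    try:
        # create_connection raises OSError (this covers refused and
        # timed-out connections) if the handshake does not complete.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_nodes(hosts: Iterable[str], port: int, timeout: float = 3.0) -> Dict[str, bool]:
    """Probe the same input port on every node; map host -> responded."""
    return {host: tcp_port_responds(host, port, timeout) for host in hosts}
```

For example, `check_nodes(["logserver01", "logserver02"], 5544)` (hypothetical hosts and port) returns one boolean per node, and any `False` entry marks a node whose input has stopped accepting connections.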
Re: TCP/UDP ports stop responding
Posted: Mon Aug 15, 2016 12:33 pm
by mcapra
This looks like an issue with Logstash that traces back a few months:
https://github.com/elastic/logstash/issues/4815
https://github.com/elastic/logstash/issues/4225
Others have remedied the problem by scheduling the logstash service to restart on a regular interval (via cron). You could give that a shot, though I realize this is less than ideal. From what I gather, logstash is at times not properly closing connections which creates a sort of blockage on the back-end.
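The thread doesn't give the exact cron entry people used. As a minimal sketch, assuming a SysV-style `service logstash restart` (or `systemctl restart logstash` on systemd hosts), a small wrapper for cron to invoke might look like this. The helper names, script path and schedule are hypothetical.

```python
# Hypothetical crontab entry (path and schedule are placeholders), where the
# script simply ends with: sys.exit(restart_service("logstash"))
#   0 */6 * * * /usr/bin/python3 /usr/local/bin/restart_logstash.py
import subprocess
from typing import Callable, List


def build_restart_command(service: str, use_systemd: bool = False) -> List[str]:
    """Build the restart command; which init system the host uses is an assumption."""
    if use_systemd:
        return ["systemctl", "restart", service]
    return ["service", service, "restart"]


def restart_service(
    service: str,
    use_systemd: bool = False,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Restart `service` once and return the command's exit status.

    `runner` defaults to subprocess.run; it is injectable so the logic can be
    exercised without restarting a real service.
    """
    command = build_restart_command(service, use_systemd)
    result = runner(command, check=False)
    return result.returncode
```

If every node gets the same entry, it may be worth staggering the minute per node so the whole cluster doesn't drop its inputs at the same moment.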
I have filed an internal bug report for this issue (ID 9305).
Re: TCP/UDP ports stop responding
Posted: Mon Aug 15, 2016 2:19 pm
by jspink
Cron jobs are set. We had discussed doing this internally but wanted to wait for your findings.
Now that the bug report has been entered, will your devs continue to look for a solution, or, if the cron jobs resolve the issue, will it be left as is?
Re: TCP/UDP ports stop responding
Posted: Mon Aug 15, 2016 2:33 pm
by mcapra
Since a bug report was submitted, the developers will address it one way or another.
Let us know how the cron jobs handle this. If restarting logstash regularly does solve the problem, that will be useful information when working out a proper fix at the logstash level.
Re: TCP/UDP ports stop responding
Posted: Fri Sep 16, 2016 12:02 pm
by jspink
Just looking for a possible status update on this.
Scheduled reboots do seem to be helping, but we would like to get back to our 10-instance cluster.
Re: TCP/UDP ports stop responding
Posted: Fri Sep 16, 2016 12:15 pm
by mcapra
It doesn't look like either GitHub issue has received any updates. You could try bringing the 10th node back up with the same scheduled restarts as the other nodes, but unfortunately we don't have a very good environment to test that sort of thing, so I can't promise that adding the 10th won't affect stability. My opinion is that the issue isn't specific to 10+ nodes, but I haven't spent much time diving into the logstash back-end.
Re: TCP/UDP ports stop responding
Posted: Fri Sep 16, 2016 12:17 pm
by jspink
OK, thanks for the quick response.
I'll wait for some work on this topic (https://support.nagios.com/forum/viewto ... 38&t=40282) before re-adding the 10th node.
Re: TCP/UDP ports stop responding
Posted: Fri Sep 16, 2016 12:24 pm
by mcapra
Alrighty, will continue correspondence there